Large language models like ChatGPT are finding their way into medical clinics, patient forums, and doctor's offices at an accelerating pace. Patients use them to understand diagnoses. Physicians consult them for second opinions. Yet these powerful systems have a fundamental problem: they often give you different answers to the same medical question, sometimes wildly different ones.
A new study reveals something unexpected. The inconsistency may not be the AI's fault alone. How you phrase your question to the model matters enormously. The specific wording, structure, and framing of a prompt can swing accuracy rates from below 5 percent to above 60 percent on the same set of medical questions.
The finding points to a hidden frontier in medical AI: teaching people how to talk to these systems effectively. It's a skill that may become as important as knowing which questions to ask your doctor.
The Reliability Crisis
When researchers tested various large language models against established medical guidelines for osteoarthritis, they found something troubling. GPT-4, the most capable model available, gave responses aligned with professional guidelines only 39 to 63 percent of the time. Other models performed even worse. More concerning, when researchers asked the same question multiple times, the models often generated different answers.
This inconsistency haunts clinical applications. A patient might receive reassuring advice one day and alarming information the next. A physician using AI to help diagnose a condition can't rely on consistent clinical reasoning. The internal uncertainty of these systems makes them unpredictable, regardless of their apparent sophistication.
Previous research had documented this problem. Studies showed GPT-4 got complex diagnoses right only 39 percent of the time. Med-PaLM, a version of Google's language model fine-tuned for medicine, generated inappropriate or incorrect content in 18 percent of responses. Yet little research had examined what could reliably improve performance across different models and different types of medical questions.
The Prompt Engineering Hypothesis
Over the past few years, researchers in computer science discovered something counterintuitive. The same AI model can dramatically improve at solving problems simply by changing how you ask it to solve them. This field of "prompt engineering" has become crucial for getting useful answers from large language models.
Different prompt structures trigger different reasoning patterns in these models. Some prompts ask the model to explain its thinking step by step. Others ask it to simulate multiple experts discussing a problem. Still others simply state the problem directly. Each approach can yield different results.
No one had carefully tested whether these prompt engineering techniques actually worked in medicine. Did they help? Did they help equally well across different models? Which methods worked best for medical questions specifically?
The Experiment
Researchers selected osteoarthritis as a test case. It's one of the most common and disabling diseases globally, affecting millions of elderly people. The condition requires complex management including pain control, physical therapy, lifestyle changes, and sometimes surgery. Both patients and physicians frequently search online for relevant information, making it an ideal domain for testing AI reliability.
The research team extracted 34 clinical recommendations from the American Academy of Orthopaedic Surgeons osteoarthritis guidelines. Each recommendation had an assigned strength level (strong, moderate, limited, or consensus) based on the quality of the scientific evidence supporting it.
They tested nine different large language model versions, including multiple versions of GPT-4, GPT-3.5, and Google Bard. The researchers applied four different prompt structures to each model.
The simplest prompt, called input-output or IO prompting, stated the problem directly: "Consider this medical advice. Rate it using these criteria."
The more sophisticated prompts used chain of thought techniques. One version, zero-shot chain of thought, asked the model to "think step by step." Another, performed chain of thought, explicitly broke down the task into numbered steps.
The most complex approach, called reflection of thoughts or ROT prompting, instructed the model to imagine three medical experts working through the problem independently, then discussing their reasoning together and revising their answers based on discussion. This approach asked the model to simulate a process of expert deliberation and consensus building.
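To make the four structures concrete, here is a minimal sketch of how such prompts might be assembled. The template wording, the dictionary keys, and the build_prompt helper are all hypothetical; only the IO phrasing and "think step by step" echo the descriptions above, and the study's actual prompts may differ. The sketch also assumes the model is asked to rate strength of evidence.

```python
# Hypothetical templates: illustrative wording, not the study's actual prompts.
RATING_TASK = (
    "Consider this medical advice: {advice}\n"
    "Rate the strength of evidence as strong, moderate, limited, or consensus."
)

PROMPTS = {
    # Input-output: state the task and nothing else.
    "io": RATING_TASK,
    # Zero-shot chain of thought: add a single reasoning cue.
    "zero_shot_cot": RATING_TASK + "\nLet's think step by step.",
    # Performed chain of thought: spell the task out as numbered steps.
    "performed_cot": (
        RATING_TASK
        + "\nWork through these steps:\n"
        "1. Restate the advice in your own words.\n"
        "2. Recall the relevant evidence.\n"
        "3. Judge the quality of that evidence.\n"
        "4. Give a final rating."
    ),
    # Reflection of thoughts: simulate experts who rate, discuss, and revise.
    "rot": (
        RATING_TASK
        + "\nImagine three medical experts. Each rates the advice independently."
        " They then discuss their reasoning, revise their ratings,"
        " and report one agreed rating."
    ),
}


def build_prompt(style: str, advice: str) -> str:
    """Fill the chosen template with one piece of advice."""
    return PROMPTS[style].format(advice=advice)


print(build_prompt("zero_shot_cot", "Example advice text goes here."))
```

Keeping the task text identical across all four templates means any difference in the answers can be attributed to the prompt structure rather than to a change in the question.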
Each recommendation was posed five times with each of the four prompts to assess reliability. That works out to 34 recommendations times 4 prompts times 5 repetitions, or 680 responses per model, enough to test whether the same question would consistently produce the same answer.
What the Best Prompt Can Do
The results showed dramatic variability. GPT-4 accessed through a web interface with ROT prompting achieved a 62.9 percent consistency rate overall, meaning its answers matched the guideline in 62.9 percent of responses. For strong recommendations, ROT prompting achieved 77.5 percent consistency, matching or exceeding other approaches.
Other combinations performed far worse. Some fell below 5 percent consistency. The gap between the best- and worst-performing prompts for the same model sometimes exceeded 50 percentage points.
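A consistency rate of this kind can be thought of as the share of responses that agree with the guideline. The sketch below shows that calculation on invented toy data; the function name and the ratings are assumptions for illustration, and the study's exact scoring procedure may differ.

```python
def consistency_rate(model_ratings, guideline_ratings):
    """Percentage of responses whose rating equals the guideline's rating."""
    if not model_ratings:
        raise ValueError("At least one response is required.")
    if len(model_ratings) != len(guideline_ratings):
        raise ValueError("Each response needs a matching guideline rating.")
    matches = sum(m == g for m, g in zip(model_ratings, guideline_ratings))
    return 100.0 * matches / len(model_ratings)


# Invented toy data: five responses compared with five guideline ratings.
guideline = ["strong", "strong", "moderate", "limited", "consensus"]
model = ["strong", "moderate", "moderate", "limited", "strong"]
print(consistency_rate(model, guideline))  # 60.0
```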
The top 10 most consistent prompt and model combinations all included either GPT-4 or fine-tuned versions of GPT-3.5. ROT prompting consistently ranked as the top choice for GPT-4 models. For other models, different prompts worked better, suggesting no single approach works universally.
The research also revealed something surprising about fine-tuning, a technique where researchers train models on specific medical data. Fine-tuning improved GPT-3.5 performance when tested with the same IO prompt used in training, reaching 55.3 percent consistency. But when users tried different prompt structures with the fine-tuned model, performance actually deteriorated to as low as 22 percent.
This suggests fine-tuning alone is insufficient. The model learned to respond to a specific question format but didn't develop deeper understanding of the underlying medical concepts.
The Reliability Problem
Beyond consistency, the study measured reliability: whether the same model would generate the same response when asked the same question multiple times.
The results were sobering. Reliability varied wildly. Some model and prompt combinations showed nearly perfect reliability, producing identical answers across five repetitions. Others showed almost no reliability at all. Kappa values, a statistic where 1 means perfect agreement between repeated answers and 0 means agreement no better than chance, ranged from -0.002 to 0.984.
Only IO prompting with GPT-3.5 models set to temperature 0, a technical parameter that reduces randomness, achieved nearly perfect reliability. Most other combinations showed fair to moderate reliability at best.
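For readers curious how a kappa value is produced, the sketch below computes Fleiss' kappa, a common statistic when each item is rated the same number of times. This summary does not say which kappa variant the study used, so the choice of Fleiss' kappa here is an assumption, and the data are invented.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa for items that each received the same number of ratings.

    ratings_per_item: one list of category labels per item, for example the
    five answers a model gave to the same question.
    """
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    if n_raters < 2:
        raise ValueError("Each item needs at least two ratings.")
    if any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("Every item needs the same number of ratings.")

    counts = [Counter(r) for r in ratings_per_item]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing rating pairs, averaged over items.
    p_obs = sum(
        (sum(c * c for c in item.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for item in counts
    ) / n_items

    # Chance agreement, from the overall share of each category.
    total = n_items * n_raters
    p_exp = sum(
        (sum(item[cat] for item in counts) / total) ** 2 for cat in categories
    )

    if p_exp == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_obs - p_exp) / (1 - p_exp)


# Invented data: two questions, five repeated answers each.
stable = [["strong"] * 5, ["limited"] * 5]
unstable = [["strong"] * 3 + ["moderate"] * 2, ["strong"] * 3 + ["moderate"] * 2]
print(fleiss_kappa(stable))              # 1.0
print(round(fleiss_kappa(unstable), 3))  # -0.25
```

In the second example the answers to each question are split the same way, so the repetitions agree less often than chance alone would predict and kappa drops below zero.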
Temperature settings proved crucial. Temperature controls how randomly the model generates text. A temperature of 0 makes the model close to deterministic. Higher temperatures introduce more randomness. How the model was accessed mattered too: the research found that GPT-4 accessed through a website performed better than the same model accessed through an API, even though both use the same underlying system.
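The effect of temperature can be illustrated with the textbook formulation: the model's raw scores for each candidate output are divided by the temperature before being turned into probabilities. The sketch below is a generic illustration under that assumption, not a description of how any particular vendor implements the setting, and the scores are invented.

```python
import math


def token_probabilities(logits, temperature):
    """Softmax over scores scaled by 1 / temperature.

    Temperature 0 is treated as the limiting case: all probability goes to
    the highest-scoring option (ties are split evenly).
    """
    if temperature < 0:
        raise ValueError("Temperature must be non-negative.")
    if temperature == 0:
        top = max(logits)
        winners = [x == top for x in logits]
        return [w / sum(winners) for w in winners]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Invented scores for three candidate answers.
scores = [2.0, 1.0, 0.5]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(scores, t)])
```

At temperature 0 the top-scoring answer is always chosen, while at higher temperatures the probabilities flatten out and lower-scoring answers are sampled more often, which is one reason repeated runs can disagree.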
Why It Matters
The study reveals that large language models cannot yet be trusted for medical decisions without careful attention to how questions are posed. The same model can swing from failing a question to acing it based purely on prompt structure.
For patients seeking health information online, the implications are straightforward but uncomfortable. You cannot rely on AI systems to give you consistent, reliable medical advice, no matter how sophisticated they appear. The answers you receive today might differ from tomorrow's answers to the same question.
For physicians considering AI assistance in clinical practice, the findings suggest that implementing these systems requires more than simply deploying a model. Success depends on using appropriate prompts, understanding which models work best for which tasks, and recognizing that even then, consistency remains imperfect.
The research hints at something deeper: current large language models don't truly understand medical principles. They pattern-match against training data. ROT prompting works better not because it triggers genuine reasoning but because it encourages the model to simulate a process that mimics human expert deliberation, producing more reliable outputs.
The Path Forward
The researchers suggest that asking language models the same question multiple times and comparing answers might reveal the most reliable response. They also advocate for developing prompt engineering guidelines specifically tailored for patients and physicians, not just generic instructions.
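One simple way to compare repeated answers is a majority vote: ask several times, keep the most frequent answer, and note how many runs agreed with it. The sketch below is one possible reading of that suggestion, not the researchers' procedure; the function name and sample answers are invented.

```python
from collections import Counter


def most_common_answer(answers):
    """Return the most frequent answer and the share of runs that gave it.

    When two answers tie, the one that appeared first is returned.
    """
    if not answers:
        raise ValueError("At least one answer is required.")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Invented example: five runs of the same question.
runs = ["moderate", "strong", "moderate", "moderate", "limited"]
answer, agreement = most_common_answer(runs)
print(answer, agreement)  # moderate 0.6
```

A low agreement share is itself useful information: it signals that the model's answer to this question is unstable and deserves extra scrutiny.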
Future work should combine multiple approaches. Prompt engineering alone won't solve the problem. Improvements in model architecture, parameter tuning, and specialized training all play roles. The most effective medical AI systems will likely combine optimized prompts with better base models and more carefully implemented fine-tuning.
The study opens a new frontier in medical AI. Before these systems can safely assist in clinical decisions, we need to understand not just how good the models are, but how to coax better performance from them. Prompt engineering may offer a bridge. It's a technique that requires no changes to the underlying AI system, only changes in how humans communicate with it.
Until that work is done, both patients and physicians should remember a simple rule: never trust an AI system's answer to a medical question without verification from authoritative sources or professional consultation. The technology is not yet ready to stand alone, even when it appears confident. How we talk to AI matters. But talking to a human expert matters more.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s41746-024-01029-4
Medical Disclaimer: This article is for informational and educational purposes only and does not constitute medical advice, diagnosis, or treatment. Always seek the advice of your physician or another qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read in this publication.






