Millions of people now ask artificial intelligence chatbots for health advice, trusting that if these systems can pass medical licensing exams with near-perfect scores, they should be able to help diagnose a mysterious rash, interpret chest pain, or decide whether to visit an emergency room. This intuition seems reasonable. Large language models like GPT-4 have achieved performance levels that rival physicians on standardized tests. Some AI-generated medical documents are rated as good as or better than those written by doctors. The promise of bringing medical expertise to anyone with a smartphone feels like an obvious next frontier for healthcare.
But a new study reveals a troubling paradox: when real people actually use these same AI systems to assess medical problems, the benefits vanish. People using the models do no better than those who simply search the internet or consult traditional resources. The gap between what these systems can do in isolation and what they accomplish when a real person is on the other end of the conversation is stark enough to raise serious questions about deploying medical AI to the general public.
The Illusion of Competence
Researchers recruited 1,298 UK participants and asked them to evaluate ten medical scenarios: situations like sudden severe headache, chest pain, abdominal pain, and other common conditions where people might turn to an AI chatbot at home. The scenarios were carefully designed by experienced physicians and validated to ensure clarity. Each scenario presented a realistic situation and asked participants two key questions: what healthcare service would they need, and what medical conditions might explain their symptoms?
The study divided participants into groups. Some used AI systems including GPT-4o, Llama 3, and Command R+, three of the most widely available large language models. Others received the same scenarios but used whatever resources they would typically consult at home: search engines, trusted websites, their own knowledge. When the researchers tested the AI models directly on these same scenarios, the results were impressive. GPT-4o correctly identified relevant medical conditions in 94.7 percent of cases and recommended the correct course of action 64.7 percent of the time. Llama 3 identified conditions in 99.2 percent of cases. Command R+ achieved 90.8 percent accuracy on condition identification.
These numbers suggest the systems possess genuine medical knowledge. They far exceed random guessing and are consistent with the strong results these models have posted on other medical benchmarks. Yet when human participants actually interacted with the same systems, the picture changed dramatically.
Participants using GPT-4o identified relevant conditions in just 34 to 42 percent of cases, depending on how the researchers measured it. Llama 3 users achieved 39 to 50 percent accuracy. Command R+ users scored 34 to 43 percent. All three groups performed worse than the control group, which relied on traditional resources and achieved 55 to 67 percent accuracy on identifying conditions. Accuracy on deciding what type of care to seek was around 43 percent across the AI-assisted groups, which is better than random guessing but not significantly better than the control group using conventional methods.
The numbers reveal failures at several points in the interaction between human and machine. When the AI models completed the same scenarios on their own, they outperformed the human groups by a large margin. But when humans joined the conversation, the systems' strengths seemed to dissolve into the interaction itself.
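The size of that gap can be checked with simple arithmetic. The sketch below uses only the condition-identification figures quoted in this article; the variable and function names are illustrative, and comparing each model's best-case user figure against the control group's lowest figure is a deliberately conservative assumption, not the study's own statistical method.

```python
# Percent of cases in which a relevant condition was identified,
# as quoted in this article (not recomputed from the study's raw data).
STANDALONE = {"GPT-4o": 94.7, "Llama 3": 99.2, "Command R+": 90.8}

# (low, high) ranges for participants using each model; the range reflects
# the two ways the researchers measured accuracy.
WITH_USERS = {"GPT-4o": (34, 42), "Llama 3": (39, 50), "Command R+": (34, 43)}

# (low, high) range for the control group using traditional resources.
CONTROL = (55, 67)


def gap_to_standalone(model):
    """Percentage-point drop from model-alone accuracy to the best-case user figure."""
    _, high = WITH_USERS[model]
    return round(STANDALONE[model] - high, 1)


def below_control(model):
    """True if even the best-case user figure is under the control group's lowest figure."""
    return WITH_USERS[model][1] < CONTROL[0]


for model in STANDALONE:
    print(f"{model}: drop of {gap_to_standalone(model)} points, "
          f"below control: {below_control(model)}")
```

With the figures quoted here, each model loses roughly 48 to 53 percentage points once a human is in the loop, and even the most favorable user figure sits below the control group's lowest one.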
Where Communication Breaks Down
The researchers analyzed transcripts of 30 participant-model conversations to understand what was happening. They found at least three critical failure points.
First, users often failed to provide complete information. In roughly half the conversations examined, participants gave incomplete descriptions of their symptoms in their initial message. They mentioned some symptoms but omitted details that would point toward the correct diagnosis. In some cases, users did provide more information later, prompted by the AI's questions. But the initial knowledge gap had already introduced error.
Second, even when the AI suggested correct diagnoses, users frequently did not incorporate this advice into their final answer. Across all interactions, the AI models mentioned relevant conditions in 65 to 73 percent of cases. Yet users' final responses included relevant conditions far less often than the conversations themselves had. This suggests a breakdown in communication between model and user, or a lack of trust that led people to disregard the suggestions they received.
Third, the AI systems themselves sometimes generated misleading or contextually inappropriate responses. In the analyzed cases, models made contextual errors, such as recommending both an Australian emergency number and a partial US phone number in the same conversation, even though the participants were in the UK. More troublingly, when two users described nearly identical symptoms of a dangerous brain bleed, the models gave opposite advice. One user was told to rest in a dark room. The other received the correct recommendation to seek emergency care immediately. Small variations in how users phrased the same problem led to drastically different outputs.
The LLMs also occasionally misinterpreted user queries, narrowly focusing on a single term that was not central to the problem or ignoring critical information provided earlier in the conversation. These errors persisted even as the conversation continued.
Why Medical Benchmarks Miss the Problem
One striking aspect of the findings is how well the models performed on standard medical benchmarks. When researchers tested the same models on multiple-choice medical questions from MedQA, a benchmark drawn from actual medical licensing exams, the systems achieved high scores. GPT-4o scored above 60 percent on relevant questions in 20 out of 20 scenarios. Llama 3 succeeded in 19 of 20. Yet these benchmark scores bore little relationship to how well the same models helped real people in interactive settings.
This disconnect matters because the medical AI field has relied heavily on benchmarks to evaluate system reliability before deployment. Passing medical exams is treated as a credible signal that an AI system is safe to use. But the study suggests this reasoning is incomplete. Medical knowledge is a necessary condition for helping people make better decisions, but it is not sufficient. The ability to retrieve and articulate medical information does not automatically translate to effective communication or to helping a non-expert person understand and act on that information.
Simulating Reality Fails Too
The researchers tried one more approach that is increasingly popular in AI safety: replacing human participants with other AI systems to simulate user interactions. They set up conversations where one AI acted as a patient and another acted as an advisor, mirroring the human-AI interaction structure.
The simulated interactions produced very different results from real conversations. The simulated "patients" performed better than actual people, with 57.3 percent accuracy on deciding what type of care to seek and 60.7 percent accuracy on identifying conditions. More importantly, the simulated results showed almost no correlation with real human performance. Plotted against each other, the two sets of results scattered widely, suggesting that AI-to-AI interactions do not capture the challenges that emerge when actual humans, with real uncertainty and limited medical knowledge, interact with these systems.
This finding has implications beyond medical AI. It suggests that as developers build safeguards for complex systems, simulations and synthetic interactions may not adequately predict real-world harms or benefits. Testing with actual humans becomes crucial.
The Central Challenge
The study does not argue that these models lack medical knowledge. It shows that medical knowledge alone is insufficient. The researchers identified three key barriers to effective human-AI medical collaboration. Users sometimes withhold or simplify information because they do not know what matters. Models fail to ask the right follow-up questions that a physician would ask. And the models themselves are inconsistent, responding differently to similar inputs, making it difficult for users to build reliable mental models of how to interact with them.
More training data or higher benchmark scores alone will not fix these problems. Fixing them will require a fundamental rethinking of how these systems engage with non-expert users in high-stakes situations. The systems need to be more consistent, more skilled at guided information gathering, and better at conveying uncertainty and limitations. They may need structural constraints that make them slower but more reliable, rather than faster but prone to occasional dangerous errors.
What Happens Next
With millions of people already consulting AI chatbots for health information, the findings raise urgent questions for healthcare practitioners and policymakers. Patients coming to doctors with AI-based opinions will arrive with information that may or may not be accurate. The study suggests these opinions are no more reliable than traditional internet searches, but without the scaffolding of established health websites. Developers of general-purpose AI platforms have commercial incentives to make their systems appear helpful and capable. Yet the evidence here suggests caution is warranted.
The authors recommend that before any future deployment of medical AI to the public, developers and regulators should conduct systematic testing with real humans, not just benchmark scores or simulations. They should observe how people actually interact with these systems, where they struggle, and where they might be misled. Only then can the medical AI field move beyond the illusion of competence toward systems that actually help.
Credit & Disclaimer: This article is a popular science summary written to make peer-reviewed research accessible to a broad audience. All scientific facts, findings, and conclusions presented here are drawn directly and accurately from the original research paper. Readers are strongly encouraged to consult the full research article for complete data, methodologies, and scientific detail. The article can be accessed through https://doi.org/10.1038/s41591-025-04074-y
Medical Disclaimer: This article is for informational and educational purposes only and does not constitute medical advice, diagnosis, or treatment. Always seek the advice of your physician or another qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read in this publication.






