The chatbot provides an answer. It sounds confident. The explanation spans several paragraphs, citing facts and reasoning through the problem. You trust it. You should not.
New research reveals a troubling disconnect between what large language models actually know and what users believe they know. When asked to evaluate how likely an AI assistant was to answer questions correctly, study participants consistently overestimated accuracy by substantial margins. The models themselves knew their own limitations reasonably well. The problem lay in communication.
This calibration gap, as researchers term it, threatens to undermine trust in increasingly ubiquitous AI systems. As language models integrate into medicine, education, law, and public policy, users need accurate assessments of reliability. Getting confidence communication right matters as much as getting answers right.
THE TRUST PROBLEM
Large language models have developed an unfortunate reputation. They generate responses that sound authoritative while being completely wrong. The technical term is hallucination, though confabulation might describe it better. The models construct plausible sounding nonsense, mixing facts with fiction so smoothly that distinguishing truth from fabrication challenges even experts.
This unreliability prompted developers to add warnings. OpenAI cautions users against uncritical acceptance of ChatGPT outputs. The companies recognize that their models cannot always tell users what they know versus what they fabricate. But warnings alone prove insufficient. Users need more granular guidance: which specific answers deserve trust and which require verification.
Recent technical work showed that models possess internal mechanisms reflecting knowledge boundaries. When answering multiple choice questions, the probability a model assigns to selected answers tracks reasonably well with actual accuracy. Models also distinguish answerable from unanswerable questions and truthful statements from lies. This internal representation suggests models maintain some form of epistemic awareness.
However, this internal confidence remains hidden from users. Model outputs consist purely of language. Users must infer reliability from text alone, without access to numerical confidence scores. This creates an opportunity for miscommunication. A model might assign 60 percent confidence internally while generating explanations that sound completely certain. Alternatively, it might generate tentative language despite 95 percent internal confidence.
The question becomes whether language models effectively convey their uncertainty through their natural language responses. The research examined this question directly.
MEASURING THE GAP
Researchers at the University of California, Irvine designed experiments testing human perceptions of AI confidence. The setup was straightforward. Models answered questions. Participants read the explanations. Then participants estimated the probability that each answer was correct.
The study used three language models: GPT-3.5, PaLM 2, and GPT-4o. Questions came from two datasets. The MMLU dataset provided multiple choice questions across diverse knowledge domains, from high school physics to professional medicine. The TriviaQA dataset supplied short answer questions on history, culture, science, and entertainment.
For each question, researchers extracted model confidence by reading internal token probabilities. This technical approach allows direct calculation of how much probability the model assigns to different possible answers. A model highly confident in answer A might assign it 95 percent probability. One uncertain might split probability more evenly across options.
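The extraction step described above can be sketched in code. This is a minimal illustration, assuming API access to per-option log probabilities (as several model APIs expose); it is not the study's exact procedure. The function name and input format are hypothetical.

```python
import math

def answer_confidence(option_logprobs: dict) -> tuple:
    """Turn per-option log probabilities into a (choice, confidence) pair.

    option_logprobs maps each answer option (e.g. "A".."D") to the log
    probability the model assigned to that option's token. A softmax over
    these values renormalizes them so the probabilities sum to 1.
    """
    # Subtract the max log probability before exponentiating, for stability.
    max_lp = max(option_logprobs.values())
    exp = {k: math.exp(v - max_lp) for k, v in option_logprobs.items()}
    total = sum(exp.values())
    probs = {k: v / total for k, v in exp.items()}
    # The selected answer is the highest-probability option; its probability
    # is the model's internal confidence in that answer.
    best = max(probs, key=probs.get)
    return best, probs[best]

# A model strongly favoring option A over the alternatives:
choice, conf = answer_confidence({"A": -0.05, "B": -3.2, "C": -4.0, "D": -5.1})
```

Here the near-zero log probability on A yields a confidence above 0.9, while a model splitting probability evenly across four options would land near 0.25.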
Importantly, participants never saw these numerical confidence scores. They saw only the language the model produced: the answer plus an explanation. This mirrors real world usage where model outputs consist entirely of text.
The experiments measured two key metrics. Expected calibration error assesses whether confidence scores match actual accuracy. Perfect calibration means 70 percent confidence predictions prove correct 70 percent of the time. Any deviation indicates miscalibration. The area under curve metric evaluates discrimination: how well confidence scores separate correct from incorrect answers.
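Both metrics are standard and can be computed from paired confidence scores and correctness labels. A minimal NumPy sketch, not the study's analysis code:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the absolute gap
    between each bin's mean confidence and its accuracy, weighted by bin size."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed on the left.
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def auc(conf, correct):
    """Discrimination: probability a randomly chosen correct answer received
    higher confidence than a randomly chosen incorrect one (ties count 0.5)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

With these definitions, a system that says 70 percent and is right 70 percent of the time scores an ECE of zero, and confidence scores that perfectly separate correct from incorrect answers score an AUC of 1.0, against 0.5 for random guessing.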
Results showed substantial gaps on both metrics. Model confidence demonstrated reasonable calibration and good discrimination. Human confidence based on reading model explanations showed neither.
THE OVERCONFIDENCE PROBLEM
When participants read default model explanations and estimated accuracy, they consistently judged models as more accurate than reality. For GPT-3.5 answering multiple choice questions with 52 percent accuracy, participants estimated accuracy would exceed 70 percent. For PaLM 2 with 47 percent accuracy, participants estimated over 65 percent. For GPT-4o answering short answer questions with 63 percent accuracy, participants estimated roughly 75 percent.
This overconfidence appeared across question types and models. Default explanations systematically misled users about reliability. The calibration diagrams showed the problem visually. Ideal calibration produces dots along the diagonal. Human confidence based on default explanations showed dots consistently below the diagonal, indicating overconfidence throughout the confidence range.
The distribution of confidence ratings revealed another issue. Participants frequently provided high confidence ratings even for questions the model answered incorrectly. Rather than spreading confidence across the full range, participants clustered ratings at the high end. This suggests shallow processing of explanation content.
The discrimination results proved equally concerning. Models internally distinguished correct from incorrect answers well above chance levels. Area under curve values exceeded 0.75 for all three models. Participants reading default explanations barely exceeded random guessing, with area under curve values around 0.59 to 0.60.
This discrimination gap means users could not tell which answers deserved trust and which required skepticism. The model knew. Users did not. Whatever internal representation allowed models to assess their own knowledge failed to transmit through language.
LONGER MEANS NOTHING
The second experiment manipulated explanation characteristics. Researchers created nine explanation variants by crossing three confidence levels (low, medium, high) with three length levels (long, short, uncertainty only).
Confidence levels affected human ratings strongly. Explanations beginning "I am not sure" produced significantly lower confidence than those beginning "I am somewhat sure," which in turn produced lower confidence than "I am sure." Users correctly interpreted explicit uncertainty language.
Length effects told a different story. Longer explanations increased user confidence compared to short explanations, which increased confidence compared to uncertainty only statements. This happened despite longer explanations containing no additional information helping users discriminate correct from incorrect answers.
Statistical analysis confirmed this. Across all three models and both question types, mean participant discrimination (area under curve) measured 0.54 for long explanations and 0.57 for uncertainty only explanations. The difference favored the shorter version. Extra length added nothing useful.
This length bias parallels findings in social psychology. Longer arguments often seem more persuasive regardless of actual content. People mistake quantity for quality. The same bias appears in peer review, where longer reviews seem more thorough despite equivalent information density. Language models inherit or amplify this human tendency.
The explanation seems straightforward. After committing to an answer, autoregressive models generate subsequent tokens maximizing likelihood of that answer. This produces confident sounding rationales even when initial answer selection involved substantial uncertainty. The generation process inflates perceived confidence above actual confidence.
This mirrors choice supportive bias in human psychology. After making decisions, people tend to inflate the desirability of chosen options while downplaying rejected alternatives. The post hoc rationalization makes decisions seem better than they were. Language models appear to exhibit analogous behavior algorithmically.
CLOSING THE GAP
Having established that default explanations mislead users, the researchers tested whether modified explanations could improve calibration and discrimination. The approach was simple. Select explanation style based on model confidence. Show low confidence explanations when model confidence is low, medium when medium, high when high.
This selection rule used confidence thresholds to bin questions into three categories. For questions where models assigned low confidence to their selected answer, show the "I am not sure" explanation variant. For medium confidence, show "I am somewhat sure." For high confidence, show "I am sure."
Applying this rule dramatically narrowed both calibration and discrimination gaps. Calibration error for human confidence decreased substantially across all three models compared to default explanations. The calibration diagrams showed dots moving much closer to the ideal diagonal.
Discrimination improvements proved equally striking. Area under curve values increased from around 0.59 to 0.60 for default explanations to around 0.65 to 0.70 when explanation style matched model confidence. Users became significantly better at distinguishing probably correct from probably incorrect answers.
The results demonstrate that bridging the calibration gap is achievable through prompt engineering. Models already possess reasonable internal calibration. The challenge lies in communication. By aligning verbal uncertainty expressions with internal confidence, systems can convey reliability more accurately.
Importantly, this improvement happened without changing the underlying model or adding new capabilities. The same model that generated misleading default explanations generated properly calibrated explanations. Only the prompt changed. This suggests immediate practical applications for deployed systems.
KNOWLEDGE WITHOUT EXPERTISE
The multiple choice experiments included an additional component. After evaluating model confidence, participants answered questions themselves with model assistance. This revealed whether participants possessed independent knowledge allowing them to improve on model answers.
Results showed minimal independent knowledge. Participant accuracy on questions matched model accuracy almost exactly: 51 percent versus 52 percent for GPT-3.5, and 45 percent versus 47 percent for PaLM 2. In 82 percent of cases, participants selected the same answer the model provided.
When participants did change the answer, accuracy dropped to 33 percent compared to 39 percent model accuracy for those same questions. Participants typically made things worse, not better. This confirms that the MMLU questions exceeded typical participant knowledge, as intended. Prior research showed that crowdworkers without specialized training scored 35 percent accuracy on similar questions.
Interestingly, aligning explanation style with model confidence did not improve participant answer accuracy. Users became better at recognizing when models were probably right or wrong, but lacked knowledge to override incorrect model answers with correct alternatives. Accurate uncertainty communication allows users to recognize unreliability but cannot substitute for missing knowledge.
The study also collected expertise self assessments. At the end of experiments, participants estimated their own accuracy on similar questions for each topic. Median self assessed expertise ranged from 30 percent for high school physics to 45 percent for world history. These realistic self assessments showed participants recognized their limitations.
However, perceived expertise did not improve performance. Researchers split participants into high and low expertise groups based on self ratings. The high expertise group showed marginally better discrimination (area under curve 0.600 versus 0.579) but the difference lacked statistical significance. Calibration error proved equivalent between groups. Believing oneself knowledgeable did not enable better assessment of model reliability.
IMPLICATIONS FOR DESIGN
These findings carry immediate implications for AI assistant design. Current practice presents model outputs without explicit confidence information. Users rely entirely on textual cues. As demonstrated, default textual explanations systematically mislead users about reliability.
The solution appears straightforward: modify prompts to align explanation uncertainty with model confidence. This requires minimal technical infrastructure. Systems already compute confidence scores for internal purposes. Routing these scores to prompt selection represents a small addition.
Implementation might work as follows. Compute model confidence for the selected answer. Bin confidence into categories (low, medium, high). Select from multiple explanation templates matching the confidence category. Generate the explanation. Present to user.
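Those steps can be sketched as a small routing function. The thresholds and phrasings below are illustrative assumptions, not the study's exact binning values; a deployed system would tune them against held-out calibration data.

```python
def uncertainty_prefix(confidence: float,
                       low_threshold: float = 0.5,
                       high_threshold: float = 0.8) -> str:
    """Map a model's internal confidence score to a verbal uncertainty phrase.

    The three phrases mirror the study's low / medium / high explanation
    variants; the threshold values here are placeholders.
    """
    if confidence < low_threshold:
        return "I am not sure, but my answer is"
    if confidence < high_threshold:
        return "I am somewhat sure the answer is"
    return "I am sure the answer is"

def calibrated_response(answer: str, confidence: float) -> str:
    """Wrap an answer in uncertainty language matched to model confidence."""
    return f"{uncertainty_prefix(confidence)} {answer}."
```

A response for a 45 percent confidence answer would then open with "I am not sure," while a 95 percent confidence answer opens with "I am sure," letting the text carry the internal confidence signal.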
This approach preserves natural language interaction while improving calibration. Users continue seeing purely textual outputs. The text now conveys uncertainty more accurately. No additional interface elements or numerical displays are needed.
Alternative approaches might incorporate explicit probability statements. Instead of verbal phrases like "I am not sure," models could state "I assign 45 percent confidence to this answer." Research on human interpretation of numerical probabilities suggests mixed effectiveness. People interpret the same percentage differently depending on context. Verbal probability phrases, despite imprecision, may communicate more naturally.
The length bias finding suggests another design principle: avoid unnecessarily verbose explanations. Current models often generate long responses, perhaps because human feedback during training rewarded detailed answers. However, length without content increases perceived confidence without improving accuracy. Concise explanations matching actual uncertainty serve users better.
Some applications might benefit from uncertainty only explanations for low confidence cases. When models lack confidence, simply stating "I am not sure" without elaboration prevents false confidence from lengthy rationalization. Users can decide whether to trust uncertain answers without being swayed by persuasive but unreliable reasoning.
UNANSWERED QUESTIONS
The research focused on specific question types: multiple choice and short answers. Whether findings generalize to longer form outputs, creative writing, or dialogue remains unclear. Open ended generation allows greater flexibility in expressing uncertainty. It also allows more ways to mislead users.
Another limitation involves the prompt modification approach. The study required prompting models twice: once to extract answer and confidence, again to generate confidence modified explanations. Single pass generation would prove more efficient. Research exploring how to elicit calibrated explanations in one step would advance practical deployment.
The fundamental cause of miscommunication deserves investigation. Why do models with reasonable internal calibration generate overconfident explanations? Two hypotheses emerged. First, reinforcement learning from human feedback may introduce bias. If human evaluators prefer confident detailed explanations, models learn to produce them regardless of actual certainty.
Second, the autoregressive generation process itself may inflate confidence. After committing to an answer token, the model generates subsequent tokens maximizing likelihood of that answer. This optimization pressure creates rationalization. The model constructs the most convincing possible explanation for its answer, even when the answer was tentative.
Testing these hypotheses requires examining training procedures and generation mechanisms. If human feedback causes the problem, modifying reward functions during training might help. If autoregressive generation causes it, alternative generation methods might work better. Understanding causes would enable more robust solutions.
The research also raises questions about different model architectures. The study used established large language models trained with standard techniques. Newer training approaches emphasizing accuracy over persuasiveness might generate better calibrated default explanations. Constitutional AI and other alignment methods might reduce calibration gaps.
EXPERTISE AND VERIFICATION
The knowledge results highlight a crucial limitation. Accurate uncertainty communication helps users recognize unreliability but cannot replace missing knowledge. In domains where users lack expertise, AI assistance raises new challenges.
The ideal scenario involves expert users assisted by AI. Medical doctors using AI diagnostic aids can evaluate suggestions using professional knowledge. They can catch errors. They can override incorrect recommendations. Accurate confidence communication helps by focusing attention on uncertain cases requiring extra scrutiny.
The concerning scenario involves novice users relying on AI without expertise to evaluate outputs. Students using AI to complete assignments lack knowledge to recognize subtle errors. Patients seeking medical information online cannot distinguish accurate from inaccurate health advice. Accurate confidence communication helps but proves insufficient.
This asymmetry appears throughout the experimental results. When models expressed uncertainty, participants correctly recognized low reliability. However, participants could not improve answers because they lacked relevant knowledge. Uncertainty communication provides a floor, not a ceiling, on performance.
Designing systems for novice users requires additional safeguards beyond confidence communication. Options include verification mechanisms, multiple independent sources, or deferring to human experts for uncertain cases. The key insight is recognizing that calibration alone does not ensure good outcomes when users cannot evaluate content.
BROADER CONTEXT
This work fits within growing research on AI trustworthiness and reliability. As AI systems integrate into critical applications, ensuring users can appropriately trust and verify outputs becomes essential. Overconfidence leads to automation bias, where users accept incorrect AI suggestions without questioning. Underconfidence leads to disuse, where users ignore helpful correct suggestions.
Recent work on AI explainability explored how to help users understand model reasoning. Feature highlighting shows which input regions influenced image classification. Attention visualization shows which words mattered for text analysis. Natural language rationales explain reasoning chains. However, research showed mixed results for these approaches improving human decision making.
The uncertainty communication angle adds a dimension. Rather than explaining how models reach conclusions, focus on communicating how confident models are in conclusions. Users may not need detailed reasoning traces if they know which outputs deserve trust and which require skepticism.
This connects to metacognition research in cognitive science. Human metacognition involves monitoring one's own knowledge, recognizing what one knows and does not know. Accurate metacognition correlates with better learning and decision making. Models with calibrated confidence exhibit something analogous to metacognition. The challenge involves transmitting this metacognitive state to users.
The finding that longer explanations increase perceived confidence despite not helping discrimination relates to cognitive fluency research. Information presented in fluent, easy to process formats seems more credible. Lengthy detailed explanations may increase processing fluency, leading users to infer greater accuracy. This represents a cognitive bias systems should avoid exploiting.
THE MISSING MEASUREMENTS
A striking gap in AI deployment is the absence of systematic uncertainty measurement in practice. Models clearly possess internal confidence representations. These remain largely hidden from users. Default interfaces present outputs as text without qualification.
Some argue that exposing uncertainty might reduce user trust excessively. Users might dismiss models as unreliable if frequently told "I am not sure." This concern assumes users currently hold appropriate trust levels. The research shows the opposite. Users display excessive trust, assuming higher accuracy than reality. Exposing uncertainty would recalibrate expectations toward accuracy.
Another argument suggests uncertainty communication adds complexity. Users want simple answers, not probabilistic assessments. This perspective treats users as passive consumers rather than decision makers. In any domain where AI assists important decisions, users need uncertainty information to decide how much weight to give suggestions.
The technical capability exists. Reading model confidence requires API access to token probabilities, available in research but not always in deployment. Making this information available and routing it to prompt selection requires minimal additional infrastructure. The barrier seems more cultural than technical.
Regulatory frameworks may eventually mandate uncertainty disclosure. As AI systems enter medical diagnosis, legal advice, and financial planning, regulations might require informing users about reliability. Proactive implementation would position companies favorably relative to future requirements. It would also demonstrate commitment to responsible deployment.
FROM DEMONSTRATION TO DEPLOYMENT
The research demonstrates proof of concept. Modified prompts improved calibration and discrimination significantly across multiple models and question types. The approach generalizes. Whether it performs equally well in practice depends on deployment details.
Real world usage differs from controlled experiments. Experiments presented isolated questions with model generated explanations. Users evaluated confidence for each question independently. Real usage involves extended conversations where context accumulates. Users may develop heuristics for assessing model reliability based on prolonged interaction.
The conversation context might help or hurt. It might help if users learn to recognize patterns distinguishing confident from uncertain responses, developing calibration through experience. It might hurt if early exchanges establish expectations that persist despite later contradicting evidence. Studying calibration in conversational contexts represents important future work.
Another deployment consideration involves adversarial users. The modified prompt approach works by including uncertainty language when confidence is low. Sophisticated users might learn to game this. They might probe for uncertainty expressions, then dismiss answers showing any doubt. This could lead to underuse of moderately confident but still useful suggestions.
Alternatively, users might learn that uncertainty expressions reliably indicate low accuracy. This would improve calibration but might reduce engagement. Finding the right balance between accurate communication and practical utility requires field testing. Laboratory results provide direction but cannot fully predict real world behavior.
System designers face tradeoffs. Maximum accuracy might involve refusing to answer when confidence is low. Maximum utility might involve attempting answers despite uncertainty, clearly communicating doubt. The right balance depends on application. Medical diagnosis requires erring toward caution. Creative brainstorming accepts speculation. One size does not fit all domains.
LOOKING FORWARD
The calibration gap between what models know and what users perceive they know represents a solvable problem. Models possess reasonable self knowledge. Prompt engineering can surface this knowledge through natural language. Users can interpret uncertainty expressions appropriately. The pieces exist. They need assembly.
Near term improvements seem achievable. Implementing confidence aware prompt selection requires modest engineering effort. Testing this in deployment would generate valuable data on real world effectiveness. Companies operating AI assistants could begin experiments immediately.
Longer term improvements might involve training procedures. Models trained explicitly to communicate uncertainty accurately might generate better default explanations. Rather than post hoc prompt modification, uncertainty calibration could become intrinsic to model behavior. Research exploring alternative training objectives would advance this direction.
The study focused on question answering, but implications extend to any AI generated content. News article generation, code completion, creative writing assistance, all involve uncertain outputs. Users benefit from understanding reliability regardless of domain. Generalizing uncertainty communication across use cases represents the broader challenge.
The findings also highlight the sophistication required of users. Even with improved calibration, users need judgment about when AI assistance helps and when it misleads. This demands AI literacy: understanding capabilities, limitations, and appropriate use. Education efforts alongside technical improvements will determine ultimate outcomes.
Perhaps the deepest implication involves rethinking the goal. Rather than creating AI assistants that always sound confident, aim for assistants that communicate uncertainty honestly. Rather than maximizing persuasiveness, maximize accuracy of confidence communication. This represents a fundamental shift in design philosophy.
Current practices often optimize for user satisfaction. Users prefer confident answers to tentative ones. They prefer detailed explanations to terse ones. Optimizing for satisfaction without regard to calibration produces overconfident systems. Optimizing for accurate uncertainty communication might initially reduce satisfaction scores but ultimately serves users better.
The tension between satisfying users in the moment and serving them well over time appears throughout technology design. Social media optimization for engagement creates addictive properties. Recommendation systems optimizing for clicks promote sensational content. AI assistants optimizing for impressive outputs generate overconfident explanations. In each case, focusing on long term user welfare requires resisting short term incentives.
The research provides tools for this resistance. It demonstrates that calibration gaps can be measured, understood, and reduced. It shows that users can interpret uncertainty appropriately when presented clearly. It offers practical approaches deployable today. Whether the AI industry embraces these tools depends on prioritizing reliability over impressiveness. That choice will determine whether AI assistants become trustworthy aids or unreliable oracles, confidently wrong when users most need accuracy.
PUBLICATION DETAILS:
Year of Publication: 2025
Journal: Nature Machine Intelligence
Publisher: Springer Nature
DOI: https://doi.org/10.1038/s42256-024-00976-7
CREDIT & DISCLAIMER: This article is based on original research conducted by scientists from the University of California, Irvine. Readers are strongly encouraged to consult the full research article for complete details, comprehensive data, methodology, and factual information. The original paper provides in depth technical analysis and should be referenced for academic or professional purposes.