Understanding and Mitigating AI Hallucinations
This briefing document summarizes the core insights from the provided sources regarding the phenomenon of AI hallucinations, their underlying causes, and proposed solutions.
1. The Nature of AI Hallucinations
AI hallucinations are defined as instances where large language models (LLMs) "confidently make things up," producing "plausible yet incorrect statements instead of admitting uncertainty." This differs fundamentally from human perceptual hallucinations. The problem is not necessarily about making models smarter or training them on more data; rather, it stems from the way AI models are currently trained and evaluated.
Key Facts:
  • LLMs often provide "overconfident, plausible falsehoods," which "diminish their utility."
  • Examples include generating incorrect birthdates or dissertation titles for known individuals, even when explicitly asked to respond "only if known."
  • Hallucinations can be "intrinsic" (contradicting the user's prompt, e.g., miscounting letters in a word) or "extrinsic" (contradicting training data or external reality).
Quote: "Language models are known to produce overconfident, plausible falsehoods, which diminish their utility. This error mode is known as 'hallucination,' though it differs fundamentally from the human perceptual experience." – why-language-models-hallucinate.pdf
2. Root Causes: Training and Evaluation Incentives
The core argument across both sources is that AI models hallucinate because the current training and evaluation paradigms inadvertently reward guessing over honesty.
Main Themes:
  • "Terrible Test-Takers": LLMs are "essentially training AI to be terrible test-takers who guess instead of admitting uncertainty."
  • Binary Scoring: Most benchmarks operate like "multiple-choice exams" with "binary 0-1 scheme[s]" where "1 point for a correct answer and none for blanks or IDKs." This incentivizes guessing, as "leaving an answer blank guarantees failure but guessing gives you a 1-in-365 chance of nailing someone's birthday" (see the expected-score sketch after this list).
  • Vicious Cycle: This leads to models learning to "bluff," generating "confident-sounding nonsense rather than admit uncertainty." As models become more capable, they continue to hallucinate because "that's what scores best on tests."
  • Statistical Origins (Pretraining): Hallucinations "originate simply as errors in binary classification." Even with error-free training data, the statistical objectives minimized during pretraining can lead to errors. This is due to factors like:
      • Arbitrary Facts: When there is no learnable pattern in the data (e.g., specific birthdays), models are likely to hallucinate, with the hallucination rate being at least the "fraction of training facts that appear once" (see the singleton-rate sketch after this list).
      • Poor Models: The model architecture itself may be insufficient to represent the concept well (e.g., trigram models struggling with longer dependencies) or may not be a good fit even if expressive enough.
      • Computational Hardness: Problems that are computationally intractable even for superhuman AI will lead to errors if the model attempts to solve them rather than defer.
      • Distribution Shift (OOD Prompts): Prompts that differ significantly from the training data can induce errors.
      • GIGO (Garbage In, Garbage Out): Training corpora often contain factual errors, which base models can replicate.
  • Persistence (Post-Training): Despite efforts to reduce hallucinations during post-training (e.g., RLHF), they persist because "guessing when unsure maximizes expected score under a binary 0-1 scheme." Existing primary evaluations "overwhelmingly penalize uncertainty."
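The incentive arithmetic behind binary grading is easy to make concrete. The minimal sketch below (function name and structure are illustrative, not drawn from the sources; the 1-in-365 figure is the birthday example quoted above) shows why guessing dominates abstaining when correct answers score 1 and everything else scores 0.

```python
# Minimal sketch (illustrative): expected score under binary 0-1 grading.
# Correct answers earn 1 point; wrong answers and "I don't know" both earn 0.

def expected_score_binary(p_correct: float, guess: bool) -> float:
    """Expected score under the binary scheme: guessers earn p_correct on
    average, abstainers always earn 0."""
    return p_correct if guess else 0.0

p_birthday = 1 / 365  # chance of nailing someone's birthday by guessing

print(f"guess: {expected_score_binary(p_birthday, guess=True):.4f}")   # ~0.0027
print(f"IDK:   {expected_score_binary(p_birthday, guess=False):.4f}")  # 0.0000
# Guessing strictly dominates abstaining, so a model optimized against such
# benchmarks learns to bluff rather than admit uncertainty.
```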
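The "fraction of training facts that appear once" claim can likewise be illustrated with a toy count. The sketch below uses invented data and a simplified counting rule; it illustrates the singleton-rate idea rather than reproducing the paper's estimator.

```python
# Toy illustration of the "singleton rate": the share of training examples
# whose fact appears exactly once in the corpus. The data here is invented;
# the paper's claim is that the hallucination rate on such arbitrary facts
# is at least this share.
from collections import Counter

training_facts = [
    ("Alice", "birthday", "03-07"),
    ("Bob", "birthday", "11-21"),
    ("Alice", "birthday", "03-07"),  # repeated fact -> a learnable pattern
    ("Carol", "birthday", "09-14"),
    ("Dave", "birthday", "06-02"),
]

counts = Counter(training_facts)
singletons = sum(1 for c in counts.values() if c == 1)
singleton_rate = singletons / len(training_facts)

print(f"singleton rate: {singleton_rate:.2f}")  # 0.60
# Three of the five training examples are one-off facts with no pattern to
# learn, so errors on at least that share of such queries are expected.
```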
Quotes:
  • "According to OpenAI's new research, hallucinations persist because we're essentially training AI to be terrible test-takers who guess instead of admitting uncertainty." – Rewarding AI Honesty: Curbing Hallucinations
  • "Show me the incentives, and I’ll show you the outcomes, as they say…" – Rewarding AI Honesty: Curbing Hallucinations
  • "Language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This 'epidemic' of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations." – why-language-models-hallucinate.pdf
  • "On three separate attempts, a state-of-the-art open-source language model output three incorrect dates: '03-07', '15-06', and '01-01', even though a response was requested only if known. The correct date is in Autumn." – why-language-models-hallucinate.pdf
3. Proposed Solutions: Reforming Evaluation Metrics
The solution is "surprisingly simple: change how we grade AI tests." Instead of focusing solely on new hallucination evaluations, the priority should be to "modify the scoring of existing benchmarks."
Key Proposals:
  • Penalize Wrong Answers More Heavily: Training needs to "penalize wrong answers more than 'I don't know' responses."
  • Partial Credit for Uncertainty: Give "partial credit for appropriate uncertainty." This would make "admitting you don't know the smarter strategy."
  • Explicit Confidence Targets: Evaluations should "explicitly state confidence targets in their instructions," similar to how some human standardized exams penalize incorrect answers. For example: "Answer only if you are > t confident, since mistakes are penalized t/(1 - t) points, while correct answers receive 1 point, and an answer of 'I don't know' receives 0 points." (See the worked scoring sketch after this list.)
  • Behavioral Calibration: This approach "requires the model to formulate the most useful response in which it is at least t confident" rather than explicitly stating probabilistic confidence.
  • Socio-Technical Mitigation: The challenge is not just modifying evaluations but ensuring these changes are "adopted in the influential leaderboards" which currently reinforce hallucinatory behavior.
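To make the quoted threshold rule concrete, the sketch below (threshold value and function name are illustrative) computes the expected score of answering versus saying "I don't know" under the +1 / -t/(1 - t) / 0 scheme, showing that answering pays off in expectation only when confidence exceeds t.

```python
# Sketch of the confidence-target scoring rule quoted above:
# +1 for a correct answer, -t/(1 - t) for a wrong one, 0 for "I don't know".

def expected_score(p: float, t: float, answer: bool) -> float:
    """Expected score for a model that is correct with probability p,
    under confidence target t. Abstaining ("I don't know") always scores 0."""
    if not answer:
        return 0.0
    penalty = t / (1 - t)
    return p * 1.0 - (1 - p) * penalty

t = 0.75  # confidence target stated in the instructions
for p in (0.50, 0.75, 0.90):
    print(f"p={p:.2f}  answer={expected_score(p, t, True):+.3f}  "
          f"IDK={expected_score(p, t, False):+.3f}")
# p=0.50 -> answering is negative in expectation; p=0.90 -> positive.
# At p = t the two options break even, which is exactly the intended target:
# answer only when you are more than t confident, otherwise say "I don't know".
```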
Quotes:
  • "OpenAI proposes something surprisingly simple: change how we grade AI tests." – Rewarding AI Honesty: Curbing Hallucinations
  • "If implemented across the industry, this could mean AI assistants that actually tell you when they're unsure instead of confidently serving up made-up facts. Less misinformation, more trust, and finally… an AI that knows what it doesn't know." – Rewarding AI Honesty: Curbing Hallucinations
  • "We argue that the majority of mainstream evaluations reward hallucinatory behavior. Simple modifications of mainstream evaluations can realign incentives, rewarding appropriate expressions of uncertainty rather than penalizing them." – why-language-models-hallucinate.pdf
4. Current State of Evaluations
An analysis of popular AI evaluation benchmarks reveals a strong bias towards binary grading, reinforcing the problem.
Key Findings:
  • Dominance of Binary Grading: A meta-evaluation of popular benchmarks (e.g., GPQA, MMLU-Pro, IFEval, Omni-MATH, BBH, MATH, MuSR, SWE-bench, HLE) found that the "vast majority of popular evaluations have binary grading."
  • Lack of IDK Credit: Most benchmarks give no credit for "I don't know" responses (IDK credit listed as "None").
  • WildBench Exception (Limited): WildBench is noted as one of the few benchmarks that offers "Partial" credit for uncertainty, though even there an IDK response might score lower than a "fair" response containing factual errors, still "reinforcing hallucination."
  • LM Judges Can Exacerbate: Even when language models are used to judge outputs, they "are also found to incorrectly judge answers," potentially grading "incorrect long responses as correct," which further "encourage[s] hallucinatory behavior."
Quote: "Therefore, additional hallucination evaluations may not suffice when the primary evaluations penalize honestly reporting confidence and uncertainty." – why-language-models-hallucinate.pdf
5. Discussion and Limitations
The proposed framework provides a statistical understanding of hallucinations, but certain aspects require further consideration:
  • Plausibility and Nonsense: The analysis focuses on "plausible falsehoods," largely setting aside truly nonsensical outputs, which are rare in state-of-the-art models.
  • Open-ended Generations: While the framework can accommodate open-ended prompts by defining falsehoods as errors, it would be natural to consider "degrees of hallucination."
  • Search/RAG are not Panaceas: While Retrieval-Augmented Generation (RAG) can reduce hallucinations, the fundamental problem of binary grading still rewards guessing when search fails to yield a confident answer.
  • Latent Context: The current error definition does not account for ambiguities that depend on external context beyond the prompt and response.
  • False Trichotomy: The simple "correct/incorrect/IDK" categories are incomplete, but explicit confidence targets offer a practical improvement over a "false dichotomy" (right/wrong only).
  • Beyond "IDK": While "IDK" is a focus, models may ultimately signal uncertainty through more nuanced linguistic constructions like hedging or omitting details, aiming for "linguistic calibration."
Conclusion
AI hallucinations are not an insurmountable mystery but a predictable outcome of misaligned incentives in current training and evaluation practices. By shifting from an all-or-nothing binary grading system to one that explicitly penalizes incorrect answers more severely than expressions of uncertainty and rewards appropriate confidence, the field can steer towards more trustworthy AI systems that "know what they don't know." This change requires a "socio-technical mitigation," modifying widely adopted benchmarks and leaderboards to realign incentives across the industry.