Large language models don’t just make mistakes. They sometimes invent answers with striking confidence. A new paper from OpenAI researchers Adam Tauman Kalai, Ofir Nachum, and colleagues argues that these “hallucinations” are not mysterious glitches but predictable byproducts of the way today’s artificial intelligence (AI) systems are trained and tested.
The report, “Why Language Models Hallucinate,” traces the problem to two root causes: the way models learn language during pretraining, and the way they are judged during evaluation. Together, these forces create statistical pressure to guess rather than to acknowledge uncertainty.
The first stage, pretraining, exposes a model to massive datasets. The researchers argue that even if those datasets were perfect, hallucinations would still occur because the training objective — predicting the next word — maps onto the same error patterns seen in binary classification. For example, if a model sees a celebrity’s birthday once in training, it cannot reliably reproduce it later. As the authors explain, hallucinations are simply “errors in binary classification” magnified by the task of generating fluent language.
The paper illustrates this with striking cases. When asked the birthday of one of the paper’s authors, Adam Tauman Kalai, an open-source model confidently supplied three different but incorrect dates, even though the prompt asked for an answer only if the model knew it.
In another test, when asked to count the number of Ds in the word DEEPSEEK, several models produced answers ranging from 2 to 7, none of them correct (the word contains a single D). These examples, the authors argue, show how models “fill in the blanks” with plausible guesses when they lack reliable information or when the task itself is poorly represented in training.
Why Post-Training Keeps Errors Alive
The second stage, post-training, is supposed to refine models and reduce errors. Yet the paper argues that evaluation systems — benchmarks and leaderboards — end up encouraging bluffing instead of honesty. Most widely used tests reward correct answers but assign zero points to uncertainty or an “I don’t know” response. That means a model that always guesses will consistently score better than one that admits gaps in its knowledge.
As the authors put it: “Optimizing models for these benchmarks may therefore foster hallucinations. Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks. On the other hand, language models are primarily evaluated using exams that penalize uncertainty. Therefore, they are always in ‘test-taking’ mode.”
This framing helps explain why hallucinations remain stubborn even in the most advanced systems. Improvements in architecture, scale and alignment don’t change the fact that the scoring rules push models toward overconfidence.
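The incentive is easy to see with a little arithmetic. The sketch below is illustrative only: the function name and the 30% figure are assumptions chosen for the example, not values from the paper. It compares the expected score of guessing against abstaining under the binary grading described above, where a correct answer earns 1 point and anything else earns 0.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score on one question under 0/1 grading.

    p_correct is the model's chance of being right if it answers.
    Abstaining ("I don't know") always scores 0 under this rule.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return 0.0 if abstain else p_correct


# Hypothetical question where the model is right only 30% of the time.
guess = expected_binary_score(0.30, abstain=False)
idk = expected_binary_score(0.30, abstain=True)
print(f"guess: {guess:.2f}, abstain: {idk:.2f}")
```

Under this rule, guessing never scores below abstaining, however unsure the model is, so a system tuned to maximize the score has no reason to say “I don’t know.”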
The paper concludes that the solution isn’t another hallucination test but a redesign of the evaluation system itself. By modifying benchmarks to penalize wrong guesses more heavily than admissions of uncertainty, much as some standardized exams do, developers can realign incentives. The authors suggest explicit confidence thresholds, where models answer only if they are more than, say, 75% sure.
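To make the threshold idea concrete, here is a minimal sketch of one way such a rule can be scored. It assumes a scheme along the lines the paper describes: a correct answer earns 1 point, “I don’t know” earns 0, and a wrong answer costs t/(1 - t) points for a threshold t, so a 75% threshold implies a 3-point penalty. The function names and the sample confidence values are hypothetical, chosen for illustration.

```python
def wrong_answer_penalty(threshold: float) -> float:
    """Points deducted for a wrong answer at threshold t: t / (1 - t)."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    return threshold / (1.0 - threshold)


def expected_score_if_answering(confidence: float, threshold: float) -> float:
    """Expected points from answering: +1 if right, -penalty if wrong."""
    penalty = wrong_answer_penalty(threshold)
    return confidence * 1.0 - (1.0 - confidence) * penalty


def should_answer(confidence: float, threshold: float) -> bool:
    """Answer only when the expected score beats the 0 points for abstaining."""
    return expected_score_if_answering(confidence, threshold) > 0.0


# Assumed example values: a 75% threshold and three confidence levels.
t = 0.75
for confidence in (0.60, 0.75, 0.90):
    score = expected_score_if_answering(confidence, t)
    decision = should_answer(confidence, t)
    print(f"confidence {confidence:.2f}: expected {score:+.2f}, answer={decision}")
```

At exactly 75% confidence the expected score from answering is zero, the same as abstaining, so answering pays off only above the threshold. That is the reverse of the incentive created by plain right-or-wrong grading.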
For professionals in finance, payments and other sectors where accuracy is nonnegotiable, the takeaway is sobering. Hallucinations aren’t random quirks; they are systemic. They can also be expensive for businesses and consumers alike. Earlier this year, insurance companies started covering AI hallucination mishaps.
Unless the field changes how it measures performance, AI systems will continue to “sound right” while sometimes being wrong. But with better scoring, the researchers argue, AI could be nudged toward becoming a more trustworthy partner in high-stakes decision-making.
Source: https://www.pymnts.com/ 