When Language Models Bluff:
Convergent Insights on AI Misalignment
I’ve been closely analyzing OpenAI’s recent paper Why Language Models Hallucinate (Kalai et al., September 2025) alongside my own June 2025 paper, The Arbitration Hypothesis. Reading them together reveals something fundamental about how and why AI systems fail.
The Mathematics of Inevitable Hallucination
OpenAI’s paper provides an elegant proof: hallucinations are not anomalies, but a statistical inevitability of how we train and evaluate language models.
Their key insight is that generation is harder than classification. A model may classify whether a statement is true or false quite reliably, yet still produce falsehoods when asked to generate answers directly. They formalize this with a bound: the generative error rate must be at least twice the classification error rate. Even with perfect training data, pretraining on the cross-entropy objective guarantees that models will sometimes “guess.”
And post-training doesn’t fix it. Why? Because our benchmarks reinforce hallucination. Binary scoring (right or wrong, with no credit for “I don’t know”) systematically penalizes uncertainty. Models learn to act like perpetual test-takers: optimize for appearing knowledgeable, not for being truthful.
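A quick expected-value calculation makes the incentive concrete (a toy illustration of my own, not taken from either paper):

```python
# Toy expected-score comparison under binary grading (+1 right, 0 wrong,
# 0 for "I don't know"). With no penalty for wrong answers, guessing
# weakly dominates abstaining at every confidence level p, so a
# score-maximizing model never abstains.

def expected_score_guess(p: float) -> float:
    """Expected score if the model answers with probability p of being right."""
    return p * 1.0 + (1.0 - p) * 0.0

def expected_score_abstain() -> float:
    """Binary benchmarks give no credit for admitting uncertainty."""
    return 0.0

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p:.2f}: guess={expected_score_guess(p):.2f}, "
          f"abstain={expected_score_abstain():.2f}")
```

Even at one percent confidence, guessing scores (slightly) better than honesty. That is the test-taker’s logic in two functions.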
The Arbitration Hypothesis: My Complementary Framework
Three months before OpenAI’s paper, I proposed the Arbitration Hypothesis: misalignment arises from unranked pseudo-goals competing without internal arbitration. My central metaphor was the “smart kid” who bluffs to stay impressive rather than admit ignorance.
OpenAI opens with almost the same image: “Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty.”
Where their contribution is a mathematical inevitability proof, mine aimed at mechanism: models form pseudo-goals (truthfulness, coherence, helpfulness, safety) but lack any system for arbitrating among them. When those goals clash, the model defaults to whichever has been most rewarded historically.
What I Still Stand By
Models juggle multiple objectives without clear prioritization.
This produces predictable patterns of failure across domains.
The absence of arbitration is architectural, not incidental.
More reasoning ability without arbitration amplifies the problem.
What I’ve Revised
My early claims about persistent “pseudo-identities” were overstated.
The ATP experiments were suggestive but lacked statistical rigor.
Some behavioral predictions reached further than the data could support.
Beyond Hallucination: The Broader Misalignment Pattern
OpenAI focuses narrowly on factual hallucination. But their framework has wider implications: if models are optimized never to say “I don’t know,” the same dynamic applies to sycophancy, deception, and ethical incoherence.
Sycophancy: Agreeing with the user maximizes satisfaction, so the model prefers harmony over truth.
Deception: Confident fabrication scores better than honest hedging, so the model learns to project certainty it doesn’t have.
Ethical incoherence: Contradictory moral judgments arise when competing values aren’t arbitrated; the model defaults to whichever seems most rewarded.
These are the same pseudo-goal conflicts I outlined. OpenAI’s math shows why they persist.
The Calibration Connection
OpenAI also emphasizes calibration: does a model’s confidence match its accuracy? They show pretraining produces decent calibration, but reinforcement learning wrecks it, making models overconfident exactly when they should hedge.
Calibration isn’t just about probabilities; it’s about knowing when different goals should take precedence. A well-calibrated model would know when truthfulness overrides helpfulness, or when uncertainty overrides coherence. That’s arbitration by another name.
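To make “confidence should match accuracy” concrete, here is a minimal sketch of expected calibration error, the standard binning-based measure (the function and the toy data are mine, not from the paper):

```python
# Minimal expected calibration error (ECE): bin predictions by stated
# confidence, then compare average confidence to empirical accuracy per bin.
# An overconfident model has bins where confidence far exceeds accuracy.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Same outcomes, different stated confidence:
calibrated    = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
overconfident = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [1, 1, 0, 0])
print(f"calibrated ECE = {calibrated:.2f}, overconfident ECE = {overconfident:.2f}")
```

The second model gets exactly as many answers right; it is penalized purely for asserting certainty it didn’t earn, which is the failure OpenAI attributes to post-training.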
Converging on Solutions
Despite different methods, both papers converge on strikingly similar solutions:
OpenAI: Redesign evaluations with explicit confidence thresholds. Example: “Answer only if you are >75% confident, since mistakes are penalized more than abstaining.”
Arbitration Hypothesis (ATP 2.0): Build arbitration directly into the architecture. Explicitly surface goal conflicts and resolve them transparently, e.g., “I detect competing goals: providing a helpful answer vs. acknowledging uncertainty. Prioritizing truthfulness: I don’t have reliable information.”
One approach works at the evaluation layer, the other at the architectural layer. Both aim to make bluffing less attractive than epistemic humility.
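The “>75% confident” rule has a clean arithmetic reading: it is what falls out of a scoring scheme where a wrong answer costs three times as much as a correct one earns. Here is my own toy operationalization of that evaluation-layer fix (the scoring constants are assumptions, not from OpenAI’s paper):

```python
# Scoring rule where wrong answers are penalized and abstaining is free:
# +1 for correct, -penalty for wrong, 0 for "I don't know".
# Answering beats abstaining in expectation only when
# p > penalty / (1 + penalty).

def answer_threshold(penalty: float) -> float:
    """Confidence above which answering beats abstaining in expectation."""
    return penalty / (1.0 + penalty)

def expected_score(p: float, penalty: float) -> float:
    """Expected score for answering with probability p of being correct."""
    return p * 1.0 - (1.0 - p) * penalty

# A 3-point penalty reproduces the ">75% confident" rule from the text:
t = answer_threshold(penalty=3.0)
print(f"answer only if p > {t:.2f}")        # 0.75
print(expected_score(0.8, 3.0))             # positive: worth answering
print(expected_score(0.6, 3.0))             # negative: better to abstain
```

Under this rule, bluffing stops being the dominant strategy; abstention becomes the rational choice exactly where the model should be hedging.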
Why This Matters for AI Safety
The convergence of these independent approaches suggests we’re uncovering something fundamental. The problem isn’t that models occasionally err; it’s that they are systematically rewarded for confident fabrication.
This has immediate implications:
Benchmarks must change. Binary scoring entrenches hallucination. New metrics must reward calibrated uncertainty.
Architecture must evolve. Better training data won’t solve it. We need arbitration mechanisms inside the model.
Transparency must grow. Models should be able to articulate their internal goal conflicts and how they’re resolving them.
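To show what I mean by an arbitration mechanism that also satisfies the transparency requirement, here is a deliberately simplified sketch. It is entirely hypothetical: the class, the goal names, and the priority ordering are mine, not part of any existing architecture.

```python
# Hypothetical arbitration layer: pseudo-goals carry an explicit ranking,
# and conflicts are resolved by that ranking (and surfaced to the user)
# rather than by whichever goal was most rewarded historically.

from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    priority: int       # lower number = takes precedence
    satisfied: bool     # can this goal be met for the current query?

def arbitrate(goals: list[Goal]) -> str:
    """Act on the highest-priority satisfiable goal; report any conflict."""
    ranked = sorted(goals, key=lambda g: g.priority)
    unmet = [g.name for g in ranked if not g.satisfied]
    winner = next(g for g in ranked if g.satisfied)
    if unmet:
        return (f"Competing goals detected ({', '.join(unmet)} unmet). "
                f"Prioritizing {winner.name}.")
    return f"No conflict; acting on {winner.name}."

# The model lacks reliable information, so helpfulness cannot be satisfied
# honestly; truthfulness wins, and the conflict is stated out loud:
print(arbitrate([
    Goal("truthfulness", priority=0, satisfied=True),
    Goal("helpfulness",  priority=1, satisfied=False),
]))
```

The point of the sketch is the output, not the sorting: the resolution is explicit and inspectable, which is what the ATP 2.0 example response (“I detect competing goals…”) asks for.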
The Path Forward
OpenAI’s paper provides the statistical foundation. My Arbitration Hypothesis, for all its imperfections, identified the same structural pattern through a different lens. Together, they point to one conclusion: alignment is not about preventing errors; it’s about building systems that can arbitrate internal conflicts.
The hard questions now are engineering ones:
How do we embed arbitration into transformer architectures?
How do we redesign benchmarks at scale?
How do we make goal conflicts visible and resolvable?
These are not side questions. They go to the core of whether AI systems can be trusted. We need to know when a model is stating a fact, making a best guess, or simply telling us what it thinks we want to hear.
The fact that different research traditions are converging on this insight gives me hope. We’re starting to understand not just that AI fails, but why it fails, and that understanding is the first step toward real solutions.
Read my original June 2025 paper here: Zenodo link
Read OpenAI’s September 2025 paper here: OpenAI link



