
LLM Construction

Hallucination Theory

Why language models confabulate, framed by the Kalai-Nachum-Vempala-Zhang (2025) statistical lower bound on pretraining error and the evaluation-incentive equilibrium that prevents post-training from removing it. Calibration, conformal prediction, structural failure modes (reversal curse, snowballing), and the measurement gap (FActScore, SAFE).

Advanced · Tier 1 · Current · Supporting · ~80 min

Why This Matters

Language models produce fluent, confident, and sometimes false text. This is not a defect that scale, data quality, or alignment will quietly remove. Kalai, Nachum, Vempala, and Zhang (2025) give the first sharp statistical reason: the rate at which a base language model emits confident wrong answers on rare facts is bounded below by the fraction of facts in its training corpus that appear exactly once. Post-training does not fix the problem because every standard benchmark scores abstention as wrong, so the policy that maximizes benchmark score is the one that bluffs.

The page makes both halves of this argument precise, then organizes the rest of the hallucination literature around them: when calibration helps and when it cannot, where conformal prediction earns its place, and the structural failure modes (reversal curse, snowballing, long-tail entities) that are not calibration phenomena at all.

Singleton-rate lower bound on hallucination (Kalai et al. 2025, Theorem 2)

[Figure: histogram of the training corpus by fact frequency (share of facts vs. number of times a fact appears): 25% appear once, 15% twice, 18% three to five times, 18% six to twenty, 14% twenty-one to a hundred, 10% more than a hundred. The hallucination floor is at least the singleton rate, 25% in this example. Bound: $\mathrm{err} \geq \mathrm{sr} - 2/\min_c|\mathcal{E}_c| - (35 + 6\ln N)/\sqrt{N} - \delta$; the first term dominates at scale.]

The Statistical Origin: A Lower Bound on Pretraining Error

Pretraining minimizes cross-entropy on a corpus drawn from a true distribution $p$. The model $\hat{p}$ is then used generatively: sample tokens, return text. A "hallucination" is a confidently sampled output that lies outside the valid set $\mathcal{V}$ of correct responses to the prompt. The first question is whether any base model, however well trained, can drive this rate to zero.

Kalai et al. answer no, and give a quantitative bound by reducing generation to binary classification.

Definition

The Is-It-Valid (IIV) reduction

For prompt $c$, let $\mathcal{V}_c$ be the set of valid responses and $\mathcal{E}_c$ be a set of plausible-but-wrong "error" responses. Define the binary classifier $\hat{f}(x) = +$ iff $\hat{p}(x) > 1/|\mathcal{E}_c|$, that is, iff the generative model assigns the response above-uniform mass relative to the error set. The IIV problem is to predict whether $x \in \mathcal{V}_c$ given a sample $x$ drawn from a 50/50 mixture of valid responses and uniform errors. A density-estimation model induces an IIV classifier; bounding generation error in terms of IIV error is the path from "good density estimator" to "few hallucinations."
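The induced classifier is just a threshold test on the model's probability. A toy sketch, with an invented prompt, response probabilities, and error-set size:

```python
# Toy sketch of the induced Is-It-Valid classifier: predict "valid"
# iff the model puts above-uniform mass on the response relative to
# the error set. All responses and numbers here are illustrative.

def iiv_classify(p_hat, x, error_set_size):
    """Predict + (valid) iff p_hat(x) > 1/|E_c|."""
    return p_hat(x) > 1.0 / error_set_size

# A toy density model for the prompt "capital of France":
probs = {"Paris": 0.90, "Lyon": 0.04, "Marseille": 0.03, "Nice": 0.03}
p_hat = lambda x: probs.get(x, 0.0)

error_set_size = 100  # |E_c|: plausible-but-wrong completions

print(iiv_classify(p_hat, "Paris", error_set_size))   # True
print(iiv_classify(p_hat, "Berlin", error_set_size))  # False: zero mass
```

Note that "Lyon" at mass $0.04$ also classifies as valid here; that is a false positive of exactly the kind the reduction counts against the generator.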

Theorem

Generation-to-classification reduction (Kalai-Nachum-Vempala-Zhang 2025, Theorem 1)

Statement

The base model's generation error rate $\mathrm{err}$ (the probability of sampling an invalid response) satisfies

$$\mathrm{err} \;\geq\; 2 \cdot \mathrm{err}_{\mathrm{IIV}} \;-\; \frac{\max_c |\mathcal{V}_c|}{\min_c |\mathcal{E}_c|} \;-\; \delta,$$

where $\mathrm{err}_{\mathrm{IIV}}$ is the binary classification error of the induced IIV classifier and $\delta$ is a calibration term.

Intuition

A model that generates valid text well must, in particular, distinguish valid from invalid responses well. The bound makes the converse precise: generation error is at least twice classification error, modulo a ratio of valid-to-error set sizes and a calibration slack. There is no way to be a perfect generator without being a good classifier; classifier error sets a floor on generation error.

Proof Sketch

The argument is a direct calculation. The IIV classifier predicts $+$ exactly when $\hat{p}(x) > 1/|\mathcal{E}_c|$. Errors split into false positives (invalid $x$ assigned high mass) and false negatives (valid $x$ assigned low mass). False positives lower-bound the probability that the generator emits an invalid response with above-uniform mass; false negatives lower-bound the probability that it under-samples valid responses. Summing the two contributions gives $2 \cdot \mathrm{err}_{\mathrm{IIV}}$. The subtracted term $\max_c|\mathcal{V}_c|/\min_c|\mathcal{E}_c|$ accounts for prompts where the valid set is unusually large relative to the error set (so a uniform generator would already classify well). The $\delta$ term absorbs the gap between $\hat{p}$ and the true marginal $p$ on the valid set.

Why It Matters

This converts an unsupervised question (does my generative model hallucinate?) into a supervised one (can a classifier separate valid from invalid completions?). Whatever lower bound applies to IIV classification immediately applies to generation. The next theorem turns this into a concrete bound that depends only on the corpus statistics.

Failure Mode

The bound is loose when the calibration term $\delta$ is large or when the valid set is comparable in size to the error set. It is meant as a floor on hallucination rate under reasonable conditions, not a tight characterization. The interesting case is when $\mathcal{E}_c$ is much larger than $\mathcal{V}_c$ (most prompts have a few correct answers and many plausible-looking wrong ones), where the subtracted ratio is negligible.

The reduction is a setup. The payoff is the next bound, which says you can lower-bound hallucination rate from properties of the training corpus alone, before training ever runs.

Definition

Singleton rate

A prompt $c$ is a singleton in a training corpus if and only if exactly one response to $c$ appears in the corpus. The singleton rate $\mathrm{sr}$ is the fraction of prompts in the corpus that are singletons. Intuitively, singletons are facts the model has seen exactly one example of; the corpus gives no redundancy and no triangulation. Kalai et al. give the precise definition in Section 3.3.1.
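A minimal sketch of the counting involved, treating the corpus as a flat list of prompt occurrences (a simplification of the paper's Section 3.3.1 definition; the prompts are invented):

```python
from collections import Counter

def singleton_rate(corpus_prompts):
    """Fraction of corpus items whose prompt occurs exactly once."""
    counts = Counter(corpus_prompts)
    singleton_items = sum(1 for c in counts.values() if c == 1)
    return singleton_items / len(corpus_prompts)

# Invented toy corpus: one fact repeated four times, two facts seen once.
prompts = (["When was Einstein born?"] * 4
           + ["Who advised obscure scholar A?",
              "Who advised obscure scholar B?"])
print(round(singleton_rate(prompts), 3))  # 0.333
```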

Theorem

Singleton-rate lower bound on hallucination (Kalai-Nachum-Vempala-Zhang 2025, Theorem 2)

Statement

The base model's generation error rate satisfies

$$\mathrm{err} \;\geq\; \mathrm{sr} \;-\; \frac{2}{\min_c |\mathcal{E}_c|} \;-\; \frac{35 + 6 \ln N}{\sqrt{N}} \;-\; \delta,$$

where $\mathrm{sr}$ is the singleton rate of the corpus and $N$ is its size.

Intuition

If a fact appears exactly once in the training corpus, the model has no statistical lever to distinguish "rare-but-true" from "rare-and-wrong." The best it can do on singletons is guess. The fraction of guessable prompts is the singleton rate, and generation error on the corpus inherits this floor. The other terms shrink with corpus size $N$; the singleton rate does not.

Proof Sketch

Combine the IIV reduction (Theorem 1) with a counting argument on singletons. For singleton prompts the IIV classifier has at-chance accuracy because the empirical distribution gives no signal beyond a single sample. So $\mathrm{err}_{\mathrm{IIV}} \geq \mathrm{sr}/2 - O(1/\sqrt{N})$ by a uniform-convergence argument on the empirical measure. Plugging into Theorem 1 and rearranging gives the stated bound, with the $(35 + 6 \ln N)/\sqrt{N}$ term coming from a standard generalization bound (the explicit constants are derived in the paper).

Why It Matters

Three immediate consequences.

First, scaling the corpus reduces all terms except $\mathrm{sr}$ and $\delta$. The singleton-rate floor is a property of how facts are distributed in the world, not how much text you collect. World knowledge has a long tail: most named entities, most events, and most academic results appear in only a handful of sources. As you scrape more text, you discover more singletons at roughly the same rate, so $\mathrm{sr}$ does not vanish.

Second, the bound is on the base model's error, before any post-training. RLHF, Constitutional AI, and DPO act on a fixed pretrained model; they cannot lower the singleton-rate floor by themselves.

Third, the bound holds regardless of model architecture. A trillion-parameter transformer trained on the same corpus inherits the same floor. Capacity is not the bottleneck; data redundancy is.

Failure Mode

The bound is not tight on every slice of the data. Hallucination on frequent facts (entities that appear many times, well-tested benchmarks) can be much lower than $\mathrm{sr}$ because the relevant prompts are not singletons. The bound concerns the average over the corpus distribution; subset-level rates can be far smaller or larger. The bound also assumes the IIV setup with a discrete error set; it does not directly apply to continuous outputs (regression, ranking).

The empirical complement of this theorem is Mallen et al. (ACL 2023), who built PopQA: 14,000 entity questions stratified by Wikipedia popularity. They report that LM accuracy on long-tail entities barely improves with scale, while retrieval-augmented models dominate on the same slice. The shape of their plot is what the singleton-rate bound predicts: scaling fixes the head of the distribution, not the tail.
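Plugging corpus statistics into the Theorem 2 floor is a one-line calculation. The numbers below are illustrative, not measured from any real corpus:

```python
import math

def hallucination_floor(sr, min_error_set, N, delta=0.0):
    """Theorem 2 lower bound:
    err >= sr - 2/min|E_c| - (35 + 6 ln N)/sqrt(N) - delta."""
    return (sr - 2.0 / min_error_set
            - (35 + 6 * math.log(N)) / math.sqrt(N) - delta)

# Illustrative: at N = 1e9 the concentration term is about 0.005,
# so the floor sits just below the singleton rate.
print(round(hallucination_floor(sr=0.25, min_error_set=1000, N=10**9), 3))  # 0.243
```

The point of the calculation: the two subtracted terms vanish as $N$ grows, so at web scale the floor is essentially the singleton rate itself.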

Calibration Theory: What It Buys and What It Does Not

If we cannot eliminate hallucination, can we at least know when the model is wrong? Calibration is the formal version of this question. A calibrated model whose top-1 confidence is $0.7$ is correct $70\%$ of the time among predictions with confidence $0.7$.

Definition

Calibration

A predictor $\hat{p}_\theta$ is calibrated if and only if for all $p \in [0,1]$,

$$\Pr(Y = y \mid \hat{p}_\theta(Y = y \mid X) = p) = p.$$

Calibration says the model's stated confidence equals its empirical accuracy at that confidence. It is a marginal property over the joint distribution of $(X, Y)$, not a guarantee about any particular prompt.

Definition

Expected Calibration Error

With top-class confidence $C(X) = \max_y \hat{p}_\theta(y \mid X)$ and prediction $\hat{Y}(X) = \arg\max_y \hat{p}_\theta(y \mid X)$, the population ECE is

$$\mathrm{ECE} = \mathbb{E}_X \bigl[\, \lvert \Pr(Y = \hat{Y}(X) \mid C(X)) - C(X) \rvert \,\bigr].$$

The standard binned estimator (Guo et al. 2017) partitions the confidence interval into $M$ bins $B_1, \ldots, B_M$ and computes

$$\widehat{\mathrm{ECE}} = \sum_{m=1}^M \frac{|B_m|}{n} \,\bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|.$$

Almost every paper reports the binned estimator. It is not the same object as the population ECE; binning bias is well documented (Nixon et al. 2019, Roelofs et al. 2022).
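A minimal pure-Python version of the binned estimator, with equal-width bins as in the usual Guo-style setup (the bin count and toy data are arbitrary):

```python
def binned_ece(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        m = min(int(c * n_bins), n_bins - 1)  # clamp c = 1.0 into last bin
        bins[m].append((c, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(c for c, _ in bucket) / len(bucket)  # conf(B_m)
        acc = sum(y for _, y in bucket) / len(bucket)   # acc(B_m)
        ece += (len(bucket) / n) * abs(acc - conf)
    return ece

# An overconfident toy model: says 0.9, is right 60% of the time.
conf = [0.9] * 10
corr = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(round(binned_ece(conf, corr), 6))  # 0.3
```

The binning bias mentioned above is visible here: with different `n_bins` the same data can land in different buckets and yield a different estimate, even though the population ECE is fixed.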

Theorem

Brier score calibration-refinement decomposition (DeGroot-Fienberg 1983)

Statement

Let $\bar{p}(x) = \Pr(Y = 1 \mid \hat{p}_\theta(x) = p(x))$ be the conditional probability of $Y$ given the forecast value. Then

$$\mathbb{E}[(p - Y)^2] \;=\; \underbrace{\mathbb{E}[(p - \bar{p})^2]}_{\text{miscalibration}} \;+\; \underbrace{\mathbb{E}[\bar{p}(1 - \bar{p})]}_{\text{refinement (irreducible)}}.$$

The forecast is calibrated when the first term vanishes ($p = \bar{p}$ a.s.) and sharp when $\bar{p}$ concentrates near $0$ or $1$.

Intuition

Predictive quality decomposes into two independent components. Calibration asks whether the model's $80\%$ confidence is matched by $80\%$ accuracy. Refinement asks how much signal the model's forecast carries about the outcome. A constant predictor that always says $\Pr(Y=1) = \pi$, where $\pi$ is the base rate, is perfectly calibrated and useless: its refinement term is $\pi(1-\pi)$, the worst possible, so the forecast carries no information about the outcome. A discriminating predictor that always assigns $0$ or $1$ is sharp; whether it is calibrated depends on how often it is right.

Why It Matters

This decomposition explains what temperature scaling and Platt scaling do: they fix calibration without changing refinement. It also explains why a calibrated model is not necessarily a good model. Both components matter. For LLMs the decomposition makes a sharper claim: post-RLHF degradation of calibration (Tian et al. 2023) does not necessarily reduce refinement, so the underlying capability is preserved while the uncertainty signal is destroyed. The fix is to recover calibration without retraining the policy.

Failure Mode

The clean additive split is specific to the Brier score. The log score and other Bregman-divergence proper scoring rules admit analogous but different decompositions (Gneiting and Raftery 2007). There is no universal additive calibration-refinement split that holds for every proper scoring rule.
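For the Brier score the split can be verified exactly on an empirical sample by grouping outcomes on the forecast value. The forecasts and outcomes below are invented:

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Empirical DeGroot-Fienberg split: Brier = miscalibration + refinement,
    grouping binary outcomes by forecast value to estimate p_bar."""
    n = len(forecasts)
    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)
    brier = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / n
    miscal = refine = 0.0
    for p, ys in groups.items():
        p_bar = sum(ys) / len(ys)   # empirical Pr(Y = 1 | forecast = p)
        w = len(ys) / n
        miscal += w * (p - p_bar) ** 2
        refine += w * p_bar * (1 - p_bar)
    return brier, miscal, refine

# Overconfident at 0.9 (conditional rate 0.5), miscalibrated at 0.1 too.
b, m, r = brier_decomposition([0.9] * 4 + [0.1] * 4, [1, 0, 1, 0, 0, 0, 0, 1])
print(round(b, 4), round(m + r, 4))  # 0.31 0.31: the identity holds
```

For binary $y$, the per-group mean squared error is exactly $(p - \bar p)^2 + \bar p(1-\bar p)$, which is why the two sides agree to floating-point precision rather than approximately.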

The Kadavath-Tian-Kalai synthesis. Three findings, in chronological order, told a coherent story only after Kalai et al. supplied the mechanism:

  1. Kadavath et al. (2022) showed that pre-RLHF base models are well calibrated on multiple-choice questions where the answer set is fixed. Self-evaluation prompts ("Is your answer correct? Yes / No") track the base model's actual accuracy.

  2. Tian et al. (EMNLP 2023) showed that RLHF substantially degrades calibration on the same multiple-choice tasks; verbalized confidence elicited via prompting partly recovers it. The OpenAI GPT-4 system card reported the same pattern: the base model was better calibrated than the post-trained one.

  3. Kalai et al. (2025) Section 4 explained why this happens. RLHF maximizes preference-model score, and preference data is built from human comparisons of answers; humans rank confident-correct answers above hedged-correct answers. The reward model inherits this preference, so KL-penalized RL trains the policy to be confident. Calibration is collateral damage from a reward shape that does not value uncertainty.

The fix is not "use better preference data." Confident-correct beats hedged-correct in any human-comparison protocol that does not actively penalize overconfidence. The fix has to come from the rubric.

The Evaluation-Incentive Equilibrium

Standard benchmarks (MMLU, GPQA, HumanEval, BBH, MATH, GSM8K) score $1$ for a correct answer and $0$ for a wrong answer or an abstention. Under this rubric, the policy that maximizes benchmark score is the policy that always guesses, even when uncertain. Kalai et al. formalize this in Section 4.

Binary grading vs behavioral calibration (Kalai et al. 2025, Observation 1)

[Figure: expected score per question vs. fraction of questions answered (top-k by confidence). Under binary grading (1 if correct, 0 otherwise) the optimum is to always answer; under the t-penalty rubric (+1 correct, $-t/(1-t)$ wrong, 0 abstain, shown for $t = 0.5$) the optimum is the calibrated policy that answers only when confidence exceeds $t$.]
Proposition

Binary grading makes abstention strictly suboptimal (Kalai-Nachum-Vempala-Zhang 2025, Observation 1)

Statement

For any binary grading rubric and any prompt where the model's confidence in some answer is $p > 0$, the expected score from emitting that answer is $p$, while the expected score from abstaining is $0$. Abstention is therefore strictly dominated by guessing whenever the model has any nonzero confidence in any candidate.

Intuition

Binary grading collapses two distinct outcomes (wrong and "I don't know") into the same score of zero. The model has no incentive to express uncertainty; expressing it is identical to being wrong. The optimal policy under such a rubric is to commit to the most likely candidate at every prompt, which is exactly the behavior we call hallucination.

Proof Sketch

Let $A$ be the event "answer is correct" and $a$ the candidate answer. The score is $\mathbf{1}[A]$ if $a$ is emitted and $0$ under abstention. The expected score from emitting $a$ is $\Pr(A) = p$; the expected score from abstaining is $0$. Whenever $p > 0$, emitting strictly dominates abstention.

Why It Matters

This reframes hallucination as an equilibrium response to the field's measurement protocol, not a property of models in isolation. A model that hallucinates on PopQA is doing the only thing the benchmark rewards. Changing the model without changing the rubric will not change the equilibrium. The leverage is on the evaluation side.

Failure Mode

Some benchmarks already include partial credit, weighted scoring, or abstention rewards (TruthfulQA's MC2 averages probability mass across correct options, for instance). The observation applies strictly to the dominant binary-graded benchmarks; it does not say all benchmarks are misdesigned.

Behavioral calibration as the proposed fix. Kalai et al. propose that benchmarks state an explicit confidence target in the prompt:

Answer only if you are more than $t$ confident, since mistakes are penalized $t/(1-t)$ points while correct answers receive $1$ point.

Under this rubric, the expected score from emitting a candidate with confidence $p$ is $p \cdot 1 + (1-p) \cdot (-t/(1-t)) = (p - t)/(1-t)$, which is positive iff $p > t$. Abstention is now optimal whenever confidence falls below the target. The model is incentivized to track its own uncertainty across thresholds; the audit asks whether reported confidence matches accuracy across the threshold sweep. This is what they call behavioral calibration: not just numerical calibration of softmax outputs, but the broader property that the model's decisions to commit or abstain track its actual reliability.

The reform is cheap. It requires changing the grading rubric and the prompt, not the model. Whether benchmarks adopt it is a coordination problem in the field, not a research problem.
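A small sketch of both rubrics' expected scores, with invented confidences chosen so the two policies diverge (the threshold and question counts are arbitrary):

```python
def expected_scores(p_list, t):
    """Expected totals under binary 0/1 grading (always answer) vs the
    confidence-target rubric (+1 correct, -t/(1-t) wrong, 0 abstain),
    taking the rubric-optimal policy of answering only when p > t."""
    binary = sum(p_list)  # always answering maximizes the 0/1 score
    penalty = t / (1 - t)
    rubric = sum(p - (1 - p) * penalty for p in p_list if p > t)
    return binary, rubric

# 60 questions the model knows (p = 0.95), 40 near coin-flips (p = 0.4).
ps = [0.95] * 60 + [0.4] * 40
binary, rubric = expected_scores(ps, t=0.5)
print(round(binary, 2), round(rubric, 2))  # 73.0 54.0
```

Under binary grading, guessing on the 40 shaky questions adds 16 points of expected score; under the $t = 0.5$ rubric each of those guesses has negative expected value, so the optimal policy abstains and the score reflects only what the model reliably knows.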

Structural Failures Beyond Calibration

Calibration explains a slice of hallucination but misses several phenomena entirely.

The reversal curse (Berglund et al. 2023). Models trained on "$A$ is $B$" do not generalize to "$B$ is $A$." On real-world celebrity examples, GPT-4 answered "Who is Tom Cruise's mother?" correctly $79\%$ of the time but answered "Who is Mary Lee Pfeiffer's son?" correctly only $33\%$ of the time, despite the same fact being involved. The asymmetry persists across model sizes and families, and it is not fixed by data augmentation. This is a structural property of next-token prediction: the gradient signal for "$A$ is $B$" updates the conditional $\hat{p}(B \mid A)$, not the conditional $\hat{p}(A \mid B)$. No amount of calibration repair touches it.

Snowballing hallucinations (Zhang, Press, Merrill, Liu, Smith 2023). When a model commits to an early wrong answer, it generates fluent supporting justification consistent with the wrong answer rather than catching the mistake. The striking finding is that the same model, asked separately to verify each claim in its own snowballed answer, recognizes the errors at $67\%$ accuracy for ChatGPT and $87\%$ for GPT-4. The model has the relevant knowledge; sequential decoding committed to a path that knowledge could have ruled out. This is a decoding-procedure phenomenon, not a knowledge gap, and it is invisible to single-token calibration analysis.

Internal-state probes (Azaria and Mitchell 2023). Linear classifiers trained on hidden-layer activations distinguish true from false statements at $71$ to $83\%$ accuracy across model layers, outperforming methods that use only output probabilities. Models often "know" they are emitting falsehoods in a representational sense but do not surface this in their decoded output. This suggests calibration is at least partly a decoding problem, not a knowledge problem, and points toward decoding-time interventions (probe-conditioned generation, abstain-on-low-probe-confidence) as a research direction.

These three findings sit awkwardly with the standard "the model does not know what it does not know" story. The reversal curse shows the model genuinely lacks a piece of structure that a human would derive automatically. Snowballing and internal-state probes show the model has knowledge that decoding fails to use. Calibration repair addresses neither.

Measurement: Atomic Decomposition

How do you score hallucination on free-form generation, where the model emits a paragraph that mixes correct and incorrect claims? The dominant approach since 2023 is to decompose generations into atomic facts and verify each.

FActScore (Min, Krishna, Lyu, Lewis, Yih, Koh, Iyyer, Zettlemoyer, Hajishirzi, EMNLP 2023). A long-form generation is split into atomic facts; each fact is verified against a reliable knowledge source. The reported score is the fraction of supported atomic facts. Their automated estimator agrees with human verification within $2\%$. On biography generation, ChatGPT scored only $58\%$, providing the first quantitative bound on what "factual" means for paragraph-length generation.

SAFE / LongFact (Wei et al., NeurIPS 2024). The Search-Augmented Factuality Evaluator extends FActScore by issuing Google searches to verify each atomic fact, scoring at $72\%$ agreement with crowdsourced human annotations and $76\%$ agreement on disagreement cases, at roughly $20\times$ lower cost than human annotation. The companion LongFact benchmark provides thousands of long-form questions across $38$ topics.

The measurement gap matters because the singleton-rate bound and the evaluation-incentive observation both make claims about "the rate" of hallucination, but until atomic decomposition there was no operational definition of that rate for free-form text. FActScore and SAFE are the de facto standards. They are not perfect (the Google-search verifier inherits Google's coverage gaps), but they make the rate measurable.
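Once a generation has been decomposed into atoms and each atom verified, the score itself is a ratio. A toy sketch with invented claims and verdicts (the decomposition and verification steps, which are the hard parts, are assumed done):

```python
def factual_precision(verdicts):
    """FActScore-style precision: supported atomic facts / all atomic facts.
    `verdicts` maps each atomic claim to its verification result."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0

# Hypothetical atomic decomposition of a generated biography snippet.
verdicts = {
    "She was born in 1946.": True,
    "She studied law at university X.": True,
    "She won a Grammy in 1989.": False,
    "She has two siblings.": False,
}
print(factual_precision(verdicts))  # 0.5
```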

Mitigation: Where Each Approach Fits

Retrieval-augmented generation (RAG). Lewis et al. (2020) introduced the standard RAG architecture: retrieve passages from a knowledge base, condition generation on them. Mallen et al. (2023) showed RAG dominates parametric memory on long-tail entities. The architecture targets the singleton-rate bound directly: by moving rare facts from parametric storage to the retrieved context, the model no longer needs to memorize them. The well-known limits: the retriever can fail; the model can ignore retrieved context (Yoran et al. 2024 quantify this); retrieved documents themselves can be wrong; and RAG does not address structural failures like the reversal curse or snowballing.

Conformal language modeling (Quach et al., ICLR 2024). Apply conformal prediction to LLM generation: sample candidates, calibrate a stopping rule on a held-out set, return a prediction set with marginal coverage guarantee. The set is small when the model is sure and large when it is not. The guarantee is distribution-free under exchangeability.

Conformal factuality (Mohri and Hashimoto 2024). A back-off algorithm: progressively make LM outputs less specific until conformal coverage holds. The reported result is $80$ to $90\%$ correctness guarantees while retaining most of the original output. This is the version of conformal that concretely targets the hallucination problem and it earns its place on this page.

The underlying guarantee is the standard split-conformal coverage bound; see the marginal-coverage theorem on the calibration page for the full statement and proof. The LM-specific contribution of Quach et al. and Mohri-Hashimoto is the construction of a non-conformity score for free-form text — token-level log-probabilities are uninformative, so both groups build scores out of semantic-equivalence checks (NLI for Quach, atomic-claim entailment for Mohri-Hashimoto). The coverage guarantee transfers; the engineering content is the score.
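The split-conformal recipe both papers build on fits in a few lines. The scores below are toy numbers, and the real engineering content (a semantic non-conformity score for free-form text) is stubbed out as one minus a candidate probability:

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration score gives >= 1-alpha marginal coverage under
    exchangeability."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(candidates, score_fn, threshold):
    """Keep every candidate whose non-conformity score clears the bar."""
    return [c for c in candidates if score_fn(c) <= threshold]

# Toy non-conformity scores from a held-out calibration set.
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_threshold(cal, alpha=0.5)
# Stub score: 1 - candidate probability (a real LM score would use
# semantic checks such as NLI or atomic-claim entailment).
kept = prediction_set([0.95, 0.6, 0.2], lambda p: 1 - p, q)
print(q, kept)  # 0.5 [0.95, 0.6]
```

Everything model-specific lives in `score_fn`; swapping in an entailment-based score changes what "non-conforming" means without touching the coverage argument.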

Watch Out

Marginal coverage is not conditional coverage on hard prompts. A factuality-conformal LLM with $90\%$ marginal coverage can have $50\%$ coverage on the long-tail entities where you most want a guarantee, balanced by $99\%$ coverage on easy ones. Per-prompt guarantees require either Mondrian conformal partitioning by a known difficulty signal or distribution-conditional methods like Romano-Patterson-Candès CQR; neither is free, and both require defining the difficulty bins. This is the gap between "the algorithm works on average" and "the algorithm works on the cases that scared you into using it."

Self-consistency and sample-then-verify. Sample $k$ generations, take the majority answer (Wang et al. 2023, ICLR). This is cheap and helps on tasks with a single correct answer; it is irrelevant on open-ended generation.
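A minimal majority-vote sketch; the sampled answers are invented, and in practice each would be the final answer extracted from an independently sampled chain of thought:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over k sampled answers; returns the winner and its
    vote share, which doubles as a crude confidence signal."""
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)

# Five invented samples for a single-answer question.
print(self_consistency(["42", "42", "17", "42", "23"]))  # ('42', 0.6)
```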

Decoding interventions from internal-state probes. Conditional generation that abstains when an Azaria-Mitchell-style probe flags low truthfulness. This direction is active but not yet standard practice.

Common Confusions

Watch Out

Hallucination is not a defect to be patched; it is a statistical floor

The Kalai singleton-rate bound says that pretraining error is bounded below by the fraction of facts in the corpus that appear exactly once. No amount of training data, model capacity, or alignment can drive base-model hallucination on the long tail to zero. The hope is to push the rate down on the head and to install honest abstention on the tail; the hope is not to eliminate it.

Watch Out

Calibration degrading after RLHF is not a tuning bug

Tian et al. (2023) and the GPT-4 system card both document the post-RLHF calibration drop. Kalai Section 4 explains the mechanism: human preference data ranks confident-correct above hedged-correct, the reward model inherits this, and the policy is trained to be confident. Better preference data does not fix this; only a rubric that rewards calibrated abstention does.

Watch Out

Hallucination is not the same as lying

Lying requires intent. A language model executes a sampled distribution over tokens; intent is not a meaningful attribute. The closer human analogue is confabulation: filling in gaps with plausible patterns. Anthropomorphizing into "the model knows it is wrong" mostly fails as a frame, with one wrinkle: Azaria-Mitchell shows internal probes do detect falsehoods at well-above-chance rates, so a weaker version of "the model has access to its own unreliability" is empirically defensible.

Watch Out

Low perplexity does not bound factual correctness

Perplexity measures how well the model predicts the next token in a held-out reference text. It does not measure whether the model can generate true statements when prompted to. A model with state-of-the-art perplexity can still fail PopQA on rare entities, fail the reverse direction of facts it was trained on, and snowball wrong answers into fluent paragraphs. The metric and the property are different.

Watch Out

Conformal prediction is not a free lunch for LLMs

Marginal coverage is not conditional coverage. A conformal LLM that achieves $90\%$ marginal coverage can have $50\%$ coverage on one slice of prompts and $99\%$ on another. The guarantee is honest about what it is and is not. For high-stakes per-input use, the relevant tool is some flavor of conditional or distributional conformal (Romano-Sesia-Candes 2019, Cauchois et al. 2024), not the marginal version.

Exercises

ExerciseCore

Problem

A pretraining corpus contains $N = 10^9$ facts (prompt-response pairs). Suppose the singleton rate is $\mathrm{sr} = 0.18$ and that all error sets satisfy $|\mathcal{E}_c| \geq 100$. Ignoring the calibration term $\delta$, give a numerical lower bound on the base model's generation error rate.

ExerciseAdvanced

Problem

A benchmark has $100$ questions. A model outputs a probability $p_i$ on its top candidate for each question $i$. Under a binary $0/1$ rubric (right $= 1$, wrong or abstain $= 0$), the expected score from always answering with the top candidate is $\sum_i p_i$. Now consider Kalai's confidence-target rubric: the prompt sets a threshold $t$, the model gets $1$ for correct, $-t/(1-t)$ for wrong, and $0$ for abstention.

(a) Write the model's expected score under the confidence-target rubric as a function of $\{p_i\}$ and the abstention decisions, assuming the model abstains when $p_i < t$ and emits its top candidate when $p_i \geq t$.

(b) Show that this abstention threshold is the optimal policy given the rubric, that is, no other abstention rule yields a higher expected score.

(c) Pick $t = 0.7$ and a distribution where $50$ questions have $p_i = 0.9$ and $50$ have $p_i = 0.5$. Compute the expected score under both rubrics. Which produces a more honest measurement of the model's calibrated knowledge?

ExerciseAdvanced

Problem

The reversal curse (Berglund et al. 2023) says a model trained on "$A$ is $B$" does not learn to predict "$A$" given "$B$." Argue, from the structure of the cross-entropy loss on autoregressive next-token prediction, why this is the expected behavior. Specifically: write the loss contributed by the sentence "Mary Lee Pfeiffer is the mother of Tom Cruise" and identify which conditional distributions the gradient updates and which it does not. Then describe an architectural or training-data intervention that would address the asymmetry without breaking standard left-to-right generation.

ExerciseResearch

Problem

Design an evaluation protocol that measures behavioral calibration in the Kalai sense for a frontier LLM on a long-tail factual QA task. Your protocol must:

(a) Use the explicit-confidence-target rubric across multiple thresholds $t \in \{0.5, 0.7, 0.9\}$.

(b) Distinguish models that hedge appropriately (high score at $t = 0.9$) from models that hedge indiscriminately.

(c) Be auditable by a third party without access to the model weights.

(d) Produce a single summary statistic that ranks models meaningfully.

Address: how do you communicate the rubric to the model? How do you measure the model's effective threshold? What is the failure mode of your protocol?

Frequently Asked Questions

Why do LLMs hallucinate?
Two reasons. First, a statistical lower bound: Kalai-Nachum-Vempala-Zhang (2025) prove the base-model error rate on rare facts is at least the singleton rate, the fraction of facts that appear exactly once in training. Second, an incentive equilibrium: every standard benchmark scores abstention as wrong, so post-training optimizes for confident bluffing rather than honest uncertainty.
Will scaling fix hallucinations?
No, not entirely. Scaling reduces the singleton rate by repeating more facts, but the long tail of rare facts persists at any finite training set. The deeper issue: scaling does not change the benchmark incentive structure that rewards bluffing. Both halves of the problem need to move.
Does RLHF make hallucinations better or worse?
Worse on calibration, often. RLHF fine-tunes against human preference signals; if humans (or learned reward models) reward confident answers more than hedged ones, RLHF amplifies bluffing. This is a documented failure mode (Lin et al. 2022, Sharma et al. 2023). The fix is to include explicit calibration objectives, not assume preference data captures calibration.
How is conformal prediction useful here?
Conformal prediction gives distribution-free coverage guarantees: a confidence set with provable probability of containing the true answer. It does not reduce hallucination per se, but it lets a deployed system fail safely by abstaining when the conformal set is too large. It addresses the symptom (deploy-time consequences) rather than the cause.
What problems can calibration not solve?
Structural failures. The reversal curse (knowing 'A is B' does not imply knowing 'B is A'), snowballing within a single chain (one early error compounds across reasoning steps), and long-tail entity confusion are not calibration problems. Even a perfectly calibrated model exhibits them. These need retrieval, tool use, or architectural changes.

References

  • Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang. Why Language Models Hallucinate. September 2025. The Is-It-Valid reduction (Section 3.1, Theorem 1), the singleton-rate lower bound on pretraining error (Section 3.3.1, Theorem 2), and the binary-grading-incentive observation that explains why post-training does not fix hallucination (Section 4). arXiv:2509.04664
  • Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". 2023. Forward 79% vs reverse 33% on celebrity examples; persists across model sizes; not fixed by data augmentation. arXiv:2309.12288
  • Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith. How Language Model Hallucinations Can Snowball. 2023. Models over-commit to early mistakes and fluently elaborate on them; the same models recognize the errors when asked separately at 67% (ChatGPT) and 87% (GPT-4). arXiv:2305.13534
  • Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023. Introduces PopQA (14k entity questions stratified by popularity); scaling fails to improve long-tail accuracy; retrieval-augmented smaller models beat much larger parametric ones on rare entities. arXiv:2212.10511
  • Amos Azaria, Tom Mitchell. The Internal State of an LLM Knows When It's Lying. 2023. Hidden-layer linear probes distinguish true from false statements at 71-83% accuracy, outperforming probability-based methods. arXiv:2304.13734
  • Saurav Kadavath et al. Language Models (Mostly) Know What They Know. 2022. Pre-RLHF base models are calibrated on multiple-choice tasks; self-evaluation prompts track actual accuracy. arXiv:2207.05221
  • Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023. Documents post-RLHF calibration degradation; verbalized confidence prompting partially recovers it. arXiv:2305.14975
  • Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. Atomic-fact decomposition metric for long-form generation; automated estimator within 2% of human verification; ChatGPT scores 58% on biographies. arXiv:2305.14251
  • Jerry Wei et al. Long-form Factuality in Large Language Models. NeurIPS 2024. SAFE (Search-Augmented Factuality Evaluator) verifies each atomic fact via Google search; LongFact benchmark (38 topics); 72% agreement with crowdsourced annotators at 20x lower cost. arXiv:2403.18802
  • Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, Regina Barzilay. Conformal Language Modeling. ICLR 2024. Calibrated sampling of candidate sets for free-form generation with marginal coverage guarantee. arXiv:2306.10193
  • Christopher Mohri, Tatsunori Hashimoto. Language Models with Conformal Factuality Guarantees. 2024. A back-off algorithm delivering 80-90% correctness guarantees while retaining most of the original LM output, by progressively making outputs less specific. arXiv:2402.10978
  • Patrick Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. The original RAG architecture combining a parametric generator with a non-parametric retriever. arXiv:2005.11401
  • Morris H. DeGroot, Stephen E. Fienberg. The Comparison and Evaluation of Forecasters. The Statistician 32(1-2), 1983. Brier score calibration-refinement decomposition.
  • Tilmann Gneiting, Adrian E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association 102(477), 2007. General theory of proper scoring rules and their decompositions.
  • Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. On Calibration of Modern Neural Networks. ICML 2017. The standard binned ECE estimator; temperature scaling as a post-hoc fix. arXiv:1706.04599
  • Vladimir Vovk, Alex Gammerman, Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. The original conformal prediction monograph.
  • Anastasios N. Angelopoulos, Stephen Bates. Conformal Prediction: A Gentle Introduction. Foundations and Trends in Machine Learning 16(4), 2023. Practitioner-friendly survey. arXiv:2107.07511
  • Ziwei Ji et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2023. Taxonomy and broad literature survey predating the Kalai et al. statistical analysis.

Next Topics

  • RLHF and alignment: why RLHF reduces blatant hallucination but does not target the singleton-rate floor
  • RLHF deep dive: the reward-shape mechanism behind post-training calibration degradation
  • Mechanistic interpretability: can we localize the circuits that hidden-state truth probes are reading?
  • Reward hacking: the broader Goodhart frame for why proxy optimization fails

Last reviewed: April 19, 2026
