AI Safety
Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Why This Matters
A model that reports 90% confidence but is correct only 60% of the time is dangerous. In medical diagnosis, autonomous driving, and financial risk assessment, decisions depend not just on predictions but on how much to trust those predictions. Calibration is the property that makes predicted probabilities meaningful.
Modern neural networks are often poorly calibrated: they tend to be overconfident. A ResNet trained on ImageNet may assign 95% confidence to predictions that are correct only 80% of the time. Post-hoc calibration methods fix this cheaply, and conformal prediction provides distribution-free coverage guarantees without any assumptions about the model. See also proper scoring rules for the theory of what makes a calibration metric valid.
Mental Model
Think of calibration as a contract between the model and the user. If the model says "I am 70% sure this is a cat," then across all images where the model says 70%, roughly 70% should actually be cats. If the model systematically overestimates or underestimates its confidence, the contract is broken.
Uncertainty quantification goes further: it asks not just "how confident?" but "what is the set of plausible answers?" Conformal prediction answers this by constructing prediction sets that are guaranteed to contain the true answer with a user-specified probability.
Formal Setup and Notation
Let $f : \mathcal{X} \to \Delta^{K-1}$ be a classifier that outputs a probability vector over $K$ classes. Let $\hat{y} = \arg\max_k f(x)_k$ be the predicted class and $\hat{p} = \max_k f(x)_k$ the associated confidence.
Perfect Calibration
A classifier is perfectly calibrated if and only if for all confidence levels $p \in [0, 1]$:

$$\Pr(\hat{y} = y \mid \hat{p} = p) = p$$

That is, among all predictions made with confidence $p$, exactly a fraction $p$ are correct.
Expected Calibration Error (ECE)
Partition predictions into $M$ bins $B_1, \dots, B_M$ by confidence level. The expected calibration error is:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where $\mathrm{acc}(B_m)$ is the accuracy within bin $B_m$ and $\mathrm{conf}(B_m)$ is the average confidence within that bin.
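The binned definition is a few lines of NumPy. This is an illustrative sketch only: equal-width, right-closed bins are one of several conventions, and the result depends on the bin count, which is exactly the binning sensitivity discussed below.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Plug-in top-label ECE with equal-width confidence bins.

    confidences: max softmax probability per prediction, shape (n,)
    correct: 1.0 where the argmax prediction was right, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # empirical accuracy in the bin
            conf = confidences[in_bin].mean()   # mean confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece

# 1000 predictions, all at confidence 0.9, 820 of them correct:
gap = expected_calibration_error(
    np.full(1000, 0.9),
    np.concatenate([np.ones(820), np.zeros(180)]),
)
# gap ≈ |0.82 - 0.90| = 0.08
```

With all predictions in a single bin, the ECE reduces to the absolute gap between bin accuracy and bin confidence.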
Top-label ECE vs Classwise ECE
The ECE above is top-label ECE: it only audits the predicted class's confidence $\hat{p}$ against the indicator that $\hat{y} = y$. A model may pass top-label ECE yet still misrepresent probabilities on the non-argmax classes. Classwise ECE (Nixon et al., 2019) fixes this by computing a per-class ECE, treating each class $k$ as a one-vs-rest binary problem with scores $f(x)_k$ and labels $\mathbb{1}[y = k]$, then averaging across classes. Vaicenavicius et al. (2019) formalizes the distinction between full calibration (joint over the simplex) and marginal notions like top-label calibration.
Reliability Diagram
A reliability diagram (Niculescu-Mizil and Caruana, 2005) is the graphical counterpart of ECE. Bin predictions by confidence, then plot mean predicted probability on the horizontal axis against empirical accuracy on the vertical axis. Perfect calibration traces the diagonal $y = x$; a curve below the diagonal indicates overconfidence, a curve above indicates underconfidence. The vertical gap between the curve and the diagonal, weighted by bin mass, is exactly the ECE summand.
Brier Score
For binary outcomes with predicted probability $p_i$ and label $y_i \in \{0, 1\}$, the Brier score (Brier, 1950) is:

$$\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$

It is a strictly proper scoring rule: its expectation is minimized only by the true conditional probability. Murphy (1973) shows the decomposition $\mathrm{BS} = \mathrm{Reliability} - \mathrm{Resolution} + \mathrm{Uncertainty}$, where the reliability term is a calibration penalty and the resolution term rewards informativeness. Unlike ECE, the Brier score is binning-free.
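A sketch of the score and the Murphy decomposition. The decomposition identity is exact when forecasts take finitely many distinct values, so this version groups by unique forecast value rather than by bins; for continuous forecasts the binned version only holds approximately.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared distance between forecast probability and binary outcome."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def murphy_decomposition(p, y):
    """Reliability, resolution, uncertainty terms (Murphy, 1973).

    BS = REL - RES + UNC holds exactly for discrete forecast values.
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    base_rate = y.mean()
    unc = base_rate * (1.0 - base_rate)        # outcome variance, model-free
    rel = res = 0.0
    for v in np.unique(p):
        in_group = p == v
        obs = y[in_group].mean()               # observed frequency at forecast v
        rel += in_group.mean() * (v - obs) ** 2           # calibration penalty
        res += in_group.mean() * (obs - base_rate) ** 2   # informativeness reward
    return rel, res, unc
```

A forecaster that always emits the base rate has zero reliability penalty but also zero resolution: calibrated yet uninformative.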
ECE is a biased, binning-dependent estimator
The plug-in ECE is a biased, inconsistent estimator of the population calibration error and depends sensitively on the binning scheme. Kumar, Liang, and Ma (2019) propose a debiased ECE estimator with explicit finite-sample bias correction and convergence guarantees. Naeini, Cooper, and Hauskrecht (2015) propose Bayesian binning into quantiles (BBQ) as a robust alternative that averages over binning structures. When reporting calibration numbers, prefer Brier score, debiased ECE, or kernel calibration error over the raw plug-in ECE.
Which Uncertainty Question Are You Asking?
| Question | Object you need | Good first tool | Failure mode |
|---|---|---|---|
| Are predicted probabilities numerically meaningful? | Calibrated probabilities | Reliability diagram, Brier score, temperature scaling | Good accuracy can hide overconfidence |
| Which label set should contain the truth with 95% coverage? | Prediction set | Split conformal prediction | Marginal coverage can hide subgroup undercoverage |
| Is this input far from training data? | Shift or epistemic signal | Deep ensembles, OOD detector, domain monitoring | MC dropout can underestimate uncertainty |
| What threshold should trigger action? | Decision rule under costs | Cost curve, precision-recall, calibrated probabilities | A calibrated model still needs a utility choice |
Calibration is not a single number. It connects a probability model, a metric, and a decision. A hospital triage threshold, a weather forecast, and a classification benchmark ask different questions.
Calibration Methods
Temperature Scaling
The simplest post-hoc calibration method. Given logits $z \in \mathbb{R}^K$, apply a single scalar temperature $T > 0$:

$$\hat{q}_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

When $T > 1$, the softmax output becomes softer (less confident). When $T < 1$, it becomes sharper. Key properties:
- One-parameter post-hoc fit. Only $T$ is learned; the model is frozen.
- Objective. $T$ is optimized on a held-out calibration set by minimizing the negative log-likelihood (NLL). No retraining of the backbone occurs.
- Argmax preserved. Dividing every logit by the same positive scalar is a monotone, order-preserving transformation of the softmax output, so $\arg\max_k \hat{q}_k$ is unchanged. Therefore top-1 accuracy is exactly unchanged; only the confidence distribution is rescaled.
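A minimal sketch of the fit, assuming a held-out array of logits and integer labels. The NLL is smooth in $T$, so a dense log-spaced grid search is a dependable stand-in for the usual L-BFGS optimization:

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)      # subtract row max for stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """One-parameter post-hoc fit: pick T minimizing held-out NLL."""
    grid = np.geomspace(0.05, 20.0, 500)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

Because every logit is divided by the same positive scalar, `logits.argmax(1)` is identical before and after scaling; only the confidences move.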
Platt Scaling
For binary classification, fit a logistic regression on the scores (Platt, 1999, originally proposed for SVM outputs):

$$P(y = 1 \mid s) = \sigma(As + B) = \frac{1}{1 + e^{-(As + B)}}$$

(Platt's paper writes the equivalent form $1/(1 + e^{A's + B'})$ with flipped signs.) Here $s$ is the model's score (for SVMs, the margin; for neural networks, the logit) and $A, B$ are learned by minimizing NLL on a held-out set. Platt scaling is a two-parameter logistic fit. Contrast with the other standard post-hoc calibrators:
- Platt scaling. Two parameters $A, B$, logistic form, applied to scalar scores in binary classification (Platt, 1999).
- Temperature scaling. One parameter $T$, applied to the full logit vector in multiclass classification; a restriction of Platt scaling with $A = 1/T$ and $B = 0$ (Guo et al., 2017).
- Isotonic regression. Non-parametric monotone fit; learns any non-decreasing map from score to calibrated probability (Zadrozny and Elkan, 2002). More flexible but prone to overfitting on small calibration sets.
- Dirichlet calibration. Multiclass generalization that fits a full linear map on log-probabilities, capturing classwise miscalibration (Kull et al., 2019).
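An illustrative sketch of the binary Platt fit on hypothetical data, using the $\sigma(As + B)$ convention. Full-batch gradient descent on the NLL stands in for the Newton-style optimizer in Platt's paper:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(A*s + B) by gradient descent on held-out NLL."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    A, B = 1.0, 0.0                           # identity map as the start point
    for _ in range(n_iter):
        residual = sigmoid(A * s + B) - y     # dNLL/d(linear term), per example
        A -= lr * np.mean(residual * s)
        B -= lr * np.mean(residual)
    return A, B
```

With enough held-out data, the fitted $(A, B)$ recover the true sigmoid relationship between score and label frequency, which is the consistency claim in the theorem below.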
Platt Scaling Calibration
Statement
If the true calibration function (mapping scores to correct-label probability) is a sigmoid, then Platt scaling with parameters $A, B$ fitted by maximum likelihood on the calibration set recovers the true calibration function as the calibration set size grows. The calibrated probabilities satisfy $\hat{P}(y = 1 \mid s) \to P(y = 1 \mid s)$ in probability.
Intuition
Platt scaling reparameterizes the model's confidence through a learned sigmoid. If the model's logits are monotonically related to true probability (which is common for well-trained models), a simple two-parameter fit corrects the miscalibration.
Why It Matters
Platt scaling is the standard calibration method for SVMs and is widely used as a baseline for neural network calibration. It requires only a small validation set and adds negligible computational cost.
Conformal Prediction
Conformal prediction is a structurally different approach: instead of calibrating point predictions, it constructs prediction sets with guaranteed coverage.
Nonconformity Score
A nonconformity score $s(x, y)$ measures how unusual the pair $(x, y)$ is relative to the model. A common choice for classification:

$$s(x, y) = 1 - f(x)_y$$

where $f(x)_y$ is the model's predicted probability for class $y$. High scores mean the model finds this label surprising.
Split Conformal Coverage Guarantee
Statement
Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be calibration data and $(X_{n+1}, Y_{n+1})$ a new test point, with all $n+1$ pairs exchangeable. Compute nonconformity scores $s_i = s(X_i, Y_i)$ for $i = 1, \dots, n$ using a model fit on a disjoint training split. Define

$$\hat{q} = s_{(k)}, \qquad k = \lceil (n+1)(1-\alpha) \rceil,$$

where $s_{(1)} \le \dots \le s_{(n)}$ are the order statistics of the calibration scores. Equivalently, $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ sample quantile of $s_1, \dots, s_n$: the index $k$ identifies which order statistic to pick, while the corresponding quantile level is $k/n$. When $k > n$ (which occurs for very small $n$ or very small $\alpha$), set $\hat{q} = \infty$ and the prediction set is trivial. Then the prediction set

$$C(X_{n+1}) = \{\, y : s(X_{n+1}, y) \le \hat{q} \,\}$$

satisfies the marginal coverage bound

$$\Pr\{Y_{n+1} \in C(X_{n+1})\} \ge 1 - \alpha.$$
This is the split (or inductive) conformal formulation of Vovk et al. (2005), sharpened in the form used by Romano, Patterson, and Candès (2019) and the Angelopoulos and Bates (2021) tutorial.
Intuition
If the test point is exchangeable with the calibration data, its nonconformity score is equally likely to fall anywhere in the ranking. By choosing the threshold as the right quantile, we guarantee coverage. No assumptions about the model or data distribution are needed.
Proof Sketch
By exchangeability, the rank of $s_{n+1} = s(X_{n+1}, Y_{n+1})$ among $s_1, \dots, s_{n+1}$ is uniformly distributed over $\{1, \dots, n+1\}$. The probability that $s_{n+1}$ exceeds the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score is at most $\alpha$. Therefore $Y_{n+1} \in C(X_{n+1})$ with probability at least $1 - \alpha$.
Why It Matters
Conformal prediction is the only widely-used method that provides distribution-free, finite-sample coverage guarantees. It works with any model (neural networks, random forests, LLMs) and any data type. The guarantee holds without assuming the model is correct or well-calibrated.
Failure Mode
The marginal coverage guarantee does not ensure conditional coverage: the set may be too large for easy inputs and too small for hard inputs. Achieving approximate conditional coverage is an active research area. Also, the prediction sets can be large if the underlying model is poor.
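The split procedure above is short enough to sketch in full. This toy example uses the $1 - f(x)_y$ score with a well-specified model (the model's probabilities equal the label-generating probabilities), which is an assumption of the demo, not of the guarantee:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score, q-hat."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                       # too few calibration points: trivial set
        return np.inf
    return np.sort(cal_scores)[k - 1]

def prediction_set(class_probs, qhat):
    """Labels whose nonconformity score 1 - f(x)_y falls below the threshold."""
    return np.where(1.0 - np.asarray(class_probs) <= qhat)[0]

rng = np.random.default_rng(0)

def sample_labels(probs):
    """Draw one label per row from each row's categorical distribution."""
    return (probs.cumsum(axis=1) < rng.random((len(probs), 1))).sum(axis=1)

# Calibration split: score the true label of each calibration point.
cal_probs = rng.dirichlet(np.ones(5), size=2000)
cal_y = sample_labels(cal_probs)
qhat = conformal_threshold(1.0 - cal_probs[np.arange(2000), cal_y], alpha=0.1)

# Fresh test split: empirical coverage should land near 1 - alpha = 0.9.
test_probs = rng.dirichlet(np.ones(5), size=5000)
test_y = sample_labels(test_probs)
coverage = np.mean(1.0 - test_probs[np.arange(5000), test_y] <= qhat)
```

Note the guarantee is marginal over the randomness in both splits; a single run's coverage fluctuates around $1 - \alpha$.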
MC Dropout for Uncertainty
Monte Carlo (MC) dropout provides a practical approximation to predictive uncertainty. At inference time, run the model $T$ times with dropout enabled. The predictions $\hat{y}^{(1)}, \dots, \hat{y}^{(T)}$ are treated as samples from a stochastic prediction function.

The sample mean gives a point prediction. The sample variance gives an uncertainty estimate:

$$\bar{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}^{(t)}, \qquad \hat{\sigma}^2_{\mathrm{MC}} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat{y}^{(t)} - \bar{y} \right)^2$$
MC dropout variance is not a posterior variance
The quantity $\hat{\sigma}^2_{\mathrm{MC}}$ is a predictive Monte Carlo estimate computed over random dropout masks, not a posterior variance. Gal and Ghahramani (2016) derive a variational-Bayesian interpretation under a specific Gaussian-process prior and a fixed dropout rate; outside that construction, MC dropout variance is a heuristic. Osband (2016) distinguishes risk (inherent noise in $p(y \mid x)$) from epistemic uncertainty (lack of knowledge about model parameters) and shows that MC dropout captures neither cleanly. Hron, Matthews, and Ghahramani (2017) prove that the variational family induced by dropout is miscalibrated as an approximate posterior and systematically underestimates epistemic uncertainty on out-of-distribution inputs. Treat MC dropout as a cheap baseline for predictive spread, not as a certified Bayesian method.
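A toy NumPy sketch makes the caveat concrete. The weights are random placeholders standing in for a trained regression network; the only point is the mechanics of keeping dropout on at inference and reading off mean and spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for a tiny 2-layer regression net (not a trained model).
W1, b1 = rng.normal(size=(4, 32)) * 0.5, np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)) * 0.5, np.zeros(1)

def forward(x, drop_rate=0.2):
    """One stochastic pass with inverted dropout left ON at inference."""
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate      # fresh mask every call
    h = h * mask / (1.0 - drop_rate)            # inverted-dropout rescaling
    return h @ W2 + b2

def mc_dropout_predict(x, T=200):
    """T stochastic passes: sample mean as prediction, sample variance as spread."""
    samples = np.stack([forward(x) for _ in range(T)])   # shape (T, n, 1)
    return samples.mean(axis=0), samples.var(axis=0)
```

The returned variance is literally the spread of outputs over dropout masks; nothing in the computation touches a posterior over parameters.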
Deep Ensembles
Deep ensembles (Lakshminarayanan, Pritzel, and Blundell, 2017) train $M$ independent networks with different random initializations and data shuffling, then average the predicted distributions:

$$p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(y \mid x)$$
Despite the naive formulation, deep ensembles are a strong uncertainty baseline: Ovadia et al. (2019), "Can you trust your model's uncertainty?" (NeurIPS), benchmarks MC dropout, temperature scaling, stochastic variational inference, and ensembles under dataset shift and finds deep ensembles consistently best on both accuracy and calibration as the shift grows. For AI-safety-adjacent deployment, ensembles provide the most reliable out-of-distribution signal per unit of engineering effort, at the cost of train and inference compute.
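The averaging step is trivial but the design choice matters: average the members' probability vectors, not their logits, so that disagreement between confident members flattens the ensemble prediction and shows up as high predictive entropy. A minimal sketch with toy logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Average the M members' probability vectors; shape (M, n, K) -> (n, K)."""
    probs = softmax(np.asarray(member_logits, dtype=float))
    return probs.mean(axis=0)

# Two confident members that disagree yield an uncertain ensemble:
p = ensemble_predict([[[10.0, 0.0]], [[0.0, 10.0]]])
# p[0] ≈ [0.5, 0.5]: disagreement reads as uncertainty
```

Averaging logits instead would keep the prediction confident even when members disagree, discarding exactly the out-of-distribution signal that makes ensembles useful.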
Conformal Variants
Beyond vanilla split conformal, two variants are important in practice:
- Adaptive prediction sets (APS). Romano, Sesia, and Candès (2020) replace the nonconformity score with a cumulative score that sums the sorted class probabilities down to the true class. The resulting sets adapt their size to the difficulty of the input: confident inputs receive singletons, ambiguous inputs receive larger sets. APS retains the marginal coverage guarantee and is designed to improve conditional coverage; empirically the conditional miscoverage gap is much smaller than for naive split conformal, but APS does not satisfy exact conditional coverage as a theorem (which is known to be impossible in finite samples without further assumptions, per Vovk 2012 and Lei and Wasserman 2014).
- Adaptive conformal inference under distribution shift. Gibbs and Candès (2021) propose an online procedure that updates the miscoverage level based on past coverage errors, preserving long-run coverage even when exchangeability fails due to drift. This is the method of choice when the test-time distribution is non-stationary.
Deployment Questions
- Is the calibration set independent of the training and model-selection process?
- Does the deployment population match the calibration population, or is there covariate shift, label shift, or concept drift?
- Are you auditing top-label calibration only, or do non-argmax probabilities matter for downstream decisions?
- Does the user need a probability, a ranked list, or a conformal prediction set?
- What happens when the model abstains, returns a large conformal set, or gives a low-confidence answer?
Common Confusions
Calibration is not accuracy
A model can be perfectly calibrated but have low accuracy (it just knows what it does not know). Conversely, a model can be highly accurate but poorly calibrated (it is always overconfident). Calibration and accuracy are independent properties. You want both.
Conformal prediction does not fix bad models
Conformal prediction guarantees coverage regardless of model quality, but the prediction sets will be large if the model is poor. A random classifier with conformal prediction will produce prediction sets that include almost all classes. The guarantee is real, but the sets are only useful if the model has reasonable discriminative power.
Top-label ECE hides classwise miscalibration
Standard ECE audits only the predicted class's probability. A model can look well-calibrated under top-label ECE while systematically misrepresenting probabilities on the other classes. For multiclass reliability (e.g., medical triage with non-trivial costs on the non-argmax classes), report classwise ECE (Nixon et al., 2019) alongside Brier score.
Conformal prediction is coverage, not probability calibration
A conformal set with 95% marginal coverage does not mean each included label has 95% probability, and it does not imply that the model's softmax scores are calibrated. Conformal prediction wraps a model with a coverage guarantee; probability calibration asks whether the scores themselves match empirical frequencies.
Summary
- Calibration means predicted probabilities match empirical frequencies.
- Reliability diagrams plot mean predicted probability against observed frequency; the diagonal is perfect calibration.
- ECE is binning-dependent and biased; use Brier score or debiased ECE (Kumar-Liang-Ma, 2019) when reporting numbers.
- Temperature scaling is a one-parameter NLL fit on logits that preserves argmax. Platt scaling is the two-parameter binary analogue. Isotonic regression is the non-parametric monotone alternative. Dirichlet calibration extends Platt to the multiclass simplex.
- Conformal prediction gives distribution-free finite-sample coverage: $\Pr\{Y_{n+1} \in C(X_{n+1})\} \ge 1 - \alpha$. The split threshold is the $k$-th order statistic with $k = \lceil (n+1)(1-\alpha) \rceil$.
- Deep ensembles (Lakshminarayanan et al., 2017) are the strongest simple UQ baseline under shift; MC dropout variance is a heuristic that can underestimate epistemic uncertainty.
- Calibration matters most in high-stakes settings: medicine, autonomy, finance, AI safety.
- The deployment decision still needs a cost model or abstention policy; a calibrated probability is an input, not the final decision.
Exercises
Problem
A model makes 1000 predictions, each with confidence 0.9. Of these, 820 are correct. Is the model well-calibrated at the 90% confidence level? Compute the calibration gap for this bin.
Problem
You have a calibration set of $n$ examples and want conformal prediction sets with coverage $1 - \alpha$. Which order statistic of the nonconformity scores do you use as the threshold, and what sample quantile level does that correspond to? If you increase $n$ to 5000, how does the threshold change?
Problem
Conformal prediction guarantees marginal coverage $\Pr\{Y \in C(X)\} \ge 1 - \alpha$ but not conditional coverage $\Pr\{Y \in C(X) \mid X = x\} \ge 1 - \alpha$. Construct an example where marginal coverage holds at 95% but conditional coverage fails badly for a specific subgroup.
References
Canonical calibration:
- Brier, "Verification of Forecasts Expressed in Terms of Probability" (Monthly Weather Review, 1950). Original Brier score.
- Murphy, "A New Vector Partition of the Probability Score" (J. Applied Meteorology, 1973). Reliability-resolution-uncertainty decomposition.
- Platt, "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods" (1999). Sigmoid post-hoc calibration.
- Zadrozny and Elkan, "Transforming Classifier Scores into Accurate Multiclass Probability Estimates" (KDD 2002). Isotonic regression calibration.
- Niculescu-Mizil and Caruana, "Predicting Good Probabilities with Supervised Learning" (ICML 2005). Reliability diagrams and calibration comparison.
- Guo, Pleiss, Sun, and Weinberger, "On Calibration of Modern Neural Networks" (ICML 2017). Temperature scaling.
ECE refinements and multiclass:
- Naeini, Cooper, and Hauskrecht, "Obtaining Well Calibrated Probabilities Using Bayesian Binning" (AAAI 2015). BBQ estimator.
- Kull et al., "Beyond Temperature Scaling: Obtaining Well-Calibrated Multi-class Probabilities with Dirichlet Calibration" (NeurIPS 2019).
- Nixon, Dusenberry, Zhang, Jerfel, and Tran, "Measuring Calibration in Deep Learning" (CVPR Workshops 2019). Classwise and adaptive ECE.
- Kumar, Liang, and Ma, "Verified Uncertainty Calibration" (NeurIPS 2019). Debiased ECE with convergence analysis.
- Vaicenavicius et al., "Evaluating Model Calibration in Classification" (AISTATS 2019). Full vs top-label calibration.
Conformal prediction:
- Vovk, Gammerman, and Shafer, Algorithmic Learning in a Random World (Springer, 2005), Chs 2-4. Foundational conformal theory.
- Romano, Patterson, and Candès, "Conformalized Quantile Regression" (NeurIPS 2019). Modern split-conformal formulation.
- Romano, Sesia, and Candès, "Classification with Valid and Adaptive Coverage" (NeurIPS 2020). Adaptive prediction sets (APS).
- Gibbs and Candès, "Adaptive Conformal Inference Under Distribution Shift" (NeurIPS 2021).
- Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" (arXiv:2107.07511, 2021). Tutorial statement of split conformal used here.
- Barber, Candès, Ramdas, and Tibshirani, "Predictive inference with the jackknife+" (Annals of Statistics, 2021). Finite-sample predictive intervals from resampling.
Bayesian and ensemble UQ:
- Gal and Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016). MC dropout.
- Osband, "Risk vs Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout" (NeurIPS Workshop 2016).
- Lakshminarayanan, Pritzel, and Blundell, "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" (NeurIPS 2017).
- Hron, Matthews, and Ghahramani, "Variational Gaussian Dropout is not Bayesian" (arXiv:1711.02989, 2017).
- Ovadia et al., "Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift" (NeurIPS 2019).
Next Topics
The natural next steps from calibration and uncertainty:
- Red-teaming and adversarial evaluation: testing whether models fail gracefully when calibration and uncertainty estimates are pushed to their limits
Last reviewed: April 25, 2026