AI Safety
Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Why This Matters
A model that reports 90% confidence but is correct only 60% of the time is dangerous. In medical diagnosis, autonomous driving, and financial risk assessment, decisions depend not just on predictions but on how much to trust those predictions. Calibration is the property that makes predicted probabilities meaningful.
Modern neural networks are often poorly calibrated: they tend to be overconfident. A ResNet trained on ImageNet may assign 95% confidence to predictions that are correct only 80% of the time. Post-hoc calibration methods fix this cheaply, and conformal prediction provides distribution-free coverage guarantees without any assumptions about the model. See also proper scoring rules for the theory of what makes a calibration metric valid.
Mental Model
Think of calibration as a contract between the model and the user. If the model says "I am 70% sure this is a cat," then across all images where the model says 70%, roughly 70% should actually be cats. If the model systematically overestimates or underestimates its confidence, the contract is broken.
Uncertainty quantification goes further: it asks not just "how confident?" but "what is the set of plausible answers?" Conformal prediction answers this by constructing prediction sets that are guaranteed to contain the true answer with a user-specified probability.
Formal Setup and Notation
Let $f : \mathcal{X} \to \Delta^{K-1}$ be a classifier that outputs a probability vector over $K$ classes. Let $\hat{y} = \arg\max_k f(x)_k$ be the predicted class and $\hat{p} = \max_k f(x)_k$ the associated confidence.
Perfect Calibration
A classifier is perfectly calibrated if and only if for all confidence levels $p \in [0, 1]$:

$$\Pr(\hat{y} = y \mid \hat{p} = p) = p$$

That is, among all predictions made with confidence $p$, exactly a fraction $p$ are correct.
Expected Calibration Error (ECE)
Partition predictions into $M$ bins $B_1, \dots, B_M$ by confidence level. The expected calibration error is:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where $\mathrm{acc}(B_m)$ is the accuracy within bin $B_m$ and $\mathrm{conf}(B_m)$ is the average confidence within that bin.
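The binned definition is a few lines of NumPy. This is an illustrative sketch only: equal-width, right-closed bins are one of several conventions, and the result depends on the bin count, which is exactly the binning sensitivity discussed below.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Plug-in top-label ECE with equal-width confidence bins.

    confidences: max softmax probability per prediction, shape (n,)
    correct: 1.0 where the argmax prediction was right, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # empirical accuracy in the bin
            conf = confidences[in_bin].mean()   # mean confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece

# 1000 predictions, all at confidence 0.9, 820 of them correct:
gap = expected_calibration_error(
    np.full(1000, 0.9),
    np.concatenate([np.ones(820), np.zeros(180)]),
)
# gap ≈ |0.82 - 0.90| = 0.08
```

With all predictions in a single bin, the ECE reduces to the absolute gap between bin accuracy and bin confidence.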
Top-label ECE vs Classwise ECE
The ECE above is top-label ECE: it only audits the predicted class's confidence $\hat{p}$ against the indicator that $\hat{y} = y$. A model may pass top-label ECE yet still misrepresent probabilities on the non-argmax classes. Classwise ECE (Nixon et al., 2019) fixes this by computing a per-class ECE, treating each class $k$ as a one-vs-rest binary problem with scores $f(x)_k$ and labels $\mathbb{1}[y = k]$, then averaging across classes. Vaicenavicius et al. (2019) formalizes the distinction between full calibration (joint over the simplex) and marginal notions like top-label calibration.
Reliability Diagram
A reliability diagram (Niculescu-Mizil and Caruana, 2005) is the graphical counterpart of ECE. Bin predictions by confidence, then plot mean predicted probability on the horizontal axis against empirical accuracy on the vertical axis. Perfect calibration traces the diagonal $y = x$; a curve below the diagonal indicates overconfidence, a curve above indicates underconfidence. The vertical gap between the curve and the diagonal, weighted by bin mass, is exactly the ECE summand.
Brier Score
For binary outcomes with predicted probability $p_i$ and label $y_i \in \{0, 1\}$, the Brier score (Brier, 1950) is:

$$\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$

It is a strictly proper scoring rule: its expectation is minimized only by the true conditional probability. Murphy (1973) shows the decomposition $\mathrm{BS} = \mathrm{Reliability} - \mathrm{Resolution} + \mathrm{Uncertainty}$, where the reliability term is a calibration penalty and the resolution term rewards informativeness. Unlike ECE, the Brier score is binning-free.
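A sketch of the score and the Murphy decomposition. The decomposition identity is exact when forecasts take finitely many distinct values, so this version groups by unique forecast value rather than by bins; for continuous forecasts the binned version only holds approximately.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared distance between forecast probability and binary outcome."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def murphy_decomposition(p, y):
    """Reliability, resolution, uncertainty terms (Murphy, 1973).

    BS = REL - RES + UNC holds exactly for discrete forecast values.
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    base_rate = y.mean()
    unc = base_rate * (1.0 - base_rate)        # outcome variance, model-free
    rel = res = 0.0
    for v in np.unique(p):
        in_group = p == v
        obs = y[in_group].mean()               # observed frequency at forecast v
        rel += in_group.mean() * (v - obs) ** 2           # calibration penalty
        res += in_group.mean() * (obs - base_rate) ** 2   # informativeness reward
    return rel, res, unc
```

A forecaster that always emits the base rate has zero reliability penalty but also zero resolution: calibrated yet uninformative.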
ECE is a biased, binning-dependent estimator
The plug-in ECE is a biased, inconsistent estimator of the population calibration error and depends sensitively on the binning scheme. Kumar, Liang, and Ma (2019) propose a debiased ECE estimator with explicit finite-sample bias correction and convergence guarantees. Naeini, Cooper, and Hauskrecht (2015) propose Bayesian binning into quantiles (BBQ) as a robust alternative that averages over binning structures. When reporting calibration numbers, prefer Brier score, debiased ECE, or kernel calibration error over the raw plug-in ECE.
Which Uncertainty Question Are You Asking?
| Question | Object you need | Good first tool | Failure mode |
|---|---|---|---|
| Are predicted probabilities numerically meaningful? | Calibrated probabilities | Reliability diagram, Brier score, temperature scaling | Good accuracy can hide overconfidence |
| Which label set should contain the truth with 95% coverage? | Prediction set | Split conformal prediction | Marginal coverage can hide subgroup undercoverage |
| Is this input far from training data? | Shift or epistemic signal | Deep ensembles, OOD detector, domain monitoring | MC dropout can underestimate uncertainty |
| What threshold should trigger action? | Decision rule under costs | Cost curve, precision-recall, calibrated probabilities | A calibrated model still needs a utility choice |
Calibration is not a single number. It connects a probability model, a metric, and a decision. A hospital triage threshold, a weather forecast, and a classification benchmark ask different questions.
Calibration Methods
Temperature Scaling
The simplest post-hoc calibration method. Given logits $z \in \mathbb{R}^K$, apply a single scalar temperature $T > 0$:

$$\hat{q}_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

When $T > 1$, the softmax output becomes softer (less confident). When $T < 1$, it becomes sharper. Key properties:
- One-parameter post-hoc fit. Only $T$ is learned; the model is frozen.
- Objective. $T$ is optimized on a held-out calibration set by minimizing the negative log-likelihood (NLL). No retraining of the backbone occurs.
- Argmax preserved. Dividing every logit by the same positive scalar is a monotone, order-preserving transformation of the softmax output, so $\arg\max_k \hat{q}_k$ is unchanged. Therefore top-1 accuracy is exactly unchanged; only the confidence distribution is rescaled.
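A minimal sketch of the fit, assuming a held-out array of logits and integer labels. The NLL is smooth in $T$, so a dense log-spaced grid search is a dependable stand-in for the usual L-BFGS optimization:

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)      # subtract row max for stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """One-parameter post-hoc fit: pick T minimizing held-out NLL."""
    grid = np.geomspace(0.05, 20.0, 500)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

Because every logit is divided by the same positive scalar, `logits.argmax(1)` is identical before and after scaling; only the confidences move.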
Platt Scaling
For binary classification, fit a logistic regression on the scores (Platt, 1999, originally proposed for SVM outputs):

$$P(y = 1 \mid s) = \sigma(As + B) = \frac{1}{1 + e^{-(As + B)}}$$

(Platt's paper writes the equivalent form $1/(1 + e^{A's + B'})$ with flipped signs.) Here $s$ is the model's score (for SVMs, the margin; for neural networks, the logit) and $A, B$ are learned by minimizing NLL on a held-out set. Platt scaling is a two-parameter logistic fit. Contrast with the other standard post-hoc calibrators:
- Platt scaling. Two parameters $A, B$, logistic form, applied to scalar scores in binary classification (Platt, 1999).
- Temperature scaling. One parameter $T$, applied to the full logit vector in multiclass classification; a restriction of Platt scaling with $A = 1/T$ and $B = 0$ (Guo et al., 2017).
- Isotonic regression. Non-parametric monotone fit; learns any non-decreasing map from score to calibrated probability (Zadrozny and Elkan, 2002). More flexible but prone to overfitting on small calibration sets.
- Dirichlet calibration. Multiclass generalization that fits a full linear map on log-probabilities, capturing classwise miscalibration (Kull et al., 2019).
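An illustrative sketch of the binary Platt fit on hypothetical data, using the $\sigma(As + B)$ convention. Full-batch gradient descent on the NLL stands in for the Newton-style optimizer in Platt's paper:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit p = sigmoid(A*s + B) by gradient descent on held-out NLL."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    A, B = 1.0, 0.0                           # identity map as the start point
    for _ in range(n_iter):
        residual = sigmoid(A * s + B) - y     # dNLL/d(linear term), per example
        A -= lr * np.mean(residual * s)
        B -= lr * np.mean(residual)
    return A, B
```

With enough held-out data, the fitted $(A, B)$ recover the true sigmoid relationship between score and label frequency, which is the consistency claim in the theorem below.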
Platt Scaling Calibration
Statement
If the true calibration function (mapping scores to correct-label probability) is a sigmoid, then Platt scaling with parameters $A, B$ fitted by maximum likelihood on the calibration set recovers the true calibration function as the calibration set size grows. The calibrated probabilities satisfy $\hat{P}(y = 1 \mid s) \to P(y = 1 \mid s)$ in probability.
Intuition
Platt scaling reparameterizes the model's confidence through a learned sigmoid. If the model's logits are monotonically related to true probability (which is common for well-trained models), a simple two-parameter fit corrects the miscalibration.
Why It Matters
Platt scaling is the standard calibration method for SVMs and is widely used as a baseline for neural network calibration. It requires only a small validation set and adds negligible computational cost.
Conformal Prediction
Conformal prediction is a structurally different approach: instead of calibrating point predictions, it constructs prediction sets with guaranteed coverage.
Nonconformity Score
A nonconformity score $s(x, y)$ measures how unusual the pair $(x, y)$ is relative to the model. A common choice for classification:

$$s(x, y) = 1 - f(x)_y$$

where $f(x)_y$ is the model's predicted probability for class $y$. High scores mean the model finds this label surprising.
Split Conformal Coverage Guarantee
Statement
Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be calibration data and $(X_{n+1}, Y_{n+1})$ a new test point, with all $n+1$ pairs exchangeable. Compute nonconformity scores $s_i = s(X_i, Y_i)$ for $i = 1, \dots, n$ using a model fit on a disjoint training split. Define

$$\hat{q} = s_{(k)}, \qquad k = \lceil (n+1)(1-\alpha) \rceil,$$

where $s_{(1)} \le \dots \le s_{(n)}$ are the order statistics of the calibration scores. Equivalently, $\hat{q}$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ sample quantile of $s_1, \dots, s_n$: the index $k$ identifies which order statistic to pick, while the corresponding quantile level is $k/n$. When $k > n$ (which occurs for very small $n$ or very small $\alpha$), set $\hat{q} = \infty$ and the prediction set is trivial. Then the prediction set

$$C(X_{n+1}) = \{\, y : s(X_{n+1}, y) \le \hat{q} \,\}$$

satisfies the marginal coverage bound

$$\Pr\{Y_{n+1} \in C(X_{n+1})\} \ge 1 - \alpha.$$
This is the split (or inductive) conformal formulation of Vovk et al. (2005), sharpened in the form used by Romano, Patterson, and Candès (2019) and the Angelopoulos and Bates (2021) tutorial.
Intuition
If the test point is exchangeable with the calibration data, its nonconformity score is equally likely to fall anywhere in the ranking. By choosing the threshold as the right quantile, we guarantee coverage. No assumptions about the model or data distribution are needed.
Proof Sketch
By exchangeability, the rank of $s_{n+1} = s(X_{n+1}, Y_{n+1})$ among $s_1, \dots, s_{n+1}$ is uniformly distributed over $\{1, \dots, n+1\}$. The probability that $s_{n+1}$ exceeds the $\lceil (n+1)(1-\alpha) \rceil$-th smallest score is at most $\alpha$. Therefore $Y_{n+1} \in C(X_{n+1})$ with probability at least $1 - \alpha$.
Why It Matters
Conformal prediction is the only widely-used method that provides distribution-free, finite-sample coverage guarantees. It works with any model (neural networks, random forests, LLMs) and any data type. The guarantee holds without assuming the model is correct or well-calibrated.
Failure Mode
The marginal coverage guarantee does not ensure conditional coverage: the set may be too large for easy inputs and too small for hard inputs. Achieving approximate conditional coverage is an active research area. Also, the prediction sets can be large if the underlying model is poor.
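The split procedure above is short enough to sketch in full. This toy example uses the $1 - f(x)_y$ score with a well-specified model (the model's probabilities equal the label-generating probabilities), which is an assumption of the demo, not of the guarantee:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score, q-hat."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                       # too few calibration points: trivial set
        return np.inf
    return np.sort(cal_scores)[k - 1]

def prediction_set(class_probs, qhat):
    """Labels whose nonconformity score 1 - f(x)_y falls below the threshold."""
    return np.where(1.0 - np.asarray(class_probs) <= qhat)[0]

rng = np.random.default_rng(0)

def sample_labels(probs):
    """Draw one label per row from each row's categorical distribution."""
    return (probs.cumsum(axis=1) < rng.random((len(probs), 1))).sum(axis=1)

# Calibration split: score the true label of each calibration point.
cal_probs = rng.dirichlet(np.ones(5), size=2000)
cal_y = sample_labels(cal_probs)
qhat = conformal_threshold(1.0 - cal_probs[np.arange(2000), cal_y], alpha=0.1)

# Fresh test split: empirical coverage should land near 1 - alpha = 0.9.
test_probs = rng.dirichlet(np.ones(5), size=5000)
test_y = sample_labels(test_probs)
coverage = np.mean(1.0 - test_probs[np.arange(5000), test_y] <= qhat)
```

Note the guarantee is marginal over the randomness in both splits; a single run's coverage fluctuates around $1 - \alpha$.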
MC Dropout for Uncertainty
Monte Carlo (MC) dropout provides a practical approximation to predictive uncertainty. At inference time, run the model $T$ times with dropout enabled. The predictions $\hat{y}^{(1)}, \dots, \hat{y}^{(T)}$ are treated as samples from a stochastic prediction function.

The sample mean gives a point prediction. The sample variance gives an uncertainty estimate:

$$\bar{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}^{(t)}, \qquad \hat{\sigma}^2_{\mathrm{MC}} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat{y}^{(t)} - \bar{y} \right)^2$$
MC dropout variance is not a posterior variance
The quantity $\hat{\sigma}^2_{\mathrm{MC}}$ is a predictive Monte Carlo estimate computed over random dropout masks, not a posterior variance. Gal and Ghahramani (2016) derive a variational-Bayesian interpretation under a specific Gaussian-process prior and a fixed dropout rate; outside that construction, MC dropout variance is a heuristic. Osband (2016) distinguishes risk (inherent noise in $p(y \mid x)$) from epistemic uncertainty (lack of knowledge about model parameters) and shows that MC dropout captures neither cleanly. Hron, Matthews, and Ghahramani (2017) prove that the variational family induced by dropout is miscalibrated as an approximate posterior and systematically underestimates epistemic uncertainty on out-of-distribution inputs. Treat MC dropout as a cheap baseline for predictive spread, not as a certified Bayesian method.
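A toy NumPy sketch makes the caveat concrete. The weights are random placeholders standing in for a trained regression network; the only point is the mechanics of keeping dropout on at inference and reading off mean and spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights for a tiny 2-layer regression net (not a trained model).
W1, b1 = rng.normal(size=(4, 32)) * 0.5, np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)) * 0.5, np.zeros(1)

def forward(x, drop_rate=0.2):
    """One stochastic pass with inverted dropout left ON at inference."""
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate      # fresh mask every call
    h = h * mask / (1.0 - drop_rate)            # inverted-dropout rescaling
    return h @ W2 + b2

def mc_dropout_predict(x, T=200):
    """T stochastic passes: sample mean as prediction, sample variance as spread."""
    samples = np.stack([forward(x) for _ in range(T)])   # shape (T, n, 1)
    return samples.mean(axis=0), samples.var(axis=0)
```

The returned variance is literally the spread of outputs over dropout masks; nothing in the computation touches a posterior over parameters.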
Deep Ensembles
Deep ensembles (Lakshminarayanan, Pritzel, and Blundell, 2017) train $M$ independent networks with different random initializations and data shuffling, then average the predicted distributions:

$$p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(y \mid x)$$
Despite the naive formulation, deep ensembles are a strong uncertainty baseline: Ovadia et al. (2019), "Can you trust your model's uncertainty?" (NeurIPS), benchmarks MC dropout, temperature scaling, stochastic variational inference, and ensembles under dataset shift and finds deep ensembles consistently best on both accuracy and calibration as the shift grows. For AI-safety-adjacent deployment, ensembles provide the most reliable out-of-distribution signal per unit of engineering effort, at the cost of train and inference compute.
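The averaging step is trivial but the design choice matters: average the members' probability vectors, not their logits, so that disagreement between confident members flattens the ensemble prediction and shows up as high predictive entropy. A minimal sketch with toy logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    """Average the M members' probability vectors; shape (M, n, K) -> (n, K)."""
    probs = softmax(np.asarray(member_logits, dtype=float))
    return probs.mean(axis=0)

# Two confident members that disagree yield an uncertain ensemble:
p = ensemble_predict([[[10.0, 0.0]], [[0.0, 10.0]]])
# p[0] ≈ [0.5, 0.5]: disagreement reads as uncertainty
```

Averaging logits instead would keep the prediction confident even when members disagree, discarding exactly the out-of-distribution signal that makes ensembles useful.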
Conformal Variants
Beyond vanilla split conformal, two variants are important in practice:
- Adaptive prediction sets (APS). Romano, Sesia, and Candès (2020) replace the nonconformity score with a cumulative score that sums the sorted class probabilities down to the true class. The resulting sets adapt their size to the difficulty of the input: confident inputs receive singletons, ambiguous inputs receive larger sets. APS retains the marginal coverage guarantee and is designed to improve conditional coverage; empirically the conditional miscoverage gap is much smaller than for naive split conformal, but APS does not satisfy exact conditional coverage as a theorem (which is known to be impossible in finite samples without further assumptions, per Vovk 2012 and Lei and Wasserman 2014).
- Adaptive conformal inference under distribution shift. Gibbs and Candès (2021) propose an online procedure that updates the miscoverage level based on past coverage errors, preserving long-run coverage even when exchangeability fails due to drift. This is the method of choice when the test-time distribution is non-stationary.
Deployment Questions
- Is the calibration set independent of the training and model-selection process?
- Does the deployment population match the calibration population, or is there covariate shift, label shift, or concept drift?
- Are you auditing top-label calibration only, or do non-argmax probabilities matter for downstream decisions?
- Does the user need a probability, a ranked list, or a conformal prediction set?
- What happens when the model abstains, returns a large conformal set, or gives a low-confidence answer?
Common Confusions
Calibration is not accuracy
A model can be perfectly calibrated but have low accuracy (it just knows what it does not know). Conversely, a model can be highly accurate but poorly calibrated (it is always overconfident). Calibration and accuracy are independent properties. You want both.
Conformal prediction does not fix bad models
Conformal prediction guarantees coverage regardless of model quality, but the prediction sets will be large if the model is poor. A random classifier with conformal prediction will produce prediction sets that include almost all classes. The guarantee is real, but the sets are only useful if the model has reasonable discriminative power.
Top-label ECE hides classwise miscalibration
Standard ECE audits only the predicted class's probability. A model can look well-calibrated under top-label ECE while systematically misrepresenting probabilities on the other classes. For multiclass reliability (e.g., medical triage with non-trivial costs on the non-argmax classes), report classwise ECE (Nixon et al., 2019) alongside Brier score.
Conformal prediction is coverage, not probability calibration
A conformal set with 95% marginal coverage does not mean each included label has 95% probability, and it does not imply that the model's softmax scores are calibrated. Conformal prediction wraps a model with a coverage guarantee; probability calibration asks whether the scores themselves match empirical frequencies.
Summary
- Calibration means predicted probabilities match empirical frequencies.
- Reliability diagrams plot mean predicted probability against observed frequency; the diagonal is perfect calibration.
- ECE is binning-dependent and biased; use Brier score or debiased ECE (Kumar-Liang-Ma, 2019) when reporting numbers.
- Temperature scaling is a one-parameter NLL fit on logits that preserves argmax. Platt scaling is the two-parameter binary analogue. Isotonic regression is the non-parametric monotone alternative. Dirichlet calibration extends Platt to the multiclass simplex.
- Conformal prediction gives distribution-free finite-sample coverage: $\Pr\{Y_{n+1} \in C(X_{n+1})\} \ge 1 - \alpha$. The split threshold is the $k$-th order statistic with $k = \lceil (n+1)(1-\alpha) \rceil$.
- Deep ensembles (Lakshminarayanan et al., 2017) are the strongest simple UQ baseline under shift; MC dropout variance is a heuristic that can underestimate epistemic uncertainty.
- Calibration matters most in high-stakes settings: medicine, autonomy, finance, AI safety.
- The deployment decision still needs a cost model or abstention policy; a calibrated probability is an input, not the final decision.
Exercises
Problem
A model makes 1000 predictions, each with confidence 0.9. Of these, 820 are correct. Is the model well-calibrated at the 90% confidence level? Compute the calibration gap for this bin.
Problem
You have a calibration set of $n$ examples and want conformal prediction sets with coverage $1 - \alpha$. Which order statistic of the nonconformity scores do you use as the threshold, and what sample quantile level does that correspond to? If you increase $n$ to 5000, how does the threshold change?
Problem
Conformal prediction guarantees marginal coverage $\Pr\{Y \in C(X)\} \ge 1 - \alpha$ but not conditional coverage $\Pr\{Y \in C(X) \mid X = x\} \ge 1 - \alpha$. Construct an example where marginal coverage holds at 95% but conditional coverage fails badly for a specific subgroup.
References
Canonical calibration:
- Brier, "Verification of Forecasts Expressed in Terms of Probability" (Monthly Weather Review, 1950). Original Brier score.
- Murphy, "A New Vector Partition of the Probability Score" (J. Applied Meteorology, 1973). Reliability-resolution-uncertainty decomposition.
- Platt, "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods" (1999). Sigmoid post-hoc calibration.
- Zadrozny and Elkan, "Transforming Classifier Scores into Accurate Multiclass Probability Estimates" (KDD 2002). Isotonic regression calibration.
- Niculescu-Mizil and Caruana, "Predicting Good Probabilities with Supervised Learning" (ICML 2005). Reliability diagrams and calibration comparison.
- Guo, Pleiss, Sun, and Weinberger, "On Calibration of Modern Neural Networks" (ICML 2017). Temperature scaling.
ECE refinements and multiclass:
- Naeini, Cooper, and Hauskrecht, "Obtaining Well Calibrated Probabilities Using Bayesian Binning" (AAAI 2015). BBQ estimator.
- Kull et al., "Beyond Temperature Scaling: Obtaining Well-Calibrated Multi-class Probabilities with Dirichlet Calibration" (NeurIPS 2019).
- Nixon, Dusenberry, Zhang, Jerfel, and Tran, "Measuring Calibration in Deep Learning" (CVPR Workshops 2019). Classwise and adaptive ECE.
- Kumar, Liang, and Ma, "Verified Uncertainty Calibration" (NeurIPS 2019). Debiased ECE with convergence analysis.
- Vaicenavicius et al., "Evaluating Model Calibration in Classification" (AISTATS 2019). Full vs top-label calibration.
Conformal prediction:
- Vovk, Gammerman, and Shafer, Algorithmic Learning in a Random World (Springer, 2005), Chs 2-4. Foundational conformal theory.
- Romano, Patterson, and Candès, "Conformalized Quantile Regression" (NeurIPS 2019). Modern split-conformal formulation.
- Romano, Sesia, and Candès, "Classification with Valid and Adaptive Coverage" (NeurIPS 2020). Adaptive prediction sets (APS).
- Gibbs and Candès, "Adaptive Conformal Inference Under Distribution Shift" (NeurIPS 2021).
- Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" (arXiv:2107.07511, 2021). Tutorial statement of split conformal used here.
- Barber, Candès, Ramdas, and Tibshirani, "Predictive inference with the jackknife+" (Annals of Statistics, 2021). Finite-sample predictive intervals from resampling.
Bayesian and ensemble UQ:
- Gal and Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016). MC dropout.
- Osband, "Risk vs Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout" (NeurIPS Workshop 2016).
- Lakshminarayanan, Pritzel, and Blundell, "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" (NeurIPS 2017).
- Hron, Matthews, and Ghahramani, "Variational Gaussian Dropout is not Bayesian" (arXiv:1711.02989, 2017).
- Ovadia et al., "Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift" (NeurIPS 2019).
Next Topics
The natural next steps from calibration and uncertainty:
- Red-teaming and adversarial evaluation: testing whether models fail gracefully when calibration and uncertainty estimates are pushed to their limits
Last reviewed: April 25, 2026