

Burn-in and Convergence Diagnostics

Burn-in is only the first filter. Modern MCMC trust comes from split rank-normalized R-hat, bulk and tail ESS, trace behavior, and sampler-specific warnings like divergences.


Why This Matters

Figure: four chains from overdispersed starts. Top: trace plots of chain values over 1000 iterations, with the burn-in segment (discarded) followed by the mixed segment (samples usable). Bottom: running Gelman-Rubin $\hat{R}$, falling to $\hat{R} = 1.01$ (Vehtari et al. 2021).

MCMC gives you samples from a target distribution, but only eventually. The chain starts from some arbitrary initial state and needs time to "forget" where it started. If you use samples from before the chain has converged, your estimates will be biased toward the initial conditions.

Convergence diagnostics answer the most important practical question in MCMC: are my samples trustworthy? There is no perfect answer, but there is a modern diagnostic bundle that catches the failures that matter in practice: chains disagreeing with each other, chains mixing too slowly, chains missing the tails, and gradient-based samplers quietly breaking on bad geometry.

Mental Model

Imagine dropping a ball into a complex bowl. Initially, the ball rolls around chaotically depending on where you dropped it. After enough time, it settles into a pattern that depends only on the shape of the bowl, not on the starting position. Burn-in is the period of chaotic rolling that you throw away. Convergence diagnostics help you judge when the ball has "settled."

The hard truth: you can never prove a chain has converged. You can only detect certain kinds of non-convergence.

Formal Setup and Notation

Let $\{X_t\}_{t=0}^{T}$ be a Markov chain with stationary distribution $\pi$. We want to estimate $\mathbb{E}_\pi[f(X)]$ using the ergodic average:

$$\hat{\mu}_T = \frac{1}{T - B} \sum_{t=B+1}^{T} f(X_t)$$

where $B$ is the burn-in period (number of initial samples discarded).

Definition

Burn-in

The burn-in period is the initial segment of the Markov chain that is discarded before computing estimates. The purpose is to reduce bias from the initial state $X_0$. After burn-in, the chain should be approximately sampling from the stationary distribution $\pi$.

There is no universal formula for how long burn-in should be. It depends on how far $X_0$ is from the typical set of $\pi$, the mixing rate of the chain, and the geometry of $\pi$.

Definition

Mixing Time

The mixing time $\tau_{\text{mix}}(\epsilon)$ is the smallest $t$ such that the chain's distribution is within total variation distance $\epsilon$ of $\pi$, regardless of the starting state:

$$\tau_{\text{mix}}(\epsilon) = \min\{t : \sup_{X_0} \|P^t(X_0, \cdot) - \pi\|_{\text{TV}} \leq \epsilon\}$$

This is a property of the chain, not the initial state. A chain with small mixing time forgets its initial state quickly.
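
For small finite state spaces, this definition can be checked exactly by computing $P^t$ directly. A minimal numpy sketch (the two-state chain and the choice $\epsilon = 0.01$ are illustrative, not from the text above):

```python
import numpy as np

# Worked example: a two-state chain with stay-probability 0.9 in each state.
# Its stationary distribution is uniform, and the total variation distance
# from it decays like 0.8^t (0.8 is the second eigenvalue of P).
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])

def tv_from_start(P, start, t):
    """TV distance between the t-step distribution from `start` and pi."""
    dist = np.linalg.matrix_power(P, t)[start]
    return 0.5 * np.abs(dist - pi).sum()

def mixing_time(P, eps=0.01):
    """Smallest t with worst-case-over-starts TV distance <= eps."""
    t = 0
    while max(tv_from_start(P, s, t) for s in (0, 1)) > eps:
        t += 1
    return t

print(mixing_time(P))  # 18 for this chain, since 0.5 * 0.8^18 <= 0.01
```

Note that the answer is the same from either starting state here only because the chain is symmetric; the definition takes the worst case over starts.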

Definition

Effective Sample Size (ESS)

MCMC samples are autocorrelated, so $T$ samples carry less information than $T$ independent samples. The effective sample size is:

$$\text{ESS} = \frac{T}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$

where $\rho_k = \mathrm{Corr}(f(X_t), f(X_{t+k}))$ is the lag-$k$ autocorrelation. If consecutive samples are highly correlated, ESS can be much smaller than $T$. The standard error of the MCMC estimate is approximately $\sigma / \sqrt{\text{ESS}}$.
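
The infinite sum has to be truncated in practice. A minimal numpy sketch, where cutting the sum at the first negative autocorrelation is a simplified stand-in for the initial-sequence estimators real packages use:

```python
import numpy as np

def effective_sample_size(x):
    """ESS = T / (1 + 2 * sum of positive autocorrelations).

    Truncating at the first negative autocorrelation is a simplified
    stand-in for Geyer-style initial-sequence estimators.
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    x = x - x.mean()
    var = np.mean(x * x)
    rho_sum = 0.0
    for k in range(1, T // 2):
        rho_k = np.mean(x[:-k] * x[k:]) / var
        if rho_k < 0:
            break
        rho_sum += rho_k
    return T / (1.0 + 2.0 * rho_sum)

rng = np.random.default_rng(0)

# Independent draws: ESS should come out close to T = 5000.
iid = rng.standard_normal(5000)

# AR(1) chain with coefficient 0.9: theory gives an ESS factor of
# (1 - 0.9) / (1 + 0.9), i.e. only roughly 260 effective draws of 5000.
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()

ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
print(round(ess_iid), round(ess_ar))
```

The same autocorrelation structure that shrinks ESS also inflates the standard error $\sigma / \sqrt{\text{ESS}}$, which is why a "long" chain can still give imprecise estimates.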

Core Definitions

Trace plots show $f(X_t)$ vs $t$. A well-mixing chain looks like white noise: it moves rapidly through the support of $\pi$ with no visible trends or long excursions. A poorly mixing chain shows long plateaus (the chain is stuck) or slow drifts (the chain has not reached stationarity).

Autocorrelation plots show $\rho_k$ vs lag $k$. Ideally, autocorrelations decay rapidly to zero. If they remain high at large lags, the chain is mixing slowly and you need many more samples (or a better sampler).

Running mean plots show the cumulative mean of $f(X_t)$ vs $t$. If the chain has converged, the running mean stabilizes. If it drifts, the chain has not mixed.
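
The running-mean check is a one-liner. A sketch with a synthetic stand-in for $f(X_t)$ (the chain here is simulated white noise, not output from a real sampler):

```python
import numpy as np

def running_mean(x):
    """Cumulative mean of f(X_t) up to each iteration."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(3)
chain = rng.standard_normal(5000)  # stand-in for f(X_t) from a well-mixed chain
rm = running_mean(chain)

# For a converged chain the running mean settles near the target mean
# (0 here); persistent drift late in `rm` would signal non-stationarity.
print(rm[0], rm[-1])
```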

The Modern Diagnostic Bundle

Modern Bayesian software does not treat convergence as "burn-in plus one $\hat{R}$ value." The practical bundle is:

  • Split, rank-normalized $\hat{R}$: catch chains that still disagree in location or scale even after long runs. This is the default modern replacement for the original Gelman-Rubin statistic.
  • Bulk ESS: ask whether the central mass of the posterior has enough effectively independent draws for means, variances, and ordinary posterior summaries.
  • Tail ESS: ask whether the tails are explored well enough for intervals, extreme quantiles, and rare-event posterior statements.
  • Trace and pair plots: show whether the chain is still drifting, switching modes, or getting trapped along a curved ridge.
  • Sampler-specific warnings: for Hamiltonian Monte Carlo and NUTS, divergences and low-energy (E-BFMI) warnings often reveal geometric failure even when scalar summaries look superficially fine.

The deeper lesson is that diagnostics come in layers. Burn-in fights initialization bias. $\hat{R}$ checks cross-chain agreement. ESS checks Monte Carlo precision. Divergences and energy diagnostics check whether the sampler itself is numerically respecting the target geometry.

Main Theorems

Proposition

Classical Gelman-Rubin R-hat Diagnostic

Statement

Run $M \geq 2$ chains, each of length $N$ (after burn-in), from overdispersed starting points. Let $\bar{\theta}_m$ be the mean of chain $m$ and $\bar{\theta}$ the grand mean. Define:

Between-chain variance: $B = \frac{N}{M-1}\sum_{m=1}^{M}(\bar{\theta}_m - \bar{\theta})^2$

Within-chain variance: $W = \frac{1}{M}\sum_{m=1}^{M} s_m^2$, where $s_m^2$ is the sample variance of chain $m$

The potential scale reduction factor is:

$$\hat{R} = \sqrt{\frac{\frac{N-1}{N}W + \frac{1}{N}B}{W}}$$

If the chains have converged, $\hat{R} \approx 1$. Values of $\hat{R} > 1.01$ suggest non-convergence.
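
The formula translates directly into a few lines of numpy. A sketch (the chain shapes and the two toy scenarios are illustrative):

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classical Gelman-Rubin R-hat for an array of shape (M chains, N draws)."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (N - 1) / N * W + B / N         # pooled variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(1)

# Four chains all sampling the same N(0, 1) target: R-hat near 1.
mixed = rng.standard_normal((4, 1000))

# Four chains parked at different locations: R-hat far above 1.
stuck = rng.standard_normal((4, 1000)) + np.array([[-3.0], [-1.0], [1.0], [3.0]])

print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```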

Intuition

If all chains have converged to $\pi$, they should have similar means and variances. The between-chain variance $B$ captures disagreement among chains. The within-chain variance $W$ captures variation within each chain. If $B \gg W$, the chains disagree, suggesting they have not converged. $\hat{R}$ measures the ratio of total estimated variance to within-chain variance.

Proof Sketch

The numerator $\frac{N-1}{N}W + \frac{1}{N}B$ is an overestimate of $\mathrm{Var}_\pi(f)$ (because $B$ includes both true variance and between-chain disagreement), and $W$ is an underestimate (because finite chains have not explored the full distribution). Their ratio exceeds 1 when chains have not mixed. As $N \to \infty$, both converge to $\mathrm{Var}_\pi(f)$ and $\hat{R} \to 1$.

Why It Matters

$\hat{R}$ is the most widely used convergence diagnostic. Every major Bayesian software package (Stan, PyMC, JAGS) reports it. The rule $\hat{R} < 1.01$ (or the older threshold $1.1$) is standard practice. It catches the most common failure mode: chains stuck in different modes of a multimodal distribution.

Failure Mode

$\hat{R} \approx 1$ does not guarantee convergence. All chains could be stuck in the same mode while missing others. If the starting points are not overdispersed enough, the diagnostic has no power. Also, $\hat{R}$ is defined for scalar quantities; for multivariate targets, you need to check $\hat{R}$ for each marginal and for derived quantities.

The original $\hat{R}$ above is historically important, but modern packages usually report split, rank-normalized $\hat{R}$ (Vehtari et al. 2021). That update matters because the classic statistic can miss heavy tails, skewness, and chains that agree on the mean while disagreeing elsewhere. In practice, if a package exposes only one convergence number, you want the modern split/rank-normalized version, not the 1992 formula by itself.
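
The "split" half of that update is easy to sketch: halve every chain and feed the halves to the classical formula, so within-chain drift shows up as between-half disagreement. This numpy sketch deliberately omits the rank-normalization step of the full Vehtari et al. (2021) diagnostic, and the drifting toy chains are illustrative:

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat: halve each chain, then apply the classical formula.

    Rank-normalization (the other half of the 2021 diagnostic) is
    omitted here to keep the sketch short.
    """
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    half = N // 2
    split = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    Ms, Ns = split.shape
    B = Ns * split.mean(axis=1).var(ddof=1)
    W = split.var(axis=1, ddof=1).mean()
    return float(np.sqrt(((Ns - 1) / Ns * W + B / Ns) / W))

rng = np.random.default_rng(2)

# Chains that drift from -2 to +2 over the run: their first and second
# halves disagree, so split R-hat flags a problem that per-chain means hide.
drifting = np.linspace(-2, 2, 1000)[None, :] + 0.1 * rng.standard_normal((4, 1000))
stationary = rng.standard_normal((4, 1000))

print(split_rhat(drifting), split_rhat(stationary))
```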

Divergences, Geometry, and Why Burn-In Is Not Enough

For random-walk samplers, non-convergence usually appears as sticky traces, large autocorrelations, or chains parked in different modes. For HMC-style samplers, there is another failure mode: the numerical integrator can start breaking at the same parts of the posterior over and over again.

A divergence is not "just another rejected proposal." It is a sign that the leapfrog integrator encountered geometry too sharp for the current step size or parameterization. Neal's funnel and weakly identified hierarchical models are the canonical examples. In those settings, more burn-in is not the fix. The fix is usually better geometry: reparameterization, stronger prior structure, or a more faithful mass matrix adaptation. That is why pages like No-U-Turn Sampler and Neal's Funnel and Centered vs. Non-Centered Hierarchical Models matter downstream of diagnostics.

Canonical Examples

Example

Bimodal target

Suppose $\pi$ is a mixture of two well-separated Gaussians. A single Metropolis-Hastings chain may get stuck in one mode for the entire run. The trace plot shows a flat line, and the running mean converges to the mode's mean rather than the mixture mean. Running 4 chains from different starting points reveals the problem: chains in different modes have different means, so $\hat{R} \gg 1$.
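
This failure is easy to reproduce. A sketch with a toy bimodal target (the mode locations, step size, and chain count are illustrative choices):

```python
import numpy as np

def log_target(x):
    # log density (up to a constant) of 0.5*N(-10, 1) + 0.5*N(+10, 1)
    return np.logaddexp(-0.5 * (x + 10) ** 2, -0.5 * (x - 10) ** 2)

def metropolis(x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis with Gaussian proposals."""
    rng = np.random.default_rng(seed)
    x, out = x0, np.empty(n_steps)
    for t in range(n_steps):
        prop = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        out[t] = x
    return out

# Four chains, two started in each mode: with step size 1, a jump across
# the 20-unit gap between modes essentially never happens.
starts = [-10.0, -10.0, 10.0, 10.0]
chains = np.stack([metropolis(x0, 2000, seed=i) for i, x0 in enumerate(starts)])

# Each chain's mean sits near its own mode, not the mixture mean of 0,
# and the classical R-hat formula flags the disagreement.
means = chains.mean(axis=1)
N = chains.shape[1]
B = N * means.var(ddof=1)
W = chains.var(axis=1, ddof=1).mean()
rhat = float(np.sqrt(((N - 1) / N * W + B / N) / W))
print(means.round(1), round(rhat, 1))
```

Note that if all four chains had been started in the same mode, the same run would have produced $\hat{R} \approx 1$, which is exactly the overdispersed-starts caveat from the failure-mode discussion above.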

Example

Well-mixing chain

For a 2D Gaussian target with moderate correlation, a well-tuned random-walk Metropolis chain with acceptance rate around 0.234 shows rapid mixing. The trace plot looks like white noise, autocorrelations decay to zero by lag 20, and $\hat{R}$ from 4 chains is below 1.01 after a few hundred iterations.

Common Confusions

Watch Out

Burn-in is not a fixed fraction

Some textbooks suggest discarding the first 50% of samples as burn-in. This is wasteful for well-mixing chains and insufficient for slowly-mixing chains. Use diagnostics to judge when the chain has converged, then discard the pre-convergence samples. There is no one-size-fits-all burn-in length.

Watch Out

Low R-hat does not mean the chain has converged

$\hat{R} < 1.01$ is necessary but not sufficient for convergence. It is possible for all chains to be stuck in the same local mode, giving $\hat{R} \approx 1$ while the chain has not explored the full target. Always supplement $\hat{R}$ with trace plots, autocorrelation analysis, and domain knowledge.

Watch Out

More samples is not always the answer

If the chain is not mixing, running it longer just gives you more samples from the wrong distribution. Fix the sampler first (better proposal, HMC, reparameterization), then run longer.

Watch Out

Burn-in is not the cure for bad geometry

If a chain shows persistent divergences, very low ESS, or clear funnel-like pair plots, discarding a larger prefix does not solve the problem. Burn-in can remove initialization bias, but it cannot repair a centered hierarchical parameterization that keeps forcing the sampler through a narrow neck. That is why diagnostics should trigger model surgery, not just a longer run.

Summary

  • Burn-in removes initialization bias; it does not certify convergence
  • Modern practice uses split rank-normalized $\hat{R}$ plus bulk and tail ESS, not a single scalar heuristic
  • Classical $\hat{R}$ compares within-chain to between-chain variance and should be close to 1, but it is not the whole story
  • ESS translates autocorrelation into Monte Carlo precision
  • Trace plots and pair plots still matter because they expose modes, drift, and funnel geometry that scalar summaries can miss
  • Divergences are geometry alarms, not cosmetic warnings
  • You can never prove convergence from finite output; you can only assemble enough evidence that obvious failure modes are absent

Exercises

ExerciseCore

Problem

You run 4 MCMC chains for 10,000 iterations each. The within-chain variance is $W = 2.0$ and the between-chain variance is $B = 20.0$. Compute $\hat{R}$ and interpret.

ExerciseAdvanced

Problem

A chain of 10,000 samples has lag-1 autocorrelation $\rho_1 = 0.95$, and autocorrelations decay geometrically: $\rho_k = 0.95^k$. What is the effective sample size?

References

Canonical:

  • Gelman & Rubin, "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science (1992)
  • Geyer, "Practical Markov Chain Monte Carlo," Statistical Science (1992)

Current / practical:

  • Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC," Bayesian Analysis 16(2) (2021), 667-718. Modern split rank-normalized $\hat{R}$, bulk ESS, and tail ESS.
  • Stan Development Team, Stan Reference Manual, diagnostics chapters. Practical source for divergences, energy diagnostics, ESS, and $\hat{R}$ in modern HMC workflows.
  • Betancourt, "A Conceptual Introduction to Hamiltonian Monte Carlo" (2017, arXiv:1701.02434). Why divergences are geometry warnings rather than generic MCMC noise.
  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7.
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5.
  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2nd edition, 2017), Chapters 4-5.


Last reviewed: April 25, 2026
