Burn-in and Convergence Diagnostics
Burn-in is only the first filter. Modern MCMC trust comes from split rank-normalized R-hat, bulk and tail ESS, trace behavior, and sampler-specific warnings like divergences.
Why This Matters
Figure: Four chains from overdispersed starts, showing trace plots (top) and the running Gelman-Rubin R-hat (bottom).
MCMC gives you samples from a target distribution, but only eventually. The chain starts from some arbitrary initial state and needs time to "forget" where it started. If you use samples from before the chain has converged, your estimates will be biased toward the initial conditions.
Convergence diagnostics answer the most important practical question in MCMC: are my samples trustworthy? There is no perfect answer, but there is a modern diagnostic bundle that catches the failures that matter in practice: chains disagreeing with each other, chains mixing too slowly, chains missing the tails, and gradient-based samplers quietly breaking on bad geometry.
Mental Model
Imagine dropping a ball into a complex bowl. Initially, the ball rolls around chaotically depending on where you dropped it. After enough time, it settles into a pattern that depends only on the shape of the bowl, not on the starting position. Burn-in is the period of chaotic rolling that you throw away. Convergence diagnostics help you judge when the ball has "settled."
The hard truth: you can never prove a chain has converged. You can only detect certain kinds of non-convergence.
Formal Setup and Notation
Let $(X_t)_{t \ge 0}$ be a Markov chain with stationary distribution $\pi$. We want to estimate $\mathbb{E}_\pi[f]$ using the ergodic average:

$$\hat{f} = \frac{1}{N - B} \sum_{t=B+1}^{N} f(X_t)$$

where $B$ is the burn-in period (number of initial samples discarded).
Burn-in
The burn-in period is the initial segment of the Markov chain that is discarded before computing estimates. The purpose is to reduce bias from the initial state $X_0$. After burn-in, the chain should be approximately sampling from the stationary distribution $\pi$.
There is no universal formula for how long burn-in should be. It depends on how far $X_0$ is from the typical set of $\pi$, the mixing rate of the chain, and the geometry of $\pi$.
Mixing Time
The mixing time $t_{\text{mix}}(\varepsilon)$ is the smallest $t$ such that the chain's distribution is within total variation distance $\varepsilon$ of $\pi$, regardless of the starting state:

$$t_{\text{mix}}(\varepsilon) = \min\left\{ t : \max_{x_0} \left\| P^t(x_0, \cdot) - \pi \right\|_{TV} \le \varepsilon \right\}$$
This is a property of the chain, not the initial state. A chain with small mixing time forgets its initial state quickly.
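The definition above can be checked directly on a chain small enough that $P^t$ is computable. The following sketch uses an illustrative two-state transition matrix of my own choosing (not from the text) and finds the mixing time by brute force:

```python
import numpy as np

# Two-state chain: stay in state 0 w.p. 0.9, stay in state 1 w.p. 0.8.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])          # stationary distribution (solves pi P = pi)

def mixing_time(P, pi, eps=0.01):
    """Smallest t with max over starts of TV(P^t(x0, .), pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        # total variation distance, worst case over starting states
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= eps:
            return t
    raise RuntimeError("did not mix within 10,000 steps")

print(mixing_time(P, pi))          # steps needed to forget the start
```

The second eigenvalue of this matrix is 0.7, so the worst-case TV distance shrinks by a factor of 0.7 per step: the chain forgets its start geometrically fast.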
Effective Sample Size (ESS)
MCMC samples are autocorrelated, so $N$ samples carry less information than $N$ independent samples. The effective sample size is:

$$\text{ESS} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k}$$

where $\rho_k$ is the lag-$k$ autocorrelation. If consecutive samples are highly correlated, ESS can be much smaller than $N$. The standard error of the MCMC estimate is approximately $\sigma / \sqrt{\text{ESS}}$, not $\sigma / \sqrt{N}$.
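As a rough illustration, the ESS formula can be estimated from one chain by summing empirical autocorrelations until they hit noise. This is a simplified truncation heuristic, not the full Geyer initial-sequence estimator used by Stan or ArviZ; the AR(1) test chain and its parameters are illustrative:

```python
import numpy as np

def ess(x, max_lag=1000):
    """ESS = N / (1 + 2 * sum_k rho_k), truncating the sum at the first
    negative autocorrelation estimate (a simple heuristic cutoff)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acov0 = xc @ xc / n                      # lag-0 autocovariance
    tau = 1.0                                # integrated autocorrelation time
    for k in range(1, min(max_lag, n)):
        rho_k = (xc[:n - k] @ xc[k:]) / (n * acov0)
        if rho_k < 0:                        # beyond here the estimates are noise
            break
        tau += 2.0 * rho_k
    return n / tau

# AR(1) chain with autocorrelation rho; theory: ESS = N (1 - rho) / (1 + rho)
rng = np.random.default_rng(0)
rho, n = 0.9, 50_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

print(ess(x), n * (1 - rho) / (1 + rho))     # estimate vs theory (about 2632)
```

Even though the chain has 50,000 draws, it carries only about as much information as a few thousand independent samples, which is exactly what the standard error formula above accounts for.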
Core Definitions
Trace plots show $f(X_t)$ vs $t$. A well-mixing chain looks like white noise: it moves rapidly through the support of $\pi$ with no visible trends or long excursions. A poorly mixing chain shows long plateaus (the chain is stuck) or slow drifts (the chain has not reached stationarity).
Autocorrelation plots show $\rho_k$ vs lag $k$. Ideally, autocorrelations decay rapidly to zero. If they remain high at large lags, the chain is mixing slowly and you need many more samples (or a better sampler).
Running mean plots show the cumulative mean of $f(X_t)$ vs $t$. If the chain has converged, the running mean stabilizes. If it drifts, the chain has not mixed.
The Modern Diagnostic Bundle
Modern Bayesian software does not treat convergence as "burn-in plus one value." The practical bundle is:
- Split, rank-normalized $\hat{R}$: catch chains that still disagree in location or scale even after long runs. This is the default modern replacement for the original Gelman-Rubin statistic.
- Bulk ESS: ask whether the central mass of the posterior has enough effectively independent draws for means, variances, and ordinary posterior summaries.
- Tail ESS: ask whether the tails are explored well enough for intervals, extreme quantiles, and rare-event posterior statements.
- Trace and pair plots: show whether the chain is still drifting, switching modes, or getting trapped along a curved ridge.
- Sampler-specific warnings: for Hamiltonian Monte Carlo and NUTS, divergences and low energy exploration often reveal geometric failure even when scalar summaries look superficially fine.
The deeper lesson is that diagnostics come in layers. Burn-in fights initialization bias. $\hat{R}$ checks cross-chain agreement. ESS checks Monte Carlo precision. Divergences and energy diagnostics check whether the sampler itself is numerically respecting the target geometry.
Main Theorems
Classical Gelman-Rubin R-hat Diagnostic
Statement
Run $M$ chains, each of length $N$ (after burn-in), from overdispersed starting points. Let $\bar{x}_m$ be the mean of chain $m$ and $\bar{x}$ the grand mean. Define:
Between-chain variance:
$$B = \frac{N}{M - 1} \sum_{m=1}^{M} (\bar{x}_m - \bar{x})^2$$
Within-chain variance:
$$W = \frac{1}{M} \sum_{m=1}^{M} s_m^2$$
where $s_m^2$ is the sample variance of chain $m$.
The potential scale reduction factor is:
$$\hat{R} = \sqrt{\frac{\frac{N-1}{N} W + \frac{1}{N} B}{W}}$$
If the chains have converged, $\hat{R} \approx 1$. Values substantially above 1 (classically, $\hat{R} > 1.1$) suggest non-convergence.
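The formulas above translate almost line by line into code. The sketch below is a direct transcription of the classical statistic; the synthetic chains and offset values are illustrative, not from the text:

```python
import numpy as np

def rhat(chains):
    """Classical (1992-style) Gelman-Rubin R-hat.

    chains: array of shape (M, N), one row per chain."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    B = N * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (N - 1) / N * W + B / N        # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 2000))                   # 4 chains, same target
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])  # two chains offset

print(rhat(mixed))   # close to 1: chains agree
print(rhat(stuck))   # far above 1.1: chains disagree in location
```

The offset chains have identical within-chain behavior; only the between-chain term $B$ exposes the problem, which is why multiple overdispersed chains are essential.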
Intuition
If all chains have converged to $\pi$, they should have similar means and variances. The between-chain variance $B$ captures disagreement among chains. The within-chain variance $W$ captures variation within each chain. If $B \gg W$, the chains disagree, suggesting they have not converged. $\hat{R}$ measures the ratio of total estimated variance to within-chain variance.
Proof Sketch
The numerator $\frac{N-1}{N} W + \frac{1}{N} B$ is an overestimate of $\mathrm{Var}_\pi(f)$ (because $B$ includes both true variance and between-chain disagreement), and $W$ is an underestimate (because finite chains have not explored the full distribution). Their ratio exceeds 1 when chains have not mixed. As $N \to \infty$, both converge to $\mathrm{Var}_\pi(f)$ and $\hat{R} \to 1$.
Why It Matters
$\hat{R}$ is the most widely used convergence diagnostic. Every Bayesian software package (Stan, PyMC, JAGS) reports it. The rule $\hat{R} < 1.01$ (or the older threshold $\hat{R} < 1.1$) is standard practice. It catches the most common failure mode: chains stuck in different modes of a multimodal distribution.
Failure Mode
$\hat{R} \approx 1$ does not guarantee convergence. All chains could be stuck in the same mode while missing others. If the starting points are not overdispersed enough, the diagnostic has no power. Also, $\hat{R}$ is defined for scalar quantities; for multivariate targets, you need to check $\hat{R}$ for each marginal and for derived quantities.
The original $\hat{R}$ above is historically important, but modern packages usually report split, rank-normalized $\hat{R}$ (Vehtari et al. 2021). That update matters because the classic statistic can miss heavy tails, skewness, and chains that agree on the mean while disagreeing elsewhere. In practice, if a package exposes only one convergence number, you want the modern split/rank-normalized version, not the 1992 formula by itself.
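A simplified sketch of the split, rank-normalized idea follows. Assumptions of mine: ordinal ranks, the usual $(r - 3/8)/(S + 1/4)$ normal-score transform, and no folding step for scale differences; production implementations (Stan, ArviZ) do more. The example shows a case the 1992 formula misses: all chains drift upward together, so they agree with each other while none is stationary.

```python
import numpy as np
from statistics import NormalDist

def classical_rhat(chains):
    M, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)       # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()         # within-chain variance
    return np.sqrt(((N - 1) / N * W + B / N) / W)

def split_rank_rhat(chains):
    """Split each chain in half, replace draws by normal scores of their
    pooled ranks, then apply the classical R-hat formula."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    half = N // 2
    split = chains[:, :2 * half].reshape(2 * M, half)   # halve every chain
    flat = split.ravel()
    ranks = np.empty(flat.size)
    ranks[np.argsort(flat)] = np.arange(1, flat.size + 1)
    nd = NormalDist()
    z = np.array([nd.inv_cdf((r - 0.375) / (flat.size + 0.25)) for r in ranks])
    return classical_rhat(z.reshape(2 * M, half))

# Four chains that all drift upward together: at each iteration they agree
# with each other, so the unsplit 1992 statistic sees nothing wrong.
rng = np.random.default_rng(2)
chains = np.linspace(0, 5, 2000) + rng.normal(size=(4, 2000))

print(classical_rhat(chains))    # near 1: fooled by the shared drift
print(split_rank_rhat(chains))   # well above 1.1: first vs second halves differ
```

Splitting turns within-chain nonstationarity into between-chain disagreement, which is exactly what $\hat{R}$ can detect.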
Divergences, Geometry, and Why Burn-In Is Not Enough
For random-walk samplers, non-convergence usually appears as sticky traces, large autocorrelations, or chains parked in different modes. For HMC-style samplers, there is another failure mode: the numerical integrator can start breaking at the same parts of the posterior over and over again.
A divergence is not "just another rejected proposal." It is a sign that the leapfrog integrator encountered geometry too sharp for the current step size or parameterization. Neal's funnel and weakly identified hierarchical models are the canonical examples. In those settings, more burn-in is not the fix. The fix is usually better geometry: reparameterization, stronger prior structure, or a more faithful mass matrix adaptation. That is why pages like No-U-Turn Sampler and Neal's Funnel and Centered vs. Non-Centered Hierarchical Models matter downstream of diagnostics.
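A minimal sketch of why the neck of Neal's funnel defeats a fixed step size. The leapfrog integrator below is standard; the funnel parameterization ($v \sim N(0, 9)$, $x \mid v \sim N(0, e^v)$), step size, and start points are my own illustrative choices. In the wide mouth the Hamiltonian is nearly conserved; in the neck the energy error explodes, which is what HMC implementations report as a divergence:

```python
import numpy as np

def U(q):
    """Negative log density of Neal's funnel: v ~ N(0,9), x|v ~ N(0, e^v)."""
    v, x = q
    return v**2 / 18.0 + 0.5 * x**2 * np.exp(-v) + 0.5 * v

def grad_U(q):
    v, x = q
    return np.array([v / 9.0 + 0.5 - 0.5 * x**2 * np.exp(-v),
                     x * np.exp(-v)])

def energy_error(q0, p0, step, n_steps=10):
    """|H(end) - H(start)| after a leapfrog trajectory of n_steps."""
    q, p = q0.astype(float), p0.astype(float)
    H0 = U(q) + 0.5 * p @ p
    p = p - 0.5 * step * grad_U(q)        # initial half step for momentum
    for _ in range(n_steps):
        q = q + step * p
        p = p - step * grad_U(q)
    p = p + 0.5 * step * grad_U(q)        # undo the extra half step
    return abs(U(q) + 0.5 * p @ p - H0)

p0 = np.array([1.0, 1.0])
print(energy_error(np.array([2.0, 1.0]), p0, step=0.2))   # wide mouth: tiny error
print(energy_error(np.array([-6.0, 1.0]), p0, step=0.2))  # neck: error explodes
```

The same step size that conserves energy in the mouth fails catastrophically in the neck, where the conditional scale $e^{v/2}$ is tiny. No amount of extra burn-in changes that; only reparameterization or smaller steps do.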
Canonical Examples
Bimodal target
Suppose $\pi$ is a mixture of two well-separated Gaussians. A single Metropolis-Hastings chain may get stuck in one mode for the entire run. The trace plot shows a flat line, and the running mean converges to the mode's mean rather than the mixture mean. Running 4 chains from different starting points reveals the problem: chains in different modes have different means, so $\hat{R} \gg 1$.
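This failure is easy to reproduce. The sketch below runs four random-walk Metropolis chains on an illustrative equal mixture of $N(-10, 1)$ and $N(10, 1)$ (my own choice of modes and proposal scale); the short-range proposal cannot cross the valley, and the classical R-hat formula flags the disagreement:

```python
import numpy as np

def log_p(x):
    # log of 0.5*N(-10,1) + 0.5*N(10,1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 10)**2, -0.5 * (x - 10)**2)

def metropolis(x0, n, rng, prop_sd=1.0):
    xs = np.empty(n)
    x = x0
    for t in range(n):
        y = x + rng.normal(0, prop_sd)
        if np.log(rng.uniform()) < log_p(y) - log_p(x):
            x = y                     # accept the proposal
        xs[t] = x
    return xs

rng = np.random.default_rng(3)
starts = [-10.0, -10.0, 10.0, 10.0]   # two chains per mode
chains = np.array([metropolis(s, 2000, rng) for s in starts])

means = chains.mean(axis=1)
B = 2000 * means.var(ddof=1)                  # between-chain variance
W = chains.var(axis=1, ddof=1).mean()         # within-chain variance
R = np.sqrt(((2000 - 1) / 2000 * W + B / 2000) / W)
print(means.round(2))   # two chains near -10, two near +10
print(R)                # far above 1.1
```

Each chain looks healthy in isolation (stationary trace, reasonable acceptance rate); only comparing chains reveals that none of them sees the whole target.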
Well-mixing chain
For a 2D Gaussian target with moderate correlation, a well-tuned random-walk Metropolis chain with acceptance rate around 0.234 shows rapid mixing. The trace plot looks like white noise, autocorrelations decay to zero by lag 20, and $\hat{R}$ from 4 chains is below 1.01 after a few hundred iterations.
Common Confusions
Burn-in is not a fixed fraction
Some textbooks suggest discarding the first 50% of samples as burn-in. This is wasteful for well-mixing chains and insufficient for slowly-mixing chains. Use diagnostics to judge when the chain has converged, then discard the pre-convergence samples. There is no one-size-fits-all burn-in length.
Low R-hat does not mean the chain has converged
$\hat{R} \approx 1$ is necessary but not sufficient for convergence. It is possible for all chains to be stuck in the same local mode, giving $\hat{R} \approx 1$ while the chain has not explored the full target. Always supplement with trace plots, autocorrelation analysis, and domain knowledge.
More samples is not always the answer
If the chain is not mixing, running it longer just gives you more samples from the wrong distribution. Fix the sampler first (better proposal, HMC, reparameterization), then run longer.
Burn-in is not the cure for bad geometry
If a chain shows persistent divergences, very low ESS, or clear funnel-like pair plots, discarding a larger prefix does not solve the problem. Burn-in can remove initialization bias, but it cannot repair a centered hierarchical parameterization that keeps forcing the sampler through a narrow neck. That is why diagnostics should trigger model surgery, not just a longer run.
Summary
- Burn-in removes initialization bias; it does not certify convergence
- Modern practice uses split rank-normalized $\hat{R}$ plus bulk and tail ESS, not a single scalar heuristic
- Classical $\hat{R}$ compares within-chain to between-chain variance and should be close to 1, but it is not the whole story
- ESS translates autocorrelation into Monte Carlo precision
- Trace plots and pair plots still matter because they expose modes, drift, and funnel geometry that scalar summaries can miss
- Divergences are geometry alarms, not cosmetic warnings
- You can never prove convergence from finite output; you can only assemble enough evidence that obvious failure modes are absent
Exercises
Problem
You run 4 MCMC chains for $N = 10{,}000$ iterations each, obtaining within-chain variance $W$ and between-chain variance $B$. Write $\hat{R}$ in terms of $W$, $B$, and $N$, and explain how you would interpret a value well above 1.
Problem
A chain of $N = 10{,}000$ samples has lag-1 autocorrelation $\rho$, and autocorrelations decay geometrically: $\rho_k = \rho^k$. What is the effective sample size in terms of $N$ and $\rho$?
References
Canonical:
- Gelman & Rubin, "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science (1992)
- Geyer, "Practical Markov Chain Monte Carlo," Statistical Science (1992)
Current / practical:
- Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-Normalization, Folding, and Localization: An Improved R-hat for Assessing Convergence of MCMC," Bayesian Analysis 16(2) (2021), 667-718. Modern split rank-normalized $\hat{R}$, bulk ESS, and tail ESS.
- Stan Development Team, Stan Reference Manual, diagnostics chapters. Practical source for divergences, energy diagnostics, ESS, and $\hat{R}$ in modern HMC workflows.
- Betancourt, "A Conceptual Introduction to Hamiltonian Monte Carlo" (2017, arXiv:1701.02434). Why divergences are geometry warnings rather than generic MCMC noise.
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7.
- Brooks et al., Handbook of MCMC (2011), Chapters 1-5.
- Levin, Peres, Wilmer, Markov Chains and Mixing Times (2nd edition, 2017), Chapters 4-5.
Next Topics
The natural next steps from burn-in and convergence diagnostics:
- Gibbs Sampling: what slow autocorrelation looks like in a conditional-update sampler
- Hamiltonian Monte Carlo: a sampler that mixes much faster when the geometry is smooth
- No-U-Turn Sampler and Neal's Funnel: why good diagnostics still matter even after NUTS chooses path length for you
Last reviewed: April 25, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Gibbs Sampling (layer 2 · tier 1)
- Markov Chain Monte Carlo (layer 2 · tier 1)
- Metropolis-Hastings Algorithm (layer 2 · tier 1)
- Markov Chains and Steady State (layer 1 · tier 2)
- Variance Reduction Techniques (layer 2 · tier 2)
Derived topics
- No-U-Turn Sampler and Neal's Funnel (layer 3 · tier 2)