Metropolis-Hastings Algorithm
The foundational MCMC algorithm: build a Markov chain with the right stationary distribution by accepting or rejecting proposals according to a carefully balanced ratio. The real story is not only correctness, but also proposal geometry, diagnostics, and where plain MH becomes painfully slow.
Prerequisites
- Markov Chains and Steady State
- Monte Carlo Methods
- Markov Chain Monte Carlo
- Common Probability Distributions
Why This Matters
If you need to sample from a probability distribution that you can evaluate only up to a normalizing constant, you are already in Metropolis-Hastings territory. That describes most posterior distributions in Bayesian statistics. The Metropolis-Hastings algorithm is the foundational construction that makes such problems computationally tractable. It sits underneath Gibbs sampling, motivates why Hamiltonian Monte Carlo had to be invented, and still provides the cleanest introduction to how MCMC achieves correctness at all.
Before MH, Bayesian inference was largely restricted to conjugate models where closed-form posteriors exist. MH broke that barrier and made Bayesian computation general-purpose. The modern lesson is slightly different: MH gives you the correctness template, but its practical efficiency depends almost entirely on proposal design and posterior geometry.
Mental Model
You are exploring a landscape where the height at each point represents the (unnormalized) probability density of your target distribution. You stand at some point $x$. A friend proposes a new location $x'$ according to some rule. You evaluate how much more (or less) probable $x'$ is compared to $x$. If $x'$ is more probable, you always move there. If $x'$ is less probable, you move there with probability equal to the ratio $\pi(x')/\pi(x)$. Over time, you spend more time in high-probability regions, exactly in proportion to their probability.
Formal Setup and Notation
Let $\pi(x)$ be the target distribution we wish to sample from. We assume we can evaluate $\pi$ up to a normalizing constant; that is, we can compute $\tilde{\pi}(x)$ where $\pi(x) = \tilde{\pi}(x)/Z$ for an unknown constant $Z > 0$.
Let $q(x' \mid x)$ be a proposal distribution: a conditional density from which we can easily draw candidates.
Proposal Distribution
The proposal distribution $q(x' \mid x)$ is a conditional density that, given the current state $x$, generates a candidate next state $x'$. The proposal must be chosen so that it is easy to sample from and (ideally) explores the target distribution efficiently. Common choices include:
- Random walk: $q(x' \mid x) = g(x' - x)$ for a symmetric density $g$, e.g. $x' \sim \mathcal{N}(x, \sigma^2 I)$; propose near the current state
- Independence sampler: $q(x' \mid x) = q(x')$; propose independently of the current state
Acceptance Ratio
The acceptance ratio (or acceptance probability) for moving from state $x$ to proposed state $x'$ is:
$$\alpha(x, x') = \min\left(1,\ \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
When the proposal is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$, this simplifies to the original Metropolis ratio:
$$\alpha(x, x') = \min\left(1,\ \frac{\pi(x')}{\pi(x)}\right)$$
Metropolis-Hastings Algorithm
Given target $\pi$ (evaluable as $\tilde{\pi}$), proposal $q$, and initial state $x^{(0)}$:
1. Propose: Draw $x' \sim q(\cdot \mid x^{(t)})$
2. Compute the acceptance ratio $\alpha = \min\left(1,\ \dfrac{\tilde{\pi}(x')\, q(x^{(t)} \mid x')}{\tilde{\pi}(x^{(t)})\, q(x' \mid x^{(t)})}\right)$ using the unnormalized target and proposal densities
3. Accept/Reject: Draw $u \sim \mathrm{Uniform}(0, 1)$. If $u \le \alpha$, set $x^{(t+1)} = x'$; otherwise set $x^{(t+1)} = x^{(t)}$
4. Repeat from step 1
We can use the unnormalized $\tilde{\pi}$ because the constant $Z$ cancels in the ratio.
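The loop above translates almost line for line into code. Below is a minimal sketch in Python for the symmetric (random-walk) case; the names `random_walk_mh` and `log_target` are illustrative rather than from any standard library, and the log of the unnormalized density is used for numerical stability.

```python
import numpy as np

def random_walk_mh(log_target, x0, n_steps, step_size=1.0, seed=None):
    """Random-walk Metropolis-Hastings.

    log_target -- function returning log pi~(x), the unnormalized log density
    x0         -- initial state (scalar or 1-D array)
    Returns (chain, acceptance_rate).
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_steps + 1, x.size))
    chain[0] = x
    log_px = log_target(x)
    n_accept = 0
    for t in range(1, n_steps + 1):
        # Propose: x' = x + N(0, step_size^2 I); symmetric, so q cancels in the ratio.
        x_prop = x + step_size * rng.standard_normal(x.size)
        log_px_prop = log_target(x_prop)
        # Accept with probability min(1, pi~(x')/pi~(x)), evaluated in log space.
        if np.log(rng.uniform()) < log_px_prop - log_px:
            x, log_px = x_prop, log_px_prop
            n_accept += 1
        chain[t] = x  # on rejection, the current state is recorded again
    return chain, n_accept / n_steps
```

For instance, `random_walk_mh(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 10_000)` targets a standard bivariate normal using only its unnormalized log density.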
Why the Algorithm Works
The key insight is that MH constructs a Markov chain whose transition kernel satisfies detailed balance with respect to $\pi$. Detailed balance is a sufficient condition for $\pi$ to be the stationary distribution of the chain.
Detailed Balance
A Markov chain with transition kernel $P(x' \mid x)$ satisfies detailed balance with respect to $\pi$ if and only if:
$$\pi(x)\, P(x' \mid x) = \pi(x')\, P(x \mid x')$$
for all states $x, x'$. This says: the probability flow from $x$ to $x'$ under $\pi$ equals the flow from $x'$ to $x$. Detailed balance implies $\pi$ is stationary (integrate both sides over $x$ to get $\int \pi(x)\, P(x' \mid x)\, dx = \pi(x')$), but is stronger: it says the chain is reversible.
Main Theorems
MH Satisfies Detailed Balance
Statement
The Metropolis-Hastings transition kernel, which for $x' \neq x$ is
$$P(x' \mid x) = q(x' \mid x)\, \alpha(x, x')$$
(with the remaining probability assigned to staying at $x$), satisfies detailed balance with respect to $\pi$:
$$\pi(x)\, P(x' \mid x) = \pi(x')\, P(x \mid x')$$
Therefore $\pi$ is a stationary distribution of the MH chain.
Intuition
The acceptance ratio is specifically designed so that the "excess flow" from $x$ to $x'$ is throttled back to match the flow in the reverse direction. If $\pi(x')\, q(x \mid x') > \pi(x)\, q(x' \mid x)$, then the move from $x$ to $x'$ is accepted with probability 1, and the reverse move is accepted with a probability less than 1, exactly the right amount less to balance the flows.
Proof Sketch
Without loss of generality, assume $\pi(x')\, q(x \mid x') \ge \pi(x)\, q(x' \mid x)$.
Then $\alpha(x, x') = 1$ and $\alpha(x', x) = \dfrac{\pi(x)\, q(x' \mid x)}{\pi(x')\, q(x \mid x')}$.
The left side of the detailed balance equation becomes: $\pi(x)\, q(x' \mid x)\, \alpha(x, x') = \pi(x)\, q(x' \mid x)$.
The right side becomes: $\pi(x')\, q(x \mid x')\, \alpha(x', x) = \pi(x')\, q(x \mid x') \cdot \dfrac{\pi(x)\, q(x' \mid x)}{\pi(x')\, q(x \mid x')} = \pi(x)\, q(x' \mid x)$.
The two sides are equal. The opposite case follows by swapping the roles of $x$ and $x'$.
Why It Matters
This is the theoretical guarantee that MH actually samples from the correct distribution in the long run. Without detailed balance, you would have no reason to believe the chain converges to $\pi$.
Failure Mode
Detailed balance alone only guarantees that $\pi$ is a stationary distribution. For $\pi$ to be the unique stationary distribution and for the chain to converge to it from any starting point, you also need the chain to be irreducible and aperiodic, which is the content of the ergodicity theorem.
Ergodicity of the MH Chain
Statement
If the MH chain is irreducible and aperiodic, then for any initial distribution $\mu_0$, the distribution of $x^{(t)}$ converges to $\pi$ in total variation:
$$\left\| \mu_0 P^t - \pi \right\|_{\mathrm{TV}} \to 0 \quad \text{as } t \to \infty$$
Moreover, for any $\pi$-integrable function $f$:
$$\frac{1}{T} \sum_{t=1}^{T} f\!\left(x^{(t)}\right) \xrightarrow{\ \text{a.s.}\ } \mathbb{E}_{\pi}[f(X)] \quad \text{as } T \to \infty$$
Intuition
Irreducibility means the chain can reach any state from any other state. Aperiodicity means it does not get trapped in deterministic cycles. Together with detailed balance, these conditions ensure the chain "forgets" its starting point and converges to $\pi$. The ergodic theorem then says that time averages along the chain converge to expectations under $\pi$.
Proof Sketch
Detailed balance implies $\pi$-reversibility, which implies $\pi$ is stationary. Irreducibility and aperiodicity, together with the existence of a stationary distribution, imply convergence in total variation by the fundamental theorem of Markov chains. The ergodic theorem for Markov chains then gives the almost sure convergence of time averages.
Why It Matters
This theorem is what lets you use the output of an MH chain to compute expectations, posterior means, credible intervals, and any other quantity that can be expressed as $\mathbb{E}_{\pi}[f(X)]$. It is the theoretical justification for all of MCMC-based Bayesian inference.
Failure Mode
If the proposal is too narrow, the chain explores slowly and may not effectively reach all regions of high probability within a practical number of iterations. The chain is still ergodic in theory, but convergence may be so slow that your finite sample is useless. Diagnosing this requires convergence diagnostics (trace plots, $\hat{R}$, effective sample size).
Random Walk MH vs Independence Sampler
Random Walk Metropolis
In random walk MH, the proposal is centered at the current state:
$$q(x' \mid x) = g(x' - x)$$
for some symmetric density $g$. A common choice is $x' \sim \mathcal{N}(x, \sigma^2 I)$. The acceptance ratio simplifies to $\alpha(x, x') = \min\left(1, \pi(x')/\pi(x)\right)$ because $q$ is symmetric.
The step size $\sigma$ controls a tradeoff: too small means slow exploration, too large means most proposals are rejected. The optimal acceptance rate for random walk MH in the specific high-dimensional diffusion-limit analysis of Roberts, Gelman, and Gilks (1997) is approximately 0.234. That is a useful benchmark for Gaussian random walks, not a universal target for every MH implementation.
Independence Sampler
In the independence sampler, the proposal ignores the current state:
$$q(x' \mid x) = q(x')$$
The acceptance ratio becomes $\alpha(x, x') = \min\left(1,\ \dfrac{\pi(x')\, q(x)}{\pi(x)\, q(x')}\right) = \min\left(1,\ \dfrac{w(x')}{w(x)}\right)$ with importance weights $w(x) = \pi(x)/q(x)$. This works well only if $q$ is a good approximation to $\pi$ with heavier tails.
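A sketch of the same loop with an independence proposal, using a heavier-tailed Student-t proposal for a standard normal target purely as an illustration; the helper names (`independence_mh`, `draw_proposal`, `log_proposal`) are assumptions, not library functions.

```python
import numpy as np
from scipy import stats

def independence_mh(log_target, draw_proposal, log_proposal, x0, n_steps, seed=None):
    """Independence sampler: candidates are drawn without reference to the current state."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    log_w = log_target(x) - log_proposal(x)   # log importance weight w(x) = pi~(x)/q(x)
    chain = np.empty(n_steps + 1)
    chain[0] = x
    for t in range(1, n_steps + 1):
        x_prop = draw_proposal(rng)
        log_w_prop = log_target(x_prop) - log_proposal(x_prop)
        # alpha = min(1, w(x') / w(x)), evaluated in log space
        if np.log(rng.uniform()) < log_w_prop - log_w:
            x, log_w = x_prop, log_w_prop
        chain[t] = x
    return chain

# Illustration: standard normal target, heavier-tailed t(3) proposal.
t3 = stats.t(df=3)
chain = independence_mh(
    log_target=lambda x: -0.5 * x**2,
    draw_proposal=lambda rng: t3.rvs(random_state=rng),
    log_proposal=t3.logpdf,
    x0=0.0,
    n_steps=5_000,
)
```

The weight-ratio form makes the failure mode visible: if the target had heavier tails than the proposal, a state far out in the tail would carry an enormous weight $w(x)$ and the chain could stick there for a very long time.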
Burn-in
The chain's initial samples are influenced by the starting point and do not yet represent draws from $\pi$. The burn-in period is the initial segment of the chain that is discarded before collecting samples for inference. Choosing the burn-in length is an art informed by convergence diagnostics.
Formally, burn-in discards samples $x^{(1)}, \dots, x^{(B)}$ and uses only $x^{(B+1)}, \dots, x^{(T)}$ for estimation. There is no universal formula for $B$; it depends on the mixing rate of the chain.
Diagnostics and Proposal Geometry
The practical failure mode of MH is not incorrectness; it is slow mixing. Two chains can both target the right stationary distribution and yet differ by orders of magnitude in usable effective sample size.
Three diagnostics matter more than raw acceptance rate:
- Trace behavior and between-chain agreement. If chains started from dispersed initial values still occupy different regions, you do not yet have evidence of practical convergence. See burn-in and convergence diagnostics.
- Effective sample size (ESS). High autocorrelation means thousands of MH steps may correspond to only tens of effectively independent draws; a rough ESS estimator is sketched after this list.
- Mode traversal. On multimodal targets, a random-walk proposal can look locally healthy while almost never crossing between important regions.
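To make the ESS point concrete, here is a rough sketch of an autocorrelation-based ESS estimate. It uses a simple truncation rule and is meant as an illustration, not the exact estimator implemented by any particular MCMC package.

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS for a 1-D chain: T / (1 + 2 * sum of leading positive autocorrelations)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    # Empirical autocovariances at lags 0..n-1, normalized to autocorrelations.
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0  # integrated autocorrelation time, initialized with the lag-0 term
    for lag in range(1, n):
        if rho[lag] <= 0:       # stop at the first non-positive autocorrelation
            break
        tau += 2.0 * rho[lag]
    return n / tau
```

Running this on a sticky random-walk chain typically reports an ESS that is orders of magnitude below the raw number of iterations, which is the quantitative version of the warning above.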
This is why plain random-walk MH becomes brittle in higher dimension. Proposal scales large enough to move globally are often rejected, while scales small enough to be accepted move only locally. Hamiltonian Monte Carlo was invented to escape this random-walk regime by proposing long, geometry-informed moves.
Where Plain MH Still Matters
Metropolis-Hastings remains valuable even when you would not deploy vanilla random-walk MH on a serious posterior:
- it is the cleanest proof template for why MCMC can target an unnormalized distribution;
- it still works well on low-dimensional targets, discrete spaces, and carefully engineered independence proposals;
- many specialized samplers are best understood as "better proposals inside the MH framework," including pseudo-marginal MH and some reversible-jump constructions.
Canonical Examples
Sampling from a mixture of Gaussians
Target: an equal-weight mixture of two unit-variance Gaussians with well-separated means, so $\pi$ has two distinct modes.
Using random walk MH with proposal $x' \sim \mathcal{N}(x, \sigma^2)$:
- If $\sigma$ is much smaller than the distance between the modes: most proposals are accepted, but the chain moves slowly and may get stuck in one mode for long stretches
- If $\sigma$ is much larger than the spread of the target: the chain proposes large jumps, but most are rejected because they land in low-probability regions
- If $\sigma$ is large enough to occasionally jump between modes yet small enough to be accepted reasonably often: the chain efficiently explores both modes
At each step, compute $\alpha = \min\left(1, \pi(x')/\pi(x)\right)$ (since the proposal is symmetric). A proposal that climbs toward a mode has $\pi(x') \ge \pi(x)$, so $\alpha = 1$ and the move is always accepted; a proposal that descends into lower-density territory is accepted only with probability $\pi(x')/\pi(x) < 1$.
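A self-contained sketch of this experiment. The specific mixture (equal-weight normals at $-3$ and $+3$ with unit variance) and the step sizes swept are illustrative choices, not values taken from the text above.

```python
import numpy as np

def log_mixture(x):
    # Unnormalized log density of 0.5*N(x; -3, 1) + 0.5*N(x; +3, 1) (constants dropped).
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def rw_mh(log_target, x0, n_steps, sigma, seed=0):
    rng = np.random.default_rng(seed)
    x, log_px = x0, log_target(x0)
    chain, accepted = np.empty(n_steps), 0
    for t in range(n_steps):
        x_prop = x + sigma * rng.standard_normal()
        log_px_prop = log_target(x_prop)
        if np.log(rng.uniform()) < log_px_prop - log_px:  # symmetric proposal: q cancels
            x, log_px, accepted = x_prop, log_px_prop, accepted + 1
        chain[t] = x
    return chain, accepted / n_steps

for sigma in (0.1, 2.0, 50.0):
    chain, acc = rw_mh(log_mixture, x0=-3.0, n_steps=20_000, sigma=sigma)
    # If both modes are visited in proportion, roughly half the samples satisfy x > 0.
    print(f"sigma={sigma:5.1f}  acceptance={acc:.2f}  fraction with x > 0: {np.mean(chain > 0):.2f}")
```

In a typical run, the smallest step size shows high acceptance but essentially never reaches the right-hand mode, the largest rejects the vast majority of proposals, and only the intermediate scale is both accepted often and able to cross between modes.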
Bayesian inference for a normal mean
Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Likelihood: $y_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ independently for $i = 1, \dots, n$, with $\sigma^2$ known.
The posterior is:
$$\mu \mid y \sim \mathcal{N}\!\left( \frac{\mu_0/\tau^2 + \sum_{i=1}^{n} y_i/\sigma^2}{1/\tau^2 + n/\sigma^2},\ \frac{1}{1/\tau^2 + n/\sigma^2} \right)$$
This is a case where the posterior is available in closed form (it is normal), so we can verify that MH gives the right answer. Using random walk MH, the chain samples converge to the known posterior normal distribution with a precision-weighted mean.
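A sketch of that verification; the hyperparameters, the simulated data, and the burn-in length are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conjugate setup: known sigma, normal prior on the mean mu.
mu_0, tau = 0.0, 10.0                   # prior mean and prior standard deviation
sigma = 2.0                             # known likelihood standard deviation
y = rng.normal(1.5, sigma, size=50)     # simulated observations

def log_post(mu):
    # log prior + log likelihood, additive constants dropped
    return -0.5 * ((mu - mu_0) / tau) ** 2 - 0.5 * np.sum((y - mu) ** 2) / sigma**2

# Random-walk MH on mu
n_steps, step = 20_000, 0.5
mu, log_p = 0.0, log_post(0.0)
chain = np.empty(n_steps)
for t in range(n_steps):
    prop = mu + step * rng.standard_normal()
    log_p_prop = log_post(prop)
    if np.log(rng.uniform()) < log_p_prop - log_p:
        mu, log_p = prop, log_p_prop
    chain[t] = mu
samples = chain[2_000:]                 # discard burn-in

# Closed-form precision-weighted posterior for comparison
prec = 1 / tau**2 + len(y) / sigma**2
post_mean = (mu_0 / tau**2 + y.sum() / sigma**2) / prec
print(f"MH estimate:  mean={samples.mean():.3f}  sd={samples.std():.3f}")
print(f"Closed form:  mean={post_mean:.3f}  sd={np.sqrt(1 / prec):.3f}")
```

The two printed lines should agree to within Monte Carlo error, which is exactly the sanity check described above.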
Common Confusions
MH does NOT require knowing the normalizing constant
This is the single most important practical feature of MH. Because the acceptance ratio involves the ratio $\pi(x')/\pi(x)$, any normalizing constant cancels:
$$\frac{\pi(x')}{\pi(x)} = \frac{\tilde{\pi}(x')/Z}{\tilde{\pi}(x)/Z} = \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}$$
You only need to evaluate the unnormalized target density. In Bayesian inference, this means you only need the prior times the likelihood; you never need to compute the evidence (marginal likelihood) $p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$.
Rejected proposals are NOT wasted
When a proposal is rejected and the chain stays at $x^{(t)}$, this is not a failure. It is part of the algorithm working correctly. Repeated copies of $x^{(t)}$ in the chain are needed to correctly represent the probability mass at that point. If a generic proposal were accepted 100% of the time, the chain would simply be sampling from the proposal and would not be reweighting regions of the state space according to $\pi$.
MH samples are NOT independent
Consecutive samples $x^{(t)}$ and $x^{(t+1)}$ are correlated because each depends on the previous. This autocorrelation reduces the effective sample size. If you need approximately independent samples, you can thin the chain (keep every $k$-th sample), though it is generally more efficient to simply run the chain longer and report the effective sample size.
A 0.234 acceptance rate is not a universal law
The famous 0.234 figure applies to one asymptotic regime: Gaussian random-walk proposals on high-dimensional product targets. It is a useful benchmark for that setting, not a universal optimum. Independence samplers, discrete proposals, heavy-tailed proposals, and geometry-aware methods can all have very different healthy acceptance rates.
High acceptance can still mean a bad sampler
If the proposal scale is tiny, acceptance can be near 100% while the chain barely moves. That is not good mixing; it is local dithering. Acceptance rate must be read together with ESS, trace behavior, and whether the chain actually traverses the relevant posterior mass.
Summary
- MH constructs a Markov chain with the target $\pi$ as its stationary distribution
- Acceptance ratio: $\alpha(x, x') = \min\left(1,\ \dfrac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$
- Only the unnormalized target density is needed; normalizing constants cancel
- Detailed balance is the core theoretical guarantee
- Ergodicity (irreducibility + aperiodicity) ensures convergence from any start
- Random walk MH: an acceptance rate near 0.234 is the classic high-dimensional benchmark, not a universal target
- Burn-in period must be discarded before using samples for inference
- Finite-time usefulness depends on proposal geometry, ESS, and whether the chain traverses modes in practice
Exercises
Problem
Suppose the proposal distribution is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$ for all $x, x'$. Derive the acceptance ratio and show that it depends only on the ratio $\pi(x')/\pi(x)$.
Problem
Verify the detailed balance condition for MH directly. That is, show:
$$\pi(x)\, q(x' \mid x)\, \alpha(x, x') = \pi(x')\, q(x \mid x')\, \alpha(x', x)$$
for the general (asymmetric) case.
Problem
Consider an independence sampler with a standard normal proposal $q(x) = \mathcal{N}(x; 0, 1)$ targeting the Laplace distribution $\pi(x) \propto e^{-|x|}$. Write down the acceptance ratio. Will this sampler work well? Why or why not?
References
Canonical:
- Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953), "Equation of State Calculations by Fast Computing Machines"
- Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications"
Current:
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 6-7
- Brooks, Gelman, Jones, Meng, Handbook of Markov Chain Monte Carlo (2011), Chapter 1
- Tierney, "Markov Chains for Exploring Posterior Distributions" (Annals of Statistics, 1994)
- Roberts, Gelman, and Gilks, "Weak convergence and optimal scaling of random walk Metropolis algorithms" (Annals of Applied Probability, 1997)
- Gelman et al., Bayesian Data Analysis (2013), Chapters 11-12
Next Topics
The natural next steps from Metropolis-Hastings:
- Gibbs sampling: a special case of MH where the acceptance rate is always 1
- Burn-in and convergence diagnostics: how to tell a theoretically correct chain from a practically useless one
- Hamiltonian Monte Carlo: using gradient information for efficient proposals in high dimensions
Last reviewed: April 25, 2026