
Variance Reduction Techniques

Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Monte Carlo estimation has a fundamental limitation: the standard error of the mean decreases as $O(1/\sqrt{n})$. To halve the error, you need four times as many samples. Variance reduction techniques break this bottleneck by using structure in the problem to get more information per sample. In Bayesian inference, reinforcement learning, and simulation, these techniques can reduce computation by orders of magnitude.

Mental Model

Imagine estimating the average height of people in a city by random sampling. Naive: pick people at random. Smarter: sample equal numbers from each neighborhood (stratification). Even smarter: if you know the average income of each neighborhood and income correlates with height, use that information to adjust your estimate (control variates). The idea is always the same: use what you already know to reduce uncertainty in what you do not.

Formal Setup and Notation

We want to estimate $\mu = \mathbb{E}[f(X)]$ where $X \sim p$. The naive Monte Carlo estimator is:

$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \quad X_i \sim p \text{ i.i.d.}$$

with variance $\mathrm{Var}(\hat{\mu}_n) = \sigma^2 / n$ where $\sigma^2 = \mathrm{Var}(f(X))$. Variance reduction constructs an alternative estimator $\hat{\mu}_n'$ with $\mathrm{Var}(\hat{\mu}_n') < \mathrm{Var}(\hat{\mu}_n)$ while keeping $\mathbb{E}[\hat{\mu}_n'] = \mu$.
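As a concrete baseline for what follows, here is a minimal naive Monte Carlo estimator in Python; the integrand $e^x$ and the sample size are illustrative choices, not from the text:

```python
import math
import random

def naive_mc(f, n, rng):
    """Plain Monte Carlo for E[f(U)], U ~ Uniform(0,1): returns the sample
    mean and its estimated standard error, which shrinks as 1/sqrt(n)."""
    vals = [f(rng.random()) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(0)
# Target: E[e^U] = integral of e^x over [0,1] = e - 1
est, se = naive_mc(math.exp, 10_000, rng)
```

To halve `se`, you must quadruple `n`; the techniques below instead shrink the constant in front of $1/\sqrt{n}$.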

Definition

Antithetic Variates

Generate pairs $(X_i, X_i')$ that are negatively correlated but each marginally distributed as $p$. The estimator:

$$\hat{\mu}_n^{\text{AV}} = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i) + f(X_i')}{2}$$

has variance $\frac{1}{n}\left[\frac{\sigma^2}{2} + \frac{1}{2}\mathrm{Cov}(f(X), f(X'))\right]$. When $\mathrm{Cov}(f(X), f(X')) < 0$, this beats the naive estimator on the same budget of $2n$ function evaluations, whose variance is $\sigma^2/(2n)$.

For example, if $X \sim \mathrm{Uniform}(0,1)$, set $X' = 1 - X$. Then $X'$ is also $\mathrm{Uniform}(0,1)$ but negatively correlated with $X$. If $f$ is monotone, $f(X)$ and $f(1-X)$ are negatively correlated, so the variance drops.

Definition

Control Variates

Let $g(X)$ be a function whose expectation $\mathbb{E}[g(X)] = \mu_g$ is known. The control variate estimator is:

$$\hat{\mu}_n^{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} \left[f(X_i) - c\,(g(X_i) - \mu_g)\right]$$

for some coefficient $c$. This is unbiased for any $c$ because $\mathbb{E}[g(X_i) - \mu_g] = 0$. Its variance is:

$$\mathrm{Var}(\hat{\mu}_n^{\text{CV}}) = \frac{1}{n}\left[\sigma^2 - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)\right]$$

Definition

Stratified Sampling

Partition the sample space into disjoint strata $\Omega_1, \ldots, \Omega_K$ with $P(\Omega_k) = w_k$. Sample $n_k$ points from each stratum (the conditional distribution $p(\cdot \mid \Omega_k)$) and combine:

$$\hat{\mu}^{\text{strat}} = \sum_{k=1}^{K} w_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} f(X_i^{(k)})$$

Under proportional allocation $n_k = n w_k$, the variance is $\frac{1}{n}\sum_k w_k \sigma_k^2$ where $\sigma_k^2 = \mathrm{Var}(f(X) \mid \Omega_k)$. This is always at most $\sigma^2/n$, with strict improvement whenever the stratum means $\mu_k = \mathbb{E}[f(X) \mid \Omega_k]$ differ (the reduction equals $\frac{1}{n}\sum_k w_k (\mu_k - \mu)^2$, the between-stratum variance). With arbitrary $n_k$, stratified sampling can be worse than naive Monte Carlo; Neyman-optimal allocation $n_k \propto w_k \sigma_k$ is optimal when the $\sigma_k$ are known.
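A minimal sketch of proportional allocation with equal-width strata of $\mathrm{Uniform}(0,1)$; the integrand $e^x$ and the stratum count are illustrative choices:

```python
import math
import random

def stratified_mc(f, n, K, rng):
    """Proportional-allocation stratification of Uniform(0,1):
    w_k = 1/K, n_k = n/K, and within stratum k we draw
    U ~ Uniform((k-1)/K, k/K)."""
    n_k = n // K
    total = 0.0
    for k in range(K):
        lo = k / K
        total += sum(f(lo + rng.random() / K) for _ in range(n_k)) / n_k
    return total / K  # sum_k w_k * mu_hat_k with w_k = 1/K

rng = random.Random(0)
truth = math.e - 1.0  # integral of e^x over [0,1]
est = stratified_mc(math.exp, n=1000, K=10, rng=rng)
```

With $K = 10$ strata, only the small within-stratum variance of $e^x$ remains, so 1000 samples already give a very tight estimate.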

Definition

Rao-Blackwellization

If $X = (U, V)$ and you can compute $h(u) = \mathbb{E}[f(U,V) \mid U = u]$ analytically, then replace $f(X_i)$ with $h(U_i)$:

$$\hat{\mu}_n^{\text{RB}} = \frac{1}{n} \sum_{i=1}^{n} h(U_i)$$

By the law of total variance, $\mathrm{Var}(h(U)) = \mathrm{Var}(f(X)) - \mathbb{E}[\mathrm{Var}(f(X) \mid U)] \leq \mathrm{Var}(f(X))$. Conditioning out part of the randomness never increases the variance.
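A small illustration; the integrand $\cos(U+V)$ and the distributions are my choices. With $V \sim N(0,1)$, the conditional expectation $\mathbb{E}[\cos(u+V)] = \cos(u)\,e^{-1/2}$ is available in closed form, so $V$ can be integrated out analytically:

```python
import math
import random

rng = random.Random(0)
n = 20_000
us = [rng.random() for _ in range(n)]         # U ~ Uniform(0, 1)
vs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # V ~ N(0, 1)

# Crude estimator of E[cos(U + V)]: average f over joint samples.
crude = [math.cos(u + v) for u, v in zip(us, vs)]

# Rao-Blackwellized: h(u) = E[cos(u + V) | U = u] = cos(u) * exp(-1/2)
# (from the N(0,1) characteristic function), so V never needs sampling.
rb = [math.cos(u) * math.exp(-0.5) for u in us]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_crude, v_crude = mean_var(crude)
m_rb, v_rb = mean_var(rb)
```

Both sample means estimate the same quantity, but `v_rb` is far smaller than `v_crude`: the variance contributed by $V$ has been removed analytically.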

Main Theorems

Proposition

Optimal Control Variate Coefficient

Statement

The variance of the control variate estimator $\hat{\mu}^{\text{CV}} = f(X) - c\,(g(X) - \mu_g)$ is minimized by:

$$c^* = \frac{\mathrm{Cov}(f(X), g(X))}{\mathrm{Var}(g(X))}$$

The minimum variance is:

$$\mathrm{Var}(f - c^* g) = \mathrm{Var}(f)\,(1 - \rho_{fg}^2)$$

where $\rho_{fg}$ is the correlation between $f(X)$ and $g(X)$.

Intuition

The optimal $c^*$ is the regression coefficient of $f$ on $g$. The variance reduction factor is $1 - \rho_{fg}^2$: if $f$ and $g$ are highly correlated ($|\rho_{fg}| \approx 1$), almost all the variance is eliminated. The control variate subtracts the part of $f$ "explained" by $g$.

Proof Sketch

$\mathrm{Var}(f - cg) = \mathrm{Var}(f) - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)$. This is a quadratic in $c$ with minimum at $c^* = \mathrm{Cov}(f,g)/\mathrm{Var}(g)$. Substitute back to get $\mathrm{Var}(f)(1 - \rho_{fg}^2)$.

Why It Matters

This tells you exactly how powerful a control variate will be: it depends entirely on the correlation $\rho_{fg}$. With $\rho_{fg}^2 = 0.99$, you reduce the variance by $99\%$, equivalent to using $100\times$ more samples. In practice, you estimate $c^*$ from data, which adds small overhead.
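A sketch of the plug-in procedure, estimating $c^*$ from the same samples; the target $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0,1)$ and the control $g(X) = X$ with known mean $1/2$ are illustrative choices:

```python
import math
import random

rng = random.Random(0)
n = 10_000
xs = [rng.random() for _ in range(n)]
f = [math.exp(x) for x in xs]  # target: E[e^X] = e - 1
g = xs                          # control: E[X] = 1/2 is known exactly

fbar = sum(f) / n
gbar = sum(g) / n
cov_fg = sum((a - fbar) * (b - gbar) for a, b in zip(f, g)) / (n - 1)
var_g = sum((b - gbar) ** 2 for b in g) / (n - 1)
c_star = cov_fg / var_g  # plug-in estimate of Cov(f, g) / Var(g)

cv_est = fbar - c_star * (gbar - 0.5)
```

Since $e^x$ is nearly linear on $[0,1]$, $\rho_{fg}^2$ is close to 1 here and the control variate removes most of the variance.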

Failure Mode

If $\rho_{fg} \approx 0$, the control variate does nothing. Also, estimating $c^*$ from the same samples introduces a small bias in finite samples (the estimator involves a ratio of estimated quantities). With large $n$, this bias is negligible.

Theorem

Rao-Blackwell Theorem (Variance Reduction)

Statement

Let $h(U) = \mathbb{E}[f(U,V) \mid U]$. Then:

$$\mathbb{E}[h(U)] = \mathbb{E}[f(X)], \qquad \mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$$

with equality if and only if $f(X) = h(U)$ almost surely (i.e., $f$ does not actually depend on $V$).

Intuition

By conditioning on UU, you analytically average out the randomness in VV. This removes the component of variance due to VV, leaving only the variance due to UU. It is like having infinitely many samples of VV for each value of UU.

Proof Sketch

By the law of total variance: $\mathrm{Var}(f(X)) = \mathrm{Var}(\mathbb{E}[f \mid U]) + \mathbb{E}[\mathrm{Var}(f \mid U)] = \mathrm{Var}(h(U)) + \mathbb{E}[\mathrm{Var}(f \mid U)]$. Since $\mathbb{E}[\mathrm{Var}(f \mid U)] \geq 0$, we get $\mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$.

Why It Matters

Rao-Blackwellization is the most principled variance reduction technique: it never hurts, and it strictly helps whenever $f$ genuinely depends on $V$. In Gibbs sampling, if you can compute conditional expectations for some variables analytically, always do so. It is free variance reduction.

Failure Mode

The technique requires being able to compute $\mathbb{E}[f \mid U]$ analytically, which is often intractable. It also requires choosing the partition $(U, V)$ wisely. If $V$ contributes little variance, the reduction is small.

Canonical Examples

Example

Control variate for option pricing

Estimating $\mathbb{E}[\max(S_T - K, 0)]$ for a call option. Use $g(X) = S_T$ (the terminal stock price) as a control variate. Under risk-neutral pricing, $\mathbb{E}[S_T] = S_0 e^{rT}$ is known. Since the payoff is highly correlated with $S_T$, this dramatically reduces variance.
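A sketch under Black-Scholes (geometric Brownian motion) dynamics; the spot, strike, rate, and volatility below are hypothetical parameters chosen for illustration:

```python
import math
import random

# Hypothetical market parameters (not from the text).
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
rng = random.Random(0)
n = 50_000

s_t, payoff = [], []
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    # Risk-neutral terminal price: S_T = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z)
    s = S0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    s_t.append(s)
    payoff.append(math.exp(-r * T) * max(s - K, 0.0))  # discounted call payoff

mu_g = S0 * math.exp(r * T)  # known control mean: E[S_T] = S0 e^{rT}
fbar = sum(payoff) / n
gbar = sum(s_t) / n
cov = sum((f - fbar) * (g - gbar) for f, g in zip(payoff, s_t)) / (n - 1)
varg = sum((g - gbar) ** 2 for g in s_t) / (n - 1)
cv_price = fbar - (cov / varg) * (gbar - mu_g)
```

The estimate should land near the Black-Scholes price for these parameters (about 10.45), with a smaller standard error than the plain average `fbar`.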

Example

Antithetic variates for integral estimation

Estimate $\int_0^1 e^x \, dx$. Generate $U_i \sim \mathrm{Uniform}(0,1)$ and use pairs $(U_i, 1-U_i)$. Since $e^x$ is increasing, $e^{U_i}$ and $e^{1-U_i}$ are negatively correlated. The estimator $\frac{1}{2}(e^{U_i} + e^{1-U_i})$ has lower variance than $e^{U_i}$ alone.
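This example can be run directly. For a fair comparison, both estimators below spend two function evaluations per term:

```python
import math
import random

rng = random.Random(0)
n = 5_000  # number of pairs, so 10,000 evaluations each way

plain, anti = [], []
for _ in range(n):
    u1, u2 = rng.random(), rng.random()
    plain.append(0.5 * (math.exp(u1) + math.exp(u2)))       # two independent draws
    anti.append(0.5 * (math.exp(u1) + math.exp(1.0 - u1)))  # antithetic pair

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_anti = sum(anti) / n  # estimates e - 1
```

Because $\mathrm{Cov}(e^U, e^{1-U}) = e - (e-1)^2 < 0$, the per-pair variance of `anti` is a small fraction of that of `plain`.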

Adjacent Techniques

Two techniques sit next to the four above and are worth naming even though they are covered in more detail elsewhere.

Common random numbers (CRN). When estimating a difference $\mathbb{E}[f(X)] - \mathbb{E}[g(X)]$ (for example, the effect of a policy change or a design parameter), reuse the same underlying random numbers for both simulations. The variance of the difference becomes $\mathrm{Var}(f(X) - g(X)) = \mathrm{Var}(f(X)) + \mathrm{Var}(g(X)) - 2\,\mathrm{Cov}(f(X), g(X))$. When $f$ and $g$ are similar, the positive covariance cancels most of the variance, and the paired estimator is dramatically more efficient than two independent estimators. This is the Monte Carlo analog of a paired $t$-test and is standard in simulation-based A/B comparisons.
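A minimal sketch with two hypothetical "systems"; the functions below are illustrative stand-ins for two simulation configurations:

```python
import math
import random

def f(u):  # "system A" (illustrative)
    return math.exp(u)

def g(u):  # "system B": a small perturbation of A (illustrative)
    return math.exp(0.9 * u)

rng = random.Random(0)
n = 10_000

# Independent streams: fresh randomness for each system.
indep = [f(rng.random()) - g(rng.random()) for _ in range(n)]

# Common random numbers: the same draw u feeds both systems.
crn = [f(u) - g(u) for u in (rng.random() for _ in range(n))]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_crn = sum(crn) / n
```

Both estimators are unbiased for the same difference of means, but the paired `crn` terms have far lower variance because the positive covariance between the two systems cancels.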

Quasi-Monte Carlo (QMC). Replace i.i.d. samples with a low-discrepancy sequence (Sobol', Halton, Niederreiter nets). For smooth integrands in dimension $d$, the Koksma-Hlawka inequality gives error $O((\log n)^d / n)$, which beats Monte Carlo's $O(1/\sqrt{n})$ for moderate $d$. Randomized QMC (scrambled Sobol') gives unbiased estimators with variance that can decrease as $O(n^{-3})$ for sufficiently smooth integrands. QMC does not slot into a plug-and-play "reduce variance" recipe. It replaces the sampling scheme itself and breaks the i.i.d. assumption underlying ordinary variance formulas. For finance and simulation-heavy ML pipelines (e.g., expectation computations in variational objectives), QMC is often the single biggest gain available.
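A self-contained illustration using the base-2 van der Corput sequence (the one-dimensional building block of the Halton sequence), rather than a library Sobol' generator; the integrand $e^x$ is again an illustrative choice:

```python
import math

def van_der_corput(n, base=2):
    """First n points of the base-b van der Corput low-discrepancy sequence:
    point i is the radical inverse of i in the given base."""
    points = []
    for i in range(n):
        q, denom, k = 0.0, 1.0, i
        while k:
            denom *= base
            k, rem = divmod(k, base)
            q += rem / denom
        points.append(q)
    return points

n = 4096  # a power of the base gives an exactly equidistributed point set
qmc_est = sum(math.exp(u) for u in van_der_corput(n)) / n  # estimates e - 1
```

For $n = 2^{12}$ the points are a permutation of $\{i/n\}$, so the error here is $O(1/n)$ rather than the $O(1/\sqrt{n})$ of i.i.d. sampling.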

Common Confusions

Watch Out

Variance reduction does not change the rate

Antithetic, control, stratified, and Rao-Blackwell estimators all reduce the constant in $\mathrm{Var} = C/n$ but keep the $1/n$ variance rate. You still need $4\times$ the samples to halve the standard error. The improvement is in the constant $C$, which can be enormous in practice but does not change the asymptotic rate.

QMC is the one exception: by abandoning i.i.d. sampling it achieves faster-than-$1/\sqrt{n}$ error decay on sufficiently smooth integrands. It is a genuinely different estimation regime, not a variance-reduction add-on to Monte Carlo.

Watch Out

Control variates require known expectations

The control variate $g$ must have a known mean $\mu_g$. If you have to estimate $\mu_g$, it is no longer a control variate. It becomes an importance sampling or regression adjustment problem.

Summary

  • Antithetic variates: use negative correlation between sample pairs
  • Control variates: subtract a known-mean quantity correlated with the target; the optimal coefficient is $\mathrm{Cov}(f,g)/\mathrm{Var}(g)$
  • Stratification: partition the space and sample within strata; proportional allocation never does worse than naive sampling
  • Rao-Blackwellization: condition out part of the randomness analytically; guaranteed by the law of total variance never to increase variance
  • Common random numbers: pair simulations that share the same random numbers when estimating differences of expectations
  • Quasi-Monte Carlo: replace i.i.d. samples with a low-discrepancy sequence for faster-than-$1/\sqrt{n}$ convergence on smooth integrands
  • Classical variance reduction changes the constant, not the $O(1/\sqrt{n})$ rate; QMC changes the rate itself

Exercises

ExerciseCore

Problem

You want to estimate $\mathbb{E}[e^X]$ where $X \sim N(0,1)$. Explain how to use antithetic variates and why it reduces variance.

ExerciseAdvanced

Problem

Derive the optimal control variate coefficient $c^*$ and show that the minimized variance is $(1 - \rho^2)\,\mathrm{Var}(f)$, where $\rho$ is the correlation between $f(X)$ and $g(X)$.

References

Canonical:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapter 4
  • Ross, Simulation (2012), Chapter 9
  • Glasserman, Monte Carlo Methods in Financial Engineering (Springer, 2004), Chapters 4 (variance reduction), 5 (QMC) — the standard practitioner reference; all four classical techniques plus CRN and QMC are treated with concrete examples

Current:

  • Owen, Monte Carlo Theory, Methods, and Examples (2013), Chapters 8-9 (variance reduction) and 15-17 (QMC) — open-access textbook
  • Dick, Kuo, Sloan, "High-dimensional integration: the quasi-Monte Carlo way," Acta Numerica 22 (2013), 133-288 — modern QMC theory
  • Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods (SIAM, 1992) — foundational QMC monograph
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Last reviewed: April 13, 2026
