
Variance Reduction Techniques

Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Monte Carlo estimation has a fundamental limitation: the standard error of the mean decreases as $O(1/\sqrt{n})$. To halve the error, you need four times as many samples. Variance reduction techniques break this bottleneck by using structure in the problem to get more information per sample. In Bayesian inference, reinforcement learning, and simulation, these techniques can reduce computation by orders of magnitude.

Mental Model

Imagine estimating the average height of people in a city by random sampling. Naive: pick people at random. Smarter: sample equal numbers from each neighborhood (stratification). Even smarter: if you know the average income of each neighborhood and income correlates with height, use that information to adjust your estimate (control variates). The idea is always the same: use what you already know to reduce uncertainty in what you do not.

Formal Setup and Notation

We want to estimate $\mu = \mathbb{E}[f(X)]$ where $X \sim p$. The naive Monte Carlo estimator is:

$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \quad X_i \sim p \text{ i.i.d.}$$

with variance $\mathrm{Var}(\hat{\mu}_n) = \sigma^2 / n$ where $\sigma^2 = \mathrm{Var}(f(X))$. Variance reduction constructs an alternative estimator $\hat{\mu}_n'$ with $\mathrm{Var}(\hat{\mu}_n') < \mathrm{Var}(\hat{\mu}_n)$ while keeping $\mathbb{E}[\hat{\mu}_n'] = \mu$.
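As a concrete baseline for what follows, here is a minimal naive Monte Carlo estimator in Python; the integrand $e^x$ and the sample size are illustrative choices, not from the text:

```python
import math
import random

def naive_mc(f, n, rng):
    """Plain Monte Carlo for E[f(U)], U ~ Uniform(0,1): returns the sample
    mean and its estimated standard error, which shrinks as 1/sqrt(n)."""
    vals = [f(rng.random()) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(0)
# Target: E[e^U] = integral of e^x over [0,1] = e - 1
est, se = naive_mc(math.exp, 10_000, rng)
```

To halve `se`, you must quadruple `n`; the techniques below instead shrink the constant in front of $1/\sqrt{n}$.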

Definition

Antithetic Variates

Generate pairs $(X_i, X_i')$ that are negatively correlated but each marginally distributed as $p$. The estimator:

$$\hat{\mu}_n^{\text{AV}} = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i) + f(X_i')}{2}$$

has variance $\frac{1}{n}\left[\frac{\sigma^2}{2} + \frac{1}{2}\mathrm{Cov}(f(X), f(X'))\right]$. When $\mathrm{Cov}(f(X), f(X')) < 0$, this beats the naive estimator on the same budget of $2n$ function evaluations, whose variance is $\sigma^2/(2n)$.

For example, if $X \sim \mathrm{Uniform}(0,1)$, set $X' = 1 - X$. Then $X'$ is also $\mathrm{Uniform}(0,1)$ but negatively correlated with $X$. If $f$ is monotone, $f(X)$ and $f(1-X)$ are negatively correlated, so the variance drops.

Definition

Control Variates

Let $g(X)$ be a function whose expectation $\mathbb{E}[g(X)] = \mu_g$ is known. The control variate estimator is:

$$\hat{\mu}_n^{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} \left[f(X_i) - c\,(g(X_i) - \mu_g)\right]$$

for some coefficient $c$. This is unbiased for any $c$ because $\mathbb{E}[g(X_i) - \mu_g] = 0$. Its variance is:

$$\mathrm{Var}(\hat{\mu}_n^{\text{CV}}) = \frac{1}{n}\left[\sigma^2 - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)\right]$$

Definition

Stratified Sampling

Partition the sample space into disjoint strata $\Omega_1, \ldots, \Omega_K$ with $P(\Omega_k) = w_k$. Sample $n_k$ points from each stratum (the conditional distribution $p(\cdot \mid \Omega_k)$) and combine:

$$\hat{\mu}^{\text{strat}} = \sum_{k=1}^{K} w_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} f(X_i^{(k)})$$

Under proportional allocation $n_k = n w_k$, the variance is $\frac{1}{n}\sum_k w_k \sigma_k^2$ where $\sigma_k^2 = \mathrm{Var}(f(X) \mid \Omega_k)$. This is always at most $\sigma^2/n$, with strict improvement whenever the stratum means $\mu_k = \mathbb{E}[f(X) \mid \Omega_k]$ differ (the reduction equals $\frac{1}{n}\sum_k w_k (\mu_k - \mu)^2$, the between-stratum variance). With arbitrary $n_k$, stratified sampling can be worse than naive Monte Carlo; Neyman-optimal allocation $n_k \propto w_k \sigma_k$ is optimal when the $\sigma_k$ are known.
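A minimal sketch of proportional allocation with equal-width strata of $\mathrm{Uniform}(0,1)$; the integrand $e^x$ and the stratum count are illustrative choices:

```python
import math
import random

def stratified_mc(f, n, K, rng):
    """Proportional-allocation stratification of Uniform(0,1):
    w_k = 1/K, n_k = n/K, and within stratum k we draw
    U ~ Uniform((k-1)/K, k/K)."""
    n_k = n // K
    total = 0.0
    for k in range(K):
        lo = k / K
        total += sum(f(lo + rng.random() / K) for _ in range(n_k)) / n_k
    return total / K  # sum_k w_k * mu_hat_k with w_k = 1/K

rng = random.Random(0)
truth = math.e - 1.0  # integral of e^x over [0,1]
est = stratified_mc(math.exp, n=1000, K=10, rng=rng)
```

With $K = 10$ strata, only the small within-stratum variance of $e^x$ remains, so 1000 samples already give a very tight estimate.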

Definition

Rao-Blackwellization

If $X = (U, V)$ and you can compute $h(u) = \mathbb{E}[f(U,V) \mid U = u]$ analytically, then replace $f(X_i)$ with $h(U_i)$:

$$\hat{\mu}_n^{\text{RB}} = \frac{1}{n} \sum_{i=1}^{n} h(U_i)$$

By the law of total variance, $\mathrm{Var}(h(U)) = \mathrm{Var}(f(X)) - \mathbb{E}[\mathrm{Var}(f(X) \mid U)] \leq \mathrm{Var}(f(X))$. Conditioning out part of the randomness never increases the variance.
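A small illustration; the integrand $\cos(U+V)$ and the distributions are my choices. With $V \sim N(0,1)$, the conditional expectation $\mathbb{E}[\cos(u+V)] = \cos(u)\,e^{-1/2}$ is available in closed form, so $V$ can be integrated out analytically:

```python
import math
import random

rng = random.Random(0)
n = 20_000
us = [rng.random() for _ in range(n)]         # U ~ Uniform(0, 1)
vs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # V ~ N(0, 1)

# Crude estimator of E[cos(U + V)]: average f over joint samples.
crude = [math.cos(u + v) for u, v in zip(us, vs)]

# Rao-Blackwellized: h(u) = E[cos(u + V) | U = u] = cos(u) * exp(-1/2)
# (from the N(0,1) characteristic function), so V never needs sampling.
rb = [math.cos(u) * math.exp(-0.5) for u in us]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_crude, v_crude = mean_var(crude)
m_rb, v_rb = mean_var(rb)
```

Both sample means estimate the same quantity, but `v_rb` is far smaller than `v_crude`: the variance contributed by $V$ has been removed analytically.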

Main Theorems

Proposition

Optimal Control Variate Coefficient

Statement

The variance of the control variate estimator $\hat{\mu}^{\text{CV}} = f(X) - c\,(g(X) - \mu_g)$ is minimized by:

$$c^* = \frac{\mathrm{Cov}(f(X), g(X))}{\mathrm{Var}(g(X))}$$

The minimum variance is:

$$\mathrm{Var}(f - c^* g) = \mathrm{Var}(f)\,(1 - \rho_{fg}^2)$$

where $\rho_{fg}$ is the correlation between $f(X)$ and $g(X)$.

Intuition

The optimal $c^*$ is the regression coefficient of $f$ on $g$. The variance reduction factor is $1 - \rho_{fg}^2$: if $f$ and $g$ are highly correlated ($|\rho_{fg}| \approx 1$), almost all the variance is eliminated. The control variate subtracts the part of $f$ "explained" by $g$.

Proof Sketch

$\mathrm{Var}(f - cg) = \mathrm{Var}(f) - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)$. This is a quadratic in $c$ with minimum at $c^* = \mathrm{Cov}(f,g)/\mathrm{Var}(g)$. Substitute back to get $\mathrm{Var}(f)(1 - \rho_{fg}^2)$.

Why It Matters

This tells you exactly how powerful a control variate will be: it depends entirely on the correlation $\rho_{fg}$. With $\rho_{fg}^2 = 0.99$, you reduce the variance by $99\%$, equivalent to using $100\times$ more samples. In practice, you estimate $c^*$ from data, which adds small overhead.
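A sketch of the plug-in procedure, estimating $c^*$ from the same samples; the target $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0,1)$ and the control $g(X) = X$ with known mean $1/2$ are illustrative choices:

```python
import math
import random

rng = random.Random(0)
n = 10_000
xs = [rng.random() for _ in range(n)]
f = [math.exp(x) for x in xs]  # target: E[e^X] = e - 1
g = xs                          # control: E[X] = 1/2 is known exactly

fbar = sum(f) / n
gbar = sum(g) / n
cov_fg = sum((a - fbar) * (b - gbar) for a, b in zip(f, g)) / (n - 1)
var_g = sum((b - gbar) ** 2 for b in g) / (n - 1)
c_star = cov_fg / var_g  # plug-in estimate of Cov(f, g) / Var(g)

cv_est = fbar - c_star * (gbar - 0.5)
```

Since $e^x$ is nearly linear on $[0,1]$, $\rho_{fg}^2$ is close to 1 here and the control variate removes most of the variance.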

Failure Mode

If $\rho_{fg} \approx 0$, the control variate does nothing. Also, estimating $c^*$ from the same samples introduces a small bias in finite samples (the estimator involves a ratio of estimated quantities). With large $n$, this bias is negligible.

Theorem

Rao-Blackwell Theorem (Variance Reduction)

Statement

Let $h(U) = \mathbb{E}[f(U,V) \mid U]$. Then:

$$\mathbb{E}[h(U)] = \mathbb{E}[f(X)], \qquad \mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$$

with equality if and only if $f(X) = h(U)$ almost surely (i.e., $f$ does not actually depend on $V$).

Intuition

By conditioning on UU, you analytically average out the randomness in VV. This removes the component of variance due to VV, leaving only the variance due to UU. It is like having infinitely many samples of VV for each value of UU.

Proof Sketch

By the law of total variance: $\mathrm{Var}(f(X)) = \mathrm{Var}(\mathbb{E}[f \mid U]) + \mathbb{E}[\mathrm{Var}(f \mid U)] = \mathrm{Var}(h(U)) + \mathbb{E}[\mathrm{Var}(f \mid U)]$. Since $\mathbb{E}[\mathrm{Var}(f \mid U)] \geq 0$, we get $\mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$.

Why It Matters

Rao-Blackwellization is the most principled variance reduction technique: it never hurts, and it strictly helps whenever $f$ genuinely depends on $V$. In Gibbs sampling, if you can compute conditional expectations for some variables analytically, always do so. It is free variance reduction.

Failure Mode

The technique requires being able to compute $\mathbb{E}[f \mid U]$ analytically, which is often intractable. It also requires choosing the partition $(U, V)$ wisely. If $V$ contributes little variance, the reduction is small.

Canonical Examples

Example

Control variate for option pricing

Estimating $\mathbb{E}[\max(S_T - K, 0)]$ for a call option. Use $g(X) = S_T$ (the terminal stock price) as a control variate. Under risk-neutral pricing, $\mathbb{E}[S_T] = S_0 e^{rT}$ is known. Since the payoff is highly correlated with $S_T$, this dramatically reduces variance.
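A sketch under Black-Scholes (geometric Brownian motion) dynamics; the spot, strike, rate, and volatility below are hypothetical parameters chosen for illustration:

```python
import math
import random

# Hypothetical market parameters (not from the text).
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
rng = random.Random(0)
n = 50_000

s_t, payoff = [], []
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    # Risk-neutral terminal price: S_T = S0 * exp((r - sigma^2/2) T + sigma sqrt(T) Z)
    s = S0 * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    s_t.append(s)
    payoff.append(math.exp(-r * T) * max(s - K, 0.0))  # discounted call payoff

mu_g = S0 * math.exp(r * T)  # known control mean: E[S_T] = S0 e^{rT}
fbar = sum(payoff) / n
gbar = sum(s_t) / n
cov = sum((f - fbar) * (g - gbar) for f, g in zip(payoff, s_t)) / (n - 1)
varg = sum((g - gbar) ** 2 for g in s_t) / (n - 1)
cv_price = fbar - (cov / varg) * (gbar - mu_g)
```

The estimate should land near the Black-Scholes price for these parameters (about 10.45), with a smaller standard error than the plain average `fbar`.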

Example

Antithetic variates for integral estimation

Estimate $\int_0^1 e^x \, dx$. Generate $U_i \sim \mathrm{Uniform}(0,1)$ and use pairs $(U_i, 1-U_i)$. Since $e^x$ is increasing, $e^{U_i}$ and $e^{1-U_i}$ are negatively correlated. The estimator $\frac{1}{2}(e^{U_i} + e^{1-U_i})$ has lower variance than $e^{U_i}$ alone.
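This example can be run directly. For a fair comparison, both estimators below spend two function evaluations per term:

```python
import math
import random

rng = random.Random(0)
n = 5_000  # number of pairs, so 10,000 evaluations each way

plain, anti = [], []
for _ in range(n):
    u1, u2 = rng.random(), rng.random()
    plain.append(0.5 * (math.exp(u1) + math.exp(u2)))       # two independent draws
    anti.append(0.5 * (math.exp(u1) + math.exp(1.0 - u1)))  # antithetic pair

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_anti = sum(anti) / n  # estimates e - 1
```

Because $\mathrm{Cov}(e^U, e^{1-U}) = e - (e-1)^2 < 0$, the per-pair variance of `anti` is a small fraction of that of `plain`.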

Adjacent Techniques

Two techniques sit next to the four above and are worth naming even though they are covered in more detail elsewhere.

Common random numbers (CRN). When estimating a difference $\mathbb{E}[f(X)] - \mathbb{E}[g(X)]$ (for example, the effect of a policy change or a design parameter), reuse the same underlying random numbers for both simulations. The variance of the difference becomes $\mathrm{Var}(f(X) - g(X)) = \mathrm{Var}(f(X)) + \mathrm{Var}(g(X)) - 2\,\mathrm{Cov}(f(X), g(X))$. When $f$ and $g$ are similar, the positive covariance cancels most of the variance, and the paired estimator is dramatically more efficient than two independent estimators. This is the Monte Carlo analog of a paired $t$-test and is standard in simulation-based A/B comparisons.
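A minimal sketch with two hypothetical "systems"; the functions below are illustrative stand-ins for two simulation configurations:

```python
import math
import random

def f(u):  # "system A" (illustrative)
    return math.exp(u)

def g(u):  # "system B": a small perturbation of A (illustrative)
    return math.exp(0.9 * u)

rng = random.Random(0)
n = 10_000

# Independent streams: fresh randomness for each system.
indep = [f(rng.random()) - g(rng.random()) for _ in range(n)]

# Common random numbers: the same draw u feeds both systems.
crn = [f(u) - g(u) for u in (rng.random() for _ in range(n))]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

m_crn = sum(crn) / n
```

Both estimators are unbiased for the same difference of means, but the paired `crn` terms have far lower variance because the positive covariance between the two systems cancels.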

Quasi-Monte Carlo (QMC). Replace i.i.d. samples with a low-discrepancy sequence (Sobol', Halton, Niederreiter nets). For smooth integrands in dimension $d$, the Koksma-Hlawka inequality gives error $O((\log n)^d / n)$, which beats Monte Carlo's $O(1/\sqrt{n})$ for moderate $d$. Randomized QMC (scrambled Sobol') gives unbiased estimators with variance that can decrease as $O(n^{-3})$ for sufficiently smooth integrands. QMC does not slot into a plug-and-play "reduce variance" recipe. It replaces the sampling scheme itself and breaks the i.i.d. assumption underlying ordinary variance formulas. For finance and simulation-heavy ML pipelines (e.g., expectation computations in variational objectives), QMC is often the single biggest gain available.
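A self-contained illustration using the base-2 van der Corput sequence (the one-dimensional building block of the Halton sequence), rather than a library Sobol' generator; the integrand $e^x$ is again an illustrative choice:

```python
import math

def van_der_corput(n, base=2):
    """First n points of the base-b van der Corput low-discrepancy sequence:
    point i is the radical inverse of i in the given base."""
    points = []
    for i in range(n):
        q, denom, k = 0.0, 1.0, i
        while k:
            denom *= base
            k, rem = divmod(k, base)
            q += rem / denom
        points.append(q)
    return points

n = 4096  # a power of the base gives an exactly equidistributed point set
qmc_est = sum(math.exp(u) for u in van_der_corput(n)) / n  # estimates e - 1
```

For $n = 2^{12}$ the points are a permutation of $\{i/n\}$, so the error here is $O(1/n)$ rather than the $O(1/\sqrt{n})$ of i.i.d. sampling.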

Common Confusions

Watch Out

Variance reduction does not change the rate

Antithetic, control, stratified, and Rao-Blackwell estimators all reduce the constant in $\mathrm{Var} = C/n$ but keep the $1/n$ variance rate. You still need $4\times$ the samples to halve the standard error. The improvement is in the constant $C$, which can be enormous in practice but does not change the asymptotic rate.

QMC is the one exception: by abandoning i.i.d. sampling it achieves faster-than-$1/\sqrt{n}$ error decay on sufficiently smooth integrands. It is a genuinely different estimation regime, not a variance-reduction add-on to Monte Carlo.

Watch Out

Control variates require known expectations

The control variate $g$ must have a known mean $\mu_g$. If you have to estimate $\mu_g$, it is no longer a control variate. It becomes an importance sampling or regression adjustment problem.

Summary

  • Antithetic variates: use negative correlation between sample pairs
  • Control variates: subtract a known-mean quantity correlated with the target; the optimal coefficient is $\mathrm{Cov}(f,g)/\mathrm{Var}(g)$
  • Stratification: partition the space and sample within strata; proportional allocation never does worse than naive sampling
  • Rao-Blackwellization: condition out part of the randomness analytically; guaranteed by the law of total variance never to increase variance
  • Common random numbers: pair simulations that share the same random numbers when estimating differences of expectations
  • Quasi-Monte Carlo: replace i.i.d. samples with a low-discrepancy sequence for faster-than-$1/\sqrt{n}$ convergence on smooth integrands
  • Classical variance reduction changes the constant, not the $O(1/\sqrt{n})$ rate; QMC changes the rate itself

Exercises

ExerciseCore

Problem

You want to estimate $\mathbb{E}[e^X]$ where $X \sim N(0,1)$. Explain how to use antithetic variates and why it reduces variance.

ExerciseAdvanced

Problem

Derive the optimal control variate coefficient $c^*$ and show that the minimized variance is $(1 - \rho^2)\,\mathrm{Var}(f)$, where $\rho$ is the correlation between $f(X)$ and $g(X)$.

References

Canonical:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapter 4
  • Ross, Simulation (2012), Chapter 9
  • Glasserman, Monte Carlo Methods in Financial Engineering (Springer, 2004), Chapters 4 (variance reduction), 5 (QMC) — the standard practitioner reference; all four classical techniques plus CRN and QMC are treated with concrete examples

Current:

  • Owen, Monte Carlo Theory, Methods, and Examples (2013), Chapters 8-9 (variance reduction) and 15-17 (QMC) — open-access textbook
  • Dick, Kuo, Sloan, "High-dimensional integration: the quasi-Monte Carlo way," Acta Numerica 22 (2013), 133-288 — modern QMC theory
  • Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods (SIAM, 1992) — foundational QMC monograph
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Last reviewed: April 13, 2026
