
Statistical Estimation

Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants

The fundamental information inequality for unbiased estimation. Coverage of the scalar and multivariate Cramér-Rao bounds, the chain rule for biased estimators, achievability in exponential families, the Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, and efficient information with nuisance parameters.

Advanced · Core · Tier 1 · Stable · Core spine · ~90 min
For: ML, Stats

Why This Matters


Variance Floor

The Cramér-Rao bound turns local information into a hard variance floor: sharper likelihood geometry means unbiased estimators can concentrate more tightly, but no unbiased procedure beats the inverse-information barrier.


Efficient estimators

When an estimator reaches the bound asymptotically, it is squeezing all available information out of the model without leaving variance on the table.

Important caveat

The classical floor is for unbiased estimators under regularity assumptions. Biased procedures can look better numerically, but then the right benchmark is a biased Cramér-Rao bound or a different risk criterion altogether.

The Cramér-Rao bound is the universal floor on the variance of any unbiased estimator. For an i.i.d. sample of size $n$ from $p(x \mid \theta)$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \frac{1}{n I(\theta)}$$

where $I(\theta)$ is the Fisher information per observation. The bound has three roles in modern statistics. First, it certifies efficiency: the maximum likelihood estimator attains this bound asymptotically, which is why MLE is the default. Second, it grades estimator design: any concrete unbiased estimator can be compared to $1/(nI(\theta))$ to see how much room remains. Third, it links estimation to information geometry: $I(\theta)$ is the Riemannian metric on the statistical manifold, and the bound is the metric-induced lower bound on the squared geodesic length of any unbiased displacement.

The bound is one Cauchy-Schwarz inequality applied to the score function. The proof is short, the implications are deep, and the failure modes (biased estimators, non-regular families, nuisance parameters, finite-sample sharpness) generate a family of refinements: Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, the multivariate Loewner ordering bound, and the BKRW (Bickel-Klaassen-Ritov-Wellner) semiparametric efficient information.

Mental Model

Think of the score $s(x; \theta) = \partial_\theta \log p(x \mid \theta)$ as a direction in the tangent space at $\theta$. The expected score is zero (the first Bartlett identity); the variance of the score is the Fisher information $I(\theta)$ (the second Bartlett identity). Any unbiased estimator $\hat\theta$ satisfies $\mathbb E[\hat\theta] = \theta$, so differentiating both sides and exchanging differentiation with integration gives $\mathrm{Cov}(\hat\theta, S_n) = 1$, where $S_n = \sum_i s(X_i; \theta)$ is the total score. Cauchy-Schwarz on this covariance and on $\mathrm{Var}(S_n) = nI(\theta)$ closes the bound:

$$1 = [\mathrm{Cov}(\hat\theta, S_n)]^2 \leq \mathrm{Var}(\hat\theta) \cdot nI(\theta).$$

Equality holds iff $\hat\theta - \theta$ is collinear with the score, which forces an exponential-family structure: only one-parameter exponential families (in their canonical sufficient statistic) admit a finite-sample efficient unbiased estimator. Outside that special case, efficiency is asymptotic, not finite-sample.

The bound generalizes in several directions: the multivariate version replaces scalars with positive-definite matrices in the Loewner order; the biased-estimator version multiplies by the squared derivative of the estimator's mean; the Hammersley-Chapman-Robbins bound replaces the score with a finite difference and removes the regularity requirement; the van Trees inequality averages over a prior; the efficient-information version projects out nuisance directions. All of these are one Cauchy-Schwarz inequality applied to a slightly different inner-product structure.

Formal Setup

Let $X_1, \ldots, X_n$ be i.i.d. from $p(x \mid \theta)$ with $\theta \in \Theta \subseteq \mathbb R^d$. Let $\ell_n(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$ denote the log-likelihood and $S_n(\theta) = \nabla_\theta \ell_n(\theta)$ the score.

Definition

Score function

The score function is the gradient of the log-density:

$$s(x; \theta) = \nabla_\theta \log p(x \mid \theta).$$

Under the regularity conditions below (the support of $p$ does not depend on $\theta$ and differentiation passes through the integral), $\mathbb E_\theta[s(X; \theta)] = 0$ for all $\theta$. This is the first Bartlett identity.

Definition

Fisher information

The Fisher information matrix is the covariance of the score:

$$I(\theta) = \mathrm{Cov}_\theta[s(X; \theta)] = \mathbb E_\theta[s(X; \theta)\, s(X; \theta)^\top].$$

Under twice-differentiability and regularity, $I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)]$ (the second Bartlett identity; see maximum likelihood estimation for the proof). For an i.i.d. sample of size $n$, the total Fisher information is $n I(\theta)$.
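As a quick sanity check on these identities, here is a minimal NumPy sketch (not from the original text; the Poisson family, seed, and sample size are arbitrary illustration choices) that verifies the two Bartlett identities by simulation:

```python
import numpy as np

# Numerical check of the Bartlett identities for X ~ Poisson(lam):
# score s(x; lam) = x/lam - 1, so E[s] = 0 and Var(s) = I(lam) = 1/lam.
rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = x / lam - 1.0                     # first derivative of log p(x | lam)
neg_hess = x / lam**2                     # minus second derivative of log p(x | lam)

print("E[score]     ~", score.mean())     # ~ 0      (first Bartlett identity)
print("Var(score)   ~", score.var())      # ~ 1/lam  (Fisher information)
print("E[-d2 log p] ~", neg_hess.mean())  # ~ 1/lam  (second Bartlett identity)
print("1/lam         =", 1.0 / lam)
```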

Definition

Regularity conditions

A parametric family $\{p(\cdot \mid \theta)\}$ is Cramér-regular at $\theta_0$ if and only if (1) the support of $p$ does not depend on $\theta$ in a neighborhood of $\theta_0$, (2) $\log p(x \mid \theta)$ is twice continuously differentiable in $\theta$ for almost every $x$, (3) differentiation of $\int p(x \mid \theta)\, dx = 1$ with respect to $\theta$ can be performed under the integral sign (e.g. by dominated convergence), and (4) the Fisher information $I(\theta_0)$ is finite and positive definite. The Cramér-Rao bound assumes these conditions.

Definition

Efficient estimator

An unbiased estimator $\hat\theta$ of a scalar parameter $\theta$ is finite-sample efficient if and only if it attains the Cramér-Rao lower bound with equality at every $n$ and every $\theta$: $\mathrm{Var}_\theta(\hat\theta) = 1/(nI(\theta))$. An estimator $\hat\theta_n$ is asymptotically efficient if and only if $\sqrt n\, (\hat\theta_n - \theta) \Rightarrow \mathcal N(0, I(\theta)^{-1})$. The MLE is asymptotically efficient under standard regularity (see asymptotic statistics); finite-sample efficiency is much rarer and characterizes one-parameter exponential families.

The Cramér-Rao Bound

Theorem

Cramér-Rao Lower Bound (Scalar)

Statement

For any unbiased estimator $\hat\theta$ of $\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \frac{1}{n I(\theta)}.$$

Equality at a fixed $\theta$ holds if and only if there exists a function $a(\theta) \neq 0$ such that

$$\sum_{i=1}^n s(X_i; \theta) = a(\theta)\, (\hat\theta - \theta) \quad \mathbb P_\theta\text{-a.s.}$$

Equality at every $\theta$ in an open neighborhood characterizes one-parameter exponential families.

Intuition

The score is the direction in which the log-likelihood rises fastest. An unbiased estimator must be sensitive to the parameter in the same direction (perturbing $\theta$ shifts $\mathbb E[\hat\theta]$ by exactly the same amount). Cauchy-Schwarz says this collinearity has a cost: the unbiased estimator's variance must absorb at least the inverse of the score's variance. High Fisher information means the score fluctuates strongly with the data and the likelihood is sharply peaked around the truth, so the data is informative and small variance is achievable.

Proof Sketch

Apply Cauchy-Schwarz to $\mathrm{Cov}(\hat\theta, S_n)$ with $S_n = \sum_{i=1}^n s(X_i; \theta)$:

$$[\mathrm{Cov}(\hat\theta, S_n)]^2 \leq \mathrm{Var}(\hat\theta) \cdot \mathrm{Var}(S_n).$$

Step 1. Differentiate $\mathbb E_\theta[\hat\theta] = \theta$ in $\theta$ and exchange differentiation with integration (valid by Cramér regularity). Writing $p_n(x \mid \theta) = \prod_i p(x_i \mid \theta)$ for the joint density of the sample,

$$1 = \frac{d}{d\theta} \int \hat\theta(x)\, p_n(x \mid \theta)\, dx = \int \hat\theta(x)\, S_n(x; \theta)\, p_n(x \mid \theta)\, dx = \mathbb E[\hat\theta\, S_n] = \mathrm{Cov}(\hat\theta, S_n)$$

(the last equality because $\mathbb E[S_n] = 0$).

Step 2. $\mathrm{Var}(S_n) = n I(\theta)$ since the per-sample scores are i.i.d. with variance $I(\theta)$.

Step 3. Substituting into Cauchy-Schwarz: $1 \leq \mathrm{Var}(\hat\theta) \cdot n I(\theta)$.

Equality in Cauchy-Schwarz requires $\hat\theta - \mathbb E[\hat\theta] = c \cdot S_n$ a.s. for some non-random $c$, which translates to the collinearity condition stated.
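A minimal Monte Carlo sketch of the two ingredients, using the normal-mean model in which the sample mean attains the bound; the constants and seed are arbitrary illustration choices:

```python
import numpy as np

# Monte Carlo sketch of the proof ingredients for N(mu, sigma^2) with sigma known:
# Cov(theta_hat, S_n) = 1, and Var(theta_hat) * n * I(theta) >= 1 (equality here,
# because the sample mean is the finite-sample efficient estimator in this family).
rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 1.5, 50, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
theta_hat = X.mean(axis=1)                   # unbiased estimator of mu
S_n = ((X - mu) / sigma**2).sum(axis=1)      # total score evaluated at the true mu
I = 1.0 / sigma**2                           # per-observation Fisher information

print("Cov(theta_hat, S_n)    ~", np.cov(theta_hat, S_n)[0, 1])  # ~ 1
print("Var(theta_hat) * n * I ~", theta_hat.var() * n * I)       # ~ 1, bound attained
```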

Why It Matters

This is the central inequality of classical estimation theory. Combined with the asymptotic efficiency of the MLE (see asymptotic statistics), it identifies $1/(nI(\theta))$ as the universal asymptotic floor for unbiased estimation under regularity. Every alternative estimator has to be benchmarked against it: if $\mathrm{Var}(\hat\theta_{\mathrm{alt}}) > 1/(nI(\theta))$ asymptotically, the MLE wins; if it equals it, both are asymptotically efficient. The bound also identifies the parameters that are hard to estimate (small $I(\theta)$) and the ones that are easy (large $I(\theta)$).

Failure Mode

Biased estimators escape the bound. The James-Stein estimator beats the sample mean in MSE for $d \geq 3$; ridge regression, LASSO, and posterior mean estimators all introduce bias in exchange for variance reduction. The Cramér-Rao bound governs only the unbiased class.

Non-regular families. For $X \sim U(0, \theta)$ the support depends on $\theta$, differentiation cannot be passed under the integral sign, and the bound does not apply. The MLE $\hat\theta = X_{(n)}$ has variance $\theta^2 / [n(n+2)] \cdot (1 + o(1))$, which is $\Theta(\theta^2 / n^2)$, faster than the $1/n$ parametric rate. The bound is silent here; use the Hammersley-Chapman-Robbins bound instead.

Bound not attained at finite $n$. Outside one-parameter exponential families, no unbiased estimator attains the bound at every $n$. For example, estimating $\sigma$ from $X_1, \ldots, X_n \sim \mathcal N(0, \sigma^2)$: the unbiased estimator $\hat\sigma = c_n \sqrt{\sum X_i^2}$ has variance strictly larger than $1/(nI(\sigma)) = \sigma^2/(2n)$ at finite $n$, with the relative gap shrinking at rate $1/n$.

Multivariate Form

Theorem

Multivariate Cramér-Rao Bound

Statement

For any unbiased estimator $\hat\theta$ of $\theta \in \mathbb R^d$,

$$\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$$

in the Loewner (positive semidefinite) ordering: $\mathrm{Cov}(\hat\theta) - \frac{1}{n} I(\theta)^{-1}$ is positive semidefinite. Equivalently, for every $v \in \mathbb R^d$,

$$\mathrm{Var}_\theta(v^\top \hat\theta) \geq \frac{1}{n}\, v^\top I(\theta)^{-1} v.$$

Intuition

The matrix bound is the simultaneous lower bound on the variance of every linear combination of the parameter estimates. Parameters that are jointly poorly identified (small eigenvalues of $I(\theta)$ in their direction) have correspondingly large minimum-variance bounds. The inverse Fisher matrix $I(\theta)^{-1}$ is the covariance template that the optimal unbiased estimator must equal or exceed in every direction.

Proof Sketch

Apply the matrix Cauchy-Schwarz inequality to the score vector $S_n$ and the estimator $\hat\theta$. Differentiating $\mathbb E_\theta[\hat\theta] = \theta$ component-wise gives $\mathrm{Cov}(\hat\theta, S_n) = I_d$ (the identity matrix). The total score has covariance $\mathrm{Cov}(S_n) = n I(\theta)$. Matrix Cauchy-Schwarz then gives $\mathrm{Cov}(\hat\theta) \succeq \mathrm{Cov}(\hat\theta, S_n)\, [\mathrm{Cov}(S_n)]^{-1} \mathrm{Cov}(S_n, \hat\theta) = (n I(\theta))^{-1}$. Specializing to a fixed direction $v$, $\mathrm{Var}(v^\top \hat\theta) = v^\top \mathrm{Cov}(\hat\theta)\, v \geq v^\top I(\theta)^{-1} v / n$. Note this is not $1/(n\, v^\top I(\theta)\, v)$: the matrix is inverted before the contraction with $v$, and inverting after the contraction gives a smaller quantity that is not a valid bound in general. The two coincide only when $v$ is an eigenvector of $I(\theta)$; in general they differ by the Schur-complement penalty for the orthogonal nuisance directions.
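A small numerical illustration of the directional bound, using a hypothetical 2×2 Fisher matrix (the numbers are arbitrary); it contrasts the correct quantity $v^\top I^{-1} v / n$ with the smaller, incorrect $1/(n\, v^\top I v)$:

```python
import numpy as np

# Directional Cramér-Rao bound for an assumed (hypothetical) per-observation Fisher
# matrix with correlated parameters. The correct bound is v' I^{-1} v / n, which is
# generally larger than the naive 1 / (n v' I v).
n = 100
I = np.array([[2.0, 1.5],
              [1.5, 4.0]])          # assumed per-observation Fisher information
I_inv = np.linalg.inv(I)
v = np.array([1.0, 0.0])            # direction: variance of the first coordinate

correct = v @ I_inv @ v / n         # Cramér-Rao bound in direction v
naive   = 1.0 / (n * (v @ I @ v))   # wrong quantity: inverse taken after contraction

print("v' I^{-1} v / n =", correct)   # larger: pays the Schur-complement penalty
print("1 / (n v' I v)  =", naive)

# For the first coordinate, the correct bound equals 1/(n * Schur complement).
schur = I[0, 0] - I[0, 1] * I[1, 0] / I[1, 1]
print("1 / (n * Schur) =", 1.0 / (n * schur))
```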

Why It Matters

Real models almost always have multiple parameters. The matrix bound governs how trade-offs between parameters propagate. A model whose Fisher matrix has a small eigenvalue is weakly identified in the corresponding direction: any unbiased estimator must accept large variance in that direction. This connects directly to ill-conditioned regression and to the natural gradient (which preconditions descent directions by $I^{-1}$ to remove this distortion).

Failure Mode

The bound applies to unbiased estimators only. The multivariate biased-estimator version (next theorem) replaces $I^{-1}$ with $\mathcal J I^{-1} \mathcal J^\top$, where $\mathcal J = \nabla_\theta \mathbb E[\hat\theta]$ is the Jacobian of the estimator's mean. When $I(\theta)$ is singular (rank-deficient, i.e. the model is non-identifiable in some direction), the bound is vacuous in that direction; use a generalized inverse $I^+$ and restrict to identifiable contrasts.

Biased Estimators: The Chain-Rule Bound

Theorem

Cramér-Rao Bound for Biased Estimators

Statement

For any (possibly biased) estimator $T(X)$ with mean $\psi(\theta) = \mathbb E_\theta[T(X)]$,

$$\mathrm{Cov}_\theta(T) \succeq \frac{1}{n}\, \mathcal J(\theta)\, I(\theta)^{-1}\, \mathcal J(\theta)^\top$$

where $\mathcal J(\theta) = \nabla_\theta \psi(\theta)$ is the Jacobian of $\psi$. In the scalar case this reduces to

$$\mathrm{Var}_\theta(T) \geq \frac{[\psi'(\theta)]^2}{n I(\theta)}.$$

Intuition

A biased estimator's mean shifts at rate $\psi'(\theta)$ as $\theta$ changes (not at rate $1$ as for unbiased estimators). The Cauchy-Schwarz argument now gives a covariance of $\psi'(\theta)$ between $T$ and the score, so the squared-covariance side of Cauchy-Schwarz becomes $[\psi'(\theta)]^2$ instead of $1$. The bound scales accordingly.

Proof Sketch

Repeat the scalar proof with $T$ replacing $\hat\theta$. Differentiating $\mathbb E_\theta[T] = \psi(\theta)$ and exchanging differentiation with integration gives $\mathrm{Cov}(T, S_n) = \psi'(\theta)$. Cauchy-Schwarz then gives $[\psi'(\theta)]^2 \leq \mathrm{Var}(T) \cdot n I(\theta)$.

Why It Matters

This is the bound that practitioners actually need. Most useful estimators are biased: ridge, LASSO, posterior means, shrinkage, kernel-smoothed estimates. The chain-rule form gives the variance lower bound for any such estimator, parameterized by how its mean depends on $\theta$. It also justifies the delta method: an unbiased estimator of $g(\theta)$ has variance lower-bounded by $[g'(\theta)]^2 / (n I(\theta))$, which matches the asymptotic variance of $g(\hat\theta_{\mathrm{MLE}})$ exactly under standard regularity.

Failure Mode

The chain-rule bound does not directly bound MSE, only variance. To bound MSE, add the squared bias: $\mathrm{MSE}(T) = \mathrm{Var}(T) + [\psi(\theta) - \theta]^2 \geq [\psi'(\theta)]^2 / (n I(\theta)) + [\psi(\theta) - \theta]^2$. The James-Stein estimator beats the sample mean precisely because its bias is small while its variance reduction is large; the Cramér-Rao bound in either form does not preclude this.
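A minimal sketch of the chain-rule bound in action, using a deliberately biased shrinkage estimator $T = c\bar X$ in the known-variance normal model (the shrinkage factor, constants, and seed are arbitrary choices):

```python
import numpy as np

# Chain-rule bound for a biased estimator T = c * Xbar with X_i ~ N(mu, sigma^2),
# sigma known. Here psi(mu) = c*mu, so the bound is [psi'(mu)]^2/(n I) = c^2 sigma^2/n,
# attained because T is a linear function of the efficient sample mean.
rng = np.random.default_rng(2)
mu, sigma, n, c, reps = 1.0, 2.0, 30, 0.8, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
T = c * X.mean(axis=1)

bound_var = c**2 * sigma**2 / n
print("Var(T)             ~", T.var(), " vs chain-rule bound", bound_var)  # equal here
print("MSE(T)             ~", ((T - mu)**2).mean())
print("var bound + bias^2  =", bound_var + ((c - 1) * mu)**2)
print("unbiased CR bound   =", sigma**2 / n)  # MSE of Xbar; T can beat it for small mu
```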

Sharper Bounds: Bhattacharyya and Hammersley-Chapman-Robbins

The Cramér-Rao bound uses only the first derivative of the log-likelihood. Two sharper bounds use higher-order information or no derivatives at all.

Theorem

Bhattacharyya Higher-Order Bound

Statement

Let $S_n^{(j)}(\theta) = \dfrac{1}{p_n(X_{1:n} \mid \theta)} \dfrac{\partial^j p_n(X_{1:n} \mid \theta)}{\partial \theta^j}$ for $j = 1, \ldots, k$, where $p_n$ is the joint density of the sample ($S_n^{(1)}$ is the total score). Define the Gram matrix $\mathcal B_n(\theta)$ with entries $[\mathcal B_n]_{ij} = \mathbb E[S_n^{(i)} S_n^{(j)}]$ (the Bhattacharyya information matrix). Then for any unbiased $\hat\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq e_1^\top \mathcal B_n(\theta)^{-1} e_1$$

where $e_1 = (1, 0, \ldots, 0)^\top \in \mathbb R^k$ (only the first derivative of $\mathbb E_\theta[\hat\theta] = \theta$ is nonzero). The $k=1$ case recovers the standard Cramér-Rao bound $1/(nI(\theta))$; $k \geq 2$ gives strictly tighter bounds when the higher-order information matrix is non-degenerate.

Intuition

The standard Cramér-Rao bound projects $\hat\theta$ onto the first score $S_n^{(1)}$ alone. The Bhattacharyya bound projects onto the larger linear span $\{S_n^{(1)}, S_n^{(2)}, \ldots, S_n^{(k)}\}$ of normalized higher-order likelihood derivatives, capturing more of the unbiased estimator's structure. The bound is sharper because the projection onto a larger subspace has equal-or-larger norm.

Proof Sketch

For each $j$, differentiate $\mathbb E[\hat\theta] = \theta$ exactly $j$ times under the integral sign. The $j=1$ derivative gives $\mathbb E[\hat\theta\, S_n^{(1)}] = 1$; higher-order derivatives give $\mathbb E[\hat\theta\, S_n^{(j)}] = 0$ for $j \geq 2$ (the second and higher derivatives of $\theta$ with respect to $\theta$ vanish). Apply Cauchy-Schwarz to the projection of $\hat\theta - \theta$ onto $\mathrm{span}(S_n^{(1)}, \ldots, S_n^{(k)})$ using the Gram inverse $\mathcal B_n^{-1}$.

Why It Matters

Bhattacharyya bounds quantify how much sharpness is left on the table by the standard Cramér-Rao bound. In some non-exponential models, the gap is large at finite $n$. The bound also generalizes naturally to the multivariate case (replace $e_1$ by an identity-block selection matrix).

Failure Mode

For one-parameter exponential families, all higher-order Bhattacharyya bounds collapse to the Cramér-Rao bound (the higher-order derivatives are linearly dependent on the first), so no sharpening is possible. Bhattacharyya helps in non-exponential families where the bound is not attained by the Cramér-Rao argument.

Theorem

Hammersley-Chapman-Robbins Bound (no regularity)

Statement

For any unbiased estimator $\hat\theta$ of $\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \sup_{\delta \neq 0} \frac{\delta^2}{\chi^2(p_{\theta + \delta} \,\|\, p_\theta)},$$

where the supremum runs over all $\delta \neq 0$ for which $p_{\theta+\delta}$ is absolutely continuous with respect to $p_\theta$, and $\chi^2(q \,\|\, p) = \mathbb E_p[(q/p - 1)^2] = \int (q - p)^2 / p \, dx$ is the chi-squared divergence. As $\delta \to 0$ in a Cramér-regular family, the bound recovers the Cramér-Rao bound (since $\chi^2(p_{\theta + \delta} \,\|\, p_\theta) = \delta^2 I(\theta) + o(\delta^2)$).

Intuition

Replace the score (a derivative) with a finite difference. The bound asks: how much can the parameter shift by $\delta$ before the data distribution becomes too distinguishable? An unbiased estimator that tracks $\theta$ exactly must also track shifts $\theta \mapsto \theta + \delta$, so its variance must be at least $\delta^2$ divided by the chi-squared distance between the two distributions. The supremum over $\delta$ tightens the bound; as $\delta \to 0$ it recovers the local Cramér-Rao bound, and for some families a finite $\delta$ gives a strictly sharper bound at finite $n$.

Proof Sketch

Define the likelihood ratio $L(x) = p(x \mid \theta + \delta) / p(x \mid \theta) - 1$. Since $\hat\theta$ is unbiased, $\mathbb E_\theta[\hat\theta L(X)] = \mathbb E_{\theta + \delta}[\hat\theta] - \mathbb E_\theta[\hat\theta] = (\theta + \delta) - \theta = \delta$. Apply Cauchy-Schwarz:

$$\delta^2 = [\mathrm{Cov}_\theta(\hat\theta, L)]^2 \leq \mathrm{Var}_\theta(\hat\theta) \cdot \mathrm{Var}_\theta(L) = \mathrm{Var}_\theta(\hat\theta) \cdot \chi^2(p_{\theta + \delta} \,\|\, p_\theta).$$

Rearrange and take the supremum over $\delta$.
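A small numerical check of the HCR bound in a regular family where everything is available in closed form; the Bernoulli parameter, sample size, and the grid of $\delta$ values are arbitrary choices:

```python
import numpy as np

# HCR bound for Bernoulli(p): for one observation chi^2(Ber(p+d) || Ber(p)) = d^2/(p(1-p)),
# so d^2/chi^2 = p(1-p) for every admissible d, matching 1/I(p) exactly. For n i.i.d.
# observations the product-measure chi^2 is (1 + d^2/(p(1-p)))^n - 1, and the supremum
# over d is approached as d -> 0, recovering the Cramér-Rao bound p(1-p)/n.
p, n = 0.3, 25
deltas = np.linspace(1e-3, 0.25, 1000)          # chi^2 depends only on d^2, so d > 0 suffices

chi2_n = (1.0 + deltas**2 / (p * (1 - p)))**n - 1.0
hcr = np.max(deltas**2 / chi2_n)

print("HCR bound (n obs)         ~", hcr)
print("Cramér-Rao bound p(1-p)/n  =", p * (1 - p) / n)
```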

Why It Matters

This is the bound to use when Cramér regularity fails. For $X \sim U(0, \theta)$, the chi-squared divergence between $U(0, \theta + \delta)$ and $U(0, \theta)$ is finite for $\delta < 0$ (and infinite for $\delta > 0$), and computing the supremum gives a non-trivial lower bound on the variance of any unbiased estimator. The HCR bound also handles discrete parameters, irregular boundaries, and any setting where the score is undefined or unbounded.

Failure Mode

The chi-squared divergence can be infinite (e.g. when the support of $p_{\theta+\delta}$ is not contained in that of $p_\theta$), in which case the bound for that $\delta$ is zero (vacuous). Choose $\delta$ small enough that $\chi^2$ is finite. For complicated parametric families, the HCR bound is hard to compute analytically; the Cramér-Rao bound is preferred when regularity holds.

The Bayesian Extension: van Trees Inequality

Theorem

van Trees (Bayesian Cramér-Rao) Inequality

Statement

For any estimator $\hat\theta(X)$ of $\theta$ (no unbiasedness required),

$$\mathbb E_\pi \mathbb E_\theta[(\hat\theta - \theta)^2] \geq \frac{1}{\mathbb E_\pi[n I(\theta)] + I(\pi)}$$

where $I(\pi) = \int (\pi'(\theta) / \pi(\theta))^2\, \pi(\theta)\, d\theta$ is the Fisher information of the prior (its Fisher information as a location family).

Intuition

The Bayesian Cramér-Rao bound combines two information sources: the data information $\mathbb E_\pi[n I(\theta)]$ (averaged over the prior) and the prior information $I(\pi)$. As the prior becomes flat (informationless), $I(\pi) \to 0$ and the bound recovers a Bayes-averaged Cramér-Rao bound. As the data dominates, the prior information becomes negligible and the asymptotic Bayes risk matches the asymptotic frequentist Cramér-Rao bound.

Proof Sketch

Apply Cauchy-Schwarz to the joint score $S(x, \theta) = \partial_\theta \log[\pi(\theta)\, p(x \mid \theta)] = \pi'(\theta)/\pi(\theta) + S_n(\theta)$, whose second moment under the joint distribution of $(X, \theta)$ is $\mathbb E_\pi[n I(\theta)] + I(\pi)$ (the cross term vanishes because the score has mean zero given $\theta$). The covariance $\mathrm{Cov}(\hat\theta - \theta, S)$ equals $1$ by an integration-by-parts argument that uses the prior vanishing at the boundary. Cauchy-Schwarz then gives the bound.
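A minimal Monte Carlo sketch in the conjugate normal-location model, where the van Trees bound is attained exactly by the posterior mean; the prior scale, noise level, and seed are arbitrary choices:

```python
import numpy as np

# van Trees bound for theta ~ N(0, tau^2) and X_i | theta ~ N(theta, sigma^2):
# E_pi[n I(theta)] = n/sigma^2 and I(pi) = 1/tau^2, so the bound is 1/(n/sigma^2 + 1/tau^2).
# In this conjugate model the posterior mean attains it (its Bayes risk equals the
# posterior variance).
rng = np.random.default_rng(3)
tau, sigma, n, reps = 1.0, 2.0, 20, 200_000

theta = rng.normal(0.0, tau, size=reps)
X = rng.normal(theta[:, None], sigma, size=(reps, n))

w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)   # posterior-mean shrinkage weight
theta_hat = w * X.mean(axis=1)                     # posterior mean under the N(0, tau^2) prior

bayes_risk = ((theta_hat - theta)**2).mean()
van_trees = 1.0 / (n / sigma**2 + 1.0 / tau**2)
print("Bayes risk of posterior mean ~", bayes_risk)
print("van Trees lower bound         =", van_trees)
```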

Why It Matters

The van Trees inequality is the standard tool for proving minimax lower bounds. Combined with a careful choice of prior (often supported on a small ball around the parameter of interest), it yields explicit lower bounds on the worst-case Bayes risk, which dualizes to a lower bound on the minimax risk. This is one of the two main routes to minimax lower bounds, the other being Le Cam's two-point method.

Failure Mode

The bound requires the prior to vanish at the boundary (so the integration by parts has no boundary term). Improper priors (uniform on the whole real line) violate this; a careful localization argument is needed. For multidimensional $\theta$, the matrix version replaces sums with the matrix sum $\mathbb E_\pi[n I(\theta)] + I(\pi)$ in the appropriate Loewner sense.

Achievability: When Does an Efficient Unbiased Estimator Exist?

The Cramér-Rao bound is achieved by an unbiased estimator at every $\theta$ if and only if the family is a one-parameter exponential family and the estimand is the mean-value parameter of the canonical sufficient statistic. The proof: equality in Cauchy-Schwarz forces $\hat\theta - \theta = c(\theta) \cdot S_n$ for some function $c$, which integrates to the exponential-family form $\log p(x \mid \eta) = \eta T(x) - A(\eta) + h(x)$ with estimand $\theta = A'(\eta)$, estimator $\hat\theta = \bar T = \frac{1}{n}\sum_i T(X_i)$, and the (chain-rule) Cramér-Rao bound met as $\mathrm{Var}(\bar T) = A''(\eta)/n = [A''(\eta)]^2 / (n I(\eta))$.

For other families, the best possible unbiased estimator is the uniformly minimum variance unbiased estimator (UMVUE), which is found by Rao-Blackwellizing any unbiased estimator with respect to a complete sufficient statistic (Rao-Blackwellization). The UMVUE is unique a.s. when a complete sufficient statistic exists (Lehmann-Scheffé). It need not attain the Cramér-Rao bound at finite $n$; the gap $\mathrm{Var}(\mathrm{UMVUE}) - 1/(nI)$ is non-negative and is typically of order $1/n^2$ in regular models (a relative gap of order $1/n$).

In multi-parameter settings with nuisance parameters, the relevant lower bound is the efficient information $I^* = I_{\mu\mu} - I_{\mu\eta} I_{\eta\eta}^{-1} I_{\eta\mu}$ for a parameter of interest $\mu$ in the presence of a nuisance $\eta$ (Bickel-Klaassen-Ritov-Wellner, Efficient and Adaptive Estimation, 1993). The MLE for $\mu$ is asymptotically efficient with respect to $(I^*)^{-1}$, not $(I_{\mu\mu})^{-1}$. This is a strict information loss whenever $I_{\mu\eta} \neq 0$.
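A short numerical illustration of the efficient information, using simple linear regression with the intercept treated as a nuisance parameter (the design, noise level, and seed are arbitrary choices):

```python
import numpy as np

# Efficient information for the slope in y = alpha + beta*x + noise, intercept as nuisance.
# Full-sample Fisher matrix (sigma known), parameter order (alpha, beta):
#   I = (1/sigma^2) [[n, sum(x)], [sum(x), sum(x^2)]].
# The Schur complement I* for beta is sum((x - xbar)^2)/sigma^2, smaller than
# I_bb = sum(x^2)/sigma^2 whenever xbar != 0, so the achievable variance (I*)^{-1} is larger.
rng = np.random.default_rng(4)
n, sigma = 200, 1.0
x = rng.normal(3.0, 1.0, size=n)        # non-centered design, so I_{alpha,beta} != 0

I = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]]) / sigma**2
I_bb = I[1, 1]
I_star = I[1, 1] - I[1, 0] * I[0, 1] / I[0, 0]      # Schur complement

print("naive bound   1/I_bb =", 1.0 / I_bb)
print("correct bound 1/I*   =", 1.0 / I_star)
print("matches sigma^2/Sxx  =", sigma**2 / ((x - x.mean())**2).sum())
```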

Canonical Examples

Example

Normal mean with known variance

Let $X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. The score is $s(x; \mu) = (x - \mu)/\sigma^2$, so $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \geq \sigma^2/n$. The sample mean $\bar X$ has $\mathrm{Var}(\bar X) = \sigma^2/n$, attaining the bound exactly. This is a one-parameter exponential family with $T(x) = x$, so finite-sample efficiency is expected.

Example

Normal mean and variance jointly: Loewner ordering

Let $X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with both unknown. The score vector is $s = \big((x-\mu)/\sigma^2,\; -1/(2\sigma^2) + (x-\mu)^2/(2\sigma^4)\big)$. Direct computation gives the Fisher information matrix

$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The matrix Cramér-Rao bound is $\mathrm{Cov}(\hat\mu, \hat{\sigma^2}) \succeq \mathrm{diag}(\sigma^2/n,\; 2\sigma^4/n)$. The unbiased estimators are $\bar X$ for $\mu$ (variance $\sigma^2/n$, attains the bound) and $S^2 = \frac{1}{n-1}\sum_i (X_i - \bar X)^2$ for $\sigma^2$ (variance $2\sigma^4/(n-1) > 2\sigma^4/n$, strictly above the bound at finite $n$, with the gap $2\sigma^4/[n(n-1)]$ closing at rate $1/n^2$).
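A quick Monte Carlo check of both claims (the constants and seed are arbitrary choices):

```python
import numpy as np

# N(mu, sigma^2) with both parameters unknown: Xbar attains its bound sigma^2/n,
# while the unbiased S^2 has variance 2 sigma^4/(n-1), strictly above 2 sigma^4/n.
rng = np.random.default_rng(5)
mu, sigma, n, reps = 0.0, 1.0, 10, 500_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)                 # unbiased sample variance

print("Var(Xbar) ~", xbar.var(), " bound:", sigma**2 / n)
print("Var(S^2)  ~", s2.var(),   " bound:", 2 * sigma**4 / n,
      " exact:", 2 * sigma**4 / (n - 1))
```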

Example

Poisson rate

Let $X_1, \ldots, X_n \sim \mathrm{Poisson}(\lambda)$. The score is $s(x; \lambda) = x/\lambda - 1$, so $I(\lambda) = 1/\lambda$. The Cramér-Rao bound is $\mathrm{Var}(\hat\lambda) \geq \lambda/n$, and the sample mean $\bar X$ has $\mathrm{Var}(\bar X) = \lambda/n$. Attained, since this is a one-parameter exponential family with $T(x) = x$.

Example

Cauchy location: efficient unbiased estimator does not exist at finite n

Let $X_1, \ldots, X_n \sim \mathrm{Cauchy}(\theta, 1)$ with density $p(x \mid \theta) = 1/[\pi(1 + (x - \theta)^2)]$. The score is $s(x; \theta) = 2(x - \theta)/[1 + (x - \theta)^2]$ and $I(\theta) = 1/2$. The Cramér-Rao bound is $\mathrm{Var}(\hat\theta) \geq 2/n$. No closed-form unbiased estimator achieves this bound at finite $n$ (the Cauchy is not an exponential family). The MLE $\hat\theta_n$ achieves it asymptotically: $\sqrt n(\hat\theta_n - \theta) \Rightarrow \mathcal N(0, 2)$. Note the sample mean has infinite variance for Cauchy data and is not even a consistent estimator; only the MLE (or a robust estimator like the sample median) works.
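A Monte Carlo sketch of asymptotic efficiency for the Cauchy location MLE, computed here by numerical optimization with SciPy (the bracketing around the sample median and all constants are arbitrary implementation choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# The Cauchy location MLE is asymptotically efficient: Var(theta_hat) approaches the
# Cramér-Rao bound 2/n, which no unbiased estimator attains at finite n.
# The sample median is shown for comparison.
rng = np.random.default_rng(6)
theta0, n, reps = 0.0, 200, 2000

def cauchy_mle(x):
    nll = lambda t: np.log1p((x - t)**2).sum()   # negative log-likelihood up to constants
    return minimize_scalar(nll, bracket=(np.median(x) - 1, np.median(x) + 1)).x

samples = theta0 + rng.standard_cauchy((reps, n))
mle = np.array([cauchy_mle(x) for x in samples])
med = np.median(samples, axis=1)

print("Var(MLE)    ~", mle.var(), " Cramér-Rao bound 2/n =", 2 / n)
print("Var(median) ~", med.var(), " asymptotic pi^2/(4n) =", np.pi**2 / (4 * n))
```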

Example

Uniform U(0, theta): Cramér-Rao does not apply, HCR does

Let $X_1, \ldots, X_n \sim U(0, \theta)$. The support depends on $\theta$, so Cramér regularity fails and the Cramér-Rao bound is silent. Use the Hammersley-Chapman-Robbins bound: for $\delta < 0$, $\chi^2(U(0, \theta + \delta) \,\|\, U(0, \theta)) = -\delta/(\theta + \delta)$ for one observation, so for an i.i.d. sample of size $n$ the chi-squared divergence is $(\theta/(\theta + \delta))^n - 1$. Optimizing $\delta^2 / [(\theta/(\theta + \delta))^n - 1]$ over $\delta$ gives a bound of order $\theta^2/n^2$, matching the actual $\Theta(\theta^2/n^2)$ variance of the MLE $\hat\theta = X_{(n)}$.
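A numerical version of this optimization (the grid, $\theta$, and $n$ are arbitrary choices):

```python
import numpy as np

# HCR bound for U(0, theta) with n observations: the n-fold product chi^2 between
# U(0, theta+d) and U(0, theta), d < 0, is (theta/(theta+d))^n - 1. The optimized bound
# is of order theta^2/n^2, the same rate as Var(X_(n)) = n theta^2 / ((n+1)^2 (n+2)).
theta, n = 1.0, 50
d = np.linspace(-0.5 * theta, -1e-6, 100_000)      # only d < 0 keeps chi^2 finite

chi2_n = (theta / (theta + d))**n - 1.0
hcr = np.max(d**2 / chi2_n)

var_mle = n * theta**2 / ((n + 1)**2 * (n + 2))    # exact variance of X_(n)
print("HCR lower bound         ~", hcr)
print("Var(X_(n)) (MLE)         =", var_mle)
print("theta^2 / n^2 reference  =", theta**2 / n**2)
```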

Example

Linear regression beta hat is efficient

Let $Y \mid X \sim \mathcal N(X\beta, \sigma^2 I_n)$ with $X \in \mathbb R^{n \times d}$ fixed and full-rank. The Fisher information for $\beta$ (with $\sigma^2$ known) is $I(\beta) = X^\top X / \sigma^2$. The Cramér-Rao bound is $\mathrm{Cov}(\hat\beta) \succeq \sigma^2 (X^\top X)^{-1}$. The OLS estimator $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y$ has covariance exactly $\sigma^2 (X^\top X)^{-1}$, attaining the bound. This is the Gauss-Markov theorem in its Cramér-Rao form: OLS is efficient under Gaussian noise (and BLUE under any noise with finite variance).
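A Monte Carlo check that OLS attains the matrix bound (the design, noise level, and seed are arbitrary choices):

```python
import numpy as np

# OLS under Gaussian noise with a fixed design attains the matrix Cramér-Rao bound
# sigma^2 (X'X)^{-1}; the empirical covariance over replications should match it.
rng = np.random.default_rng(7)
n, d, sigma = 100, 3, 0.5
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
bound = sigma**2 * np.linalg.inv(X.T @ X)

reps = 50_000
Y = X @ beta + sigma * rng.normal(size=(reps, n))     # reps independent response vectors
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y.T).T      # OLS for every replication

print("empirical Cov(beta_hat):\n", np.cov(beta_hat.T))
print("Cramér-Rao bound sigma^2 (X'X)^{-1}:\n", bound)
```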

Common Confusions

Watch Out

Cramér-Rao does not bound MSE for biased estimators

The bound $\mathrm{Var}(\hat\theta) \geq 1/(nI)$ governs variance, not MSE. MSE = variance + bias². A biased estimator can have lower MSE than $1/(nI)$ when the bias is small and the variance is much smaller. James-Stein, ridge regression, LASSO, posterior means, and shrinkage estimators all do this systematically. The bound is vacuous as a statement about MSE for the biased class.

Watch Out

Asymptotic efficiency is weaker than finite-sample efficiency

The MLE is asymptotically efficient under standard regularity: $\sqrt n (\hat\theta_n - \theta) \Rightarrow \mathcal N(0, I(\theta)^{-1})$. This says the limiting variance equals the Cramér-Rao bound. It does not say the MLE attains the bound at any finite $n$ (it generally does not, except in one-parameter exponential families). Finite-sample variance is typically $\mathrm{Var}(\hat\theta_n) = 1/(nI) + O(1/n^2)$, with the second-order term computable via the Bartlett correction or Edgeworth expansion.

Watch Out

The bound applies to unbiased estimators of theta, not of g(theta)

The Cramér-Rao bound for the parameter $\theta$ is $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. For an unbiased estimator $T$ of a transformed parameter $g(\theta)$, the relevant bound is $\mathrm{Var}(T) \geq [g'(\theta)]^2 / (nI(\theta))$, which is the chain-rule version. Conflating the two is a common error: e.g. estimating $\sigma^2$ vs $\sigma$ requires applying the chain rule with $g(\sigma^2) = \sqrt{\sigma^2}$, $g'(\sigma^2) = 1/(2\sqrt{\sigma^2})$.

Watch Out

The bound is local at theta, not uniform in theta

The Cramér-Rao bound is a function of $\theta$: $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. An estimator can attain the bound at one value of $\theta$ and exceed it at others; uniform attainability characterizes one-parameter exponential families. For minimax-optimal inference (worst-case over $\theta$), use the van Trees inequality or Le Cam's method instead.

Watch Out

With nuisance parameters, the right denominator is efficient information

For a parameter of interest $\mu$ with a nuisance parameter $\eta$, the asymptotic variance of any regular estimator of $\mu$ is bounded below by $(I^*)^{-1}$, where $I^* = I_{\mu\mu} - I_{\mu\eta} I_{\eta\eta}^{-1} I_{\eta\mu}$ is the Schur complement (the efficient information). Using $(I_{\mu\mu})^{-1}$ (the marginal Fisher information) is wrong whenever $I_{\mu\eta} \neq 0$; it under-states the achievable variance. The MLE for $\mu$ is asymptotically efficient with respect to $(I^*)^{-1}$.

Watch Out

Singular Fisher matrix means the model is non-identifiable in some direction

A singular Fisher information matrix $I(\theta)$ has zero eigenvalues, indicating directions in parameter space along which the likelihood is locally flat. No data set, however large, can distinguish parameters along these directions; the Cramér-Rao bound is infinite (vacuous) in those directions. Either reparameterize to identify only the estimable contrasts, or impose regularization (a prior, a sparsity constraint) to pin down the model.

Exercises

ExerciseCore

Problem

Compute the Fisher information for $X \sim \mathrm{Bernoulli}(p)$ and state the Cramér-Rao bound for estimating $p$ from $n$ i.i.d. observations. Verify that $\bar X = \frac{1}{n}\sum_i X_i$ is efficient.

ExerciseCore

Problem

Let $X_1, \ldots, X_n$ be i.i.d. from a one-parameter exponential family in canonical form: $p(x \mid \eta) = h(x) \exp(\eta T(x) - A(\eta))$. Let $\hat\theta = \bar T = \frac{1}{n}\sum_i T(X_i)$ estimate $\mu(\eta) = A'(\eta) = \mathbb E_\eta[T(X)]$.

(a) Show $\hat\theta$ is unbiased for $\mu(\eta)$. (b) Compute $I(\eta)$ for one observation. (c) Show $\mathrm{Var}(\hat\theta) = A''(\eta)/n$ and that this attains the chain-rule Cramér-Rao bound for estimating $\mu(\eta)$.

ExerciseAdvanced

Problem

Show that the Cramér-Rao bound for estimating $\sigma^2$ from $X_1, \ldots, X_n \sim \mathcal N(0, \sigma^2)$ is $2\sigma^4/n$. The unbiased estimator $\hat\sigma^2 = \frac{1}{n}\sum_i X_i^2$ has variance $2\sigma^4/n$, exactly attaining the bound. Now consider estimating $\sigma$ (not $\sigma^2$). Compute the chain-rule Cramér-Rao bound for $\sigma$, and show that the unbiased estimator $c_n \sqrt{\sum_i X_i^2}$ (with $c_n$ chosen for unbiasedness) has variance strictly larger than this bound at finite $n$.

ExerciseAdvanced

Problem

Consider the location family $X_1, \ldots, X_n \sim p(x - \theta)$ for some symmetric, smooth density $p$. Show that the Fisher information is $I(\theta) = \int (p'(z))^2 / p(z)\, dz$, independent of $\theta$. Compute $I(\theta)$ for $p$ = Gaussian, Laplace, and Cauchy. Comment on which has the highest Fisher information per observation.

ExerciseResearch

Problem

Use the van Trees inequality to derive a lower bound on the minimax risk $\inf_{\hat\theta} \sup_{\theta \in [0, 1]} \mathbb E_\theta[(\hat\theta - \theta)^2]$ for estimating $\theta$ from $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(\theta)$ with $\theta \in [0, 1]$. Choose a smooth prior supported on a small ball around any fixed $\theta_0 \in (0, 1)$, apply van Trees, then take the supremum over $\theta_0$.


References

Canonical:

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury. Chapter 7.3 (Cramér-Rao inequality and efficiency).
  • Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Chapter 2 (unbiasedness, Cramér-Rao, UMVUE), Chapter 6 (multivariate Cramér-Rao).
  • Schervish, M. J. (1995). Theory of Statistics. Springer. Section 2.3 (information inequalities, Bhattacharyya and HCR bounds).
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Original treatment of the inequality (Section 32.3).
  • Rao, C. R. (1945). "Information and the accuracy attainable in the estimation of statistical parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. The other half of the eponymous bound.

Current:

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapter 8 (efficient estimators, asymptotic Cramér-Rao bound), Chapter 25 (semiparametric efficient information).
  • Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press. Chapter 3 (efficient information with nuisance parameters; the modern reference for II^*).
  • Keener, R. W. (2010). Theoretical Statistics. Springer. Chapter 3 (unbiased estimation and efficiency), Chapter 4 (UMVUE via Lehmann-Scheffé).
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Chapter 11.10 (Fisher information and the Cramér-Rao bound, KL connection).
  • Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer. Chapter 2 (van Trees inequality and Le Cam's two-point method for minimax lower bounds).

Critique and refinements:

  • Hammersley, J. M. (1950). "On estimating restricted parameters." Journal of the Royal Statistical Society B, 12, 192-240. The HCR bound.
  • Chapman, D. G., & Robbins, H. (1951). "Minimum variance estimation without regularity assumptions." Annals of Mathematical Statistics, 22, 581-586. The other half of the eponymous bound.
  • Bhattacharyya, A. (1946). "On some analogues of the amount of information and their use in statistical estimation." Sankhyā, 8, 1-14. Higher-order information bounds.
  • van Trees, H. L. (1968). Detection, Estimation, and Modulation Theory, Part I. Wiley. The Bayesian Cramér-Rao inequality (Chapter 2).
  • Stein, C. (1956). "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution." Proceedings of the Third Berkeley Symposium, 1, 197-206. Shows that biased shrinkage estimators beat the Cramér-Rao-efficient sample mean in MSE for $d \geq 3$. See shrinkage estimation.
  • Brown, L. D., & Gajek, L. (1990). "Information inequalities for the Bayes risk." Annals of Statistics, 18, 1578-1594. Sharper Bayesian information inequalities than van Trees in some settings.


Last reviewed: April 19, 2026
