
Statistical Estimation

Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants

The fundamental information inequality for unbiased estimation. Coverage of the scalar and multivariate Cramér-Rao bounds, the chain rule for biased estimators, achievability in exponential families, the Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, and efficient information with nuisance parameters.

Advanced · Core · Tier 1 · Stable · Core spine · ~90 min
For: ML, Stats

Why This Matters


Variance Floor

The Cramér-Rao bound turns local information into a hard variance floor: sharper likelihood geometry means unbiased estimators can concentrate more tightly, but no unbiased procedure beats the inverse-information barrier.


Efficient estimators

When an estimator reaches the bound asymptotically, it is squeezing all available information out of the model without leaving variance on the table.

Important caveat

The classical floor is for unbiased estimators under regularity assumptions. Biased procedures can look better numerically, but then the right benchmark is a biased Cramér-Rao bound or a different risk criterion altogether.

The Cramér-Rao bound is the universal floor on the variance of any unbiased estimator. For an i.i.d. sample of size $n$ from $p(x \mid \theta)$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \frac{1}{n I(\theta)}$$

where $I(\theta)$ is the Fisher information per observation. The bound has three roles in modern statistics. First, it certifies efficiency: the maximum likelihood estimator attains this bound asymptotically, which is why MLE is the default. Second, it grades estimator design: any concrete unbiased estimator can be compared to $1/(nI(\theta))$ to see how much room remains. Third, it links estimation to information geometry: $I(\theta)$ is the Riemannian metric on the statistical manifold, and the bound is the metric-induced lower bound on the squared geodesic length of any unbiased displacement.

The bound is one Cauchy-Schwarz inequality applied to the score function. The proof is short, the implications are deep, and the failure modes (biased estimators, non-regular families, nuisance parameters, finite-sample sharpness) generate a family of refinements: Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, the multivariate Loewner ordering bound, and the BKRW (Bickel-Klaassen-Ritov-Wellner) semiparametric efficient information.

Mental Model

Think of the score $s(x; \theta) = \partial_\theta \log p(x \mid \theta)$ as a direction in the tangent space at $\theta$. The expected score is zero (the first Bartlett identity); the variance of the score is the Fisher information $I(\theta)$ (the second Bartlett identity). Any unbiased estimator $\hat\theta$ satisfies $\mathbb E[\hat\theta] = \theta$, so differentiating both sides and exchanging differentiation with integration gives $\mathrm{Cov}(\hat\theta, S_n) = 1$, where $S_n = \sum_i s(X_i; \theta)$ is the total score. Cauchy-Schwarz on this covariance and on $\mathrm{Var}(S_n) = nI(\theta)$ closes the bound:

$$1 = [\mathrm{Cov}(\hat\theta, S_n)]^2 \leq \mathrm{Var}(\hat\theta) \cdot nI(\theta).$$

Equality holds iff $\hat\theta - \theta$ is collinear with the score, which forces an exponential-family structure: only one-parameter exponential families (in their canonical sufficient statistic) admit a finite-sample efficient unbiased estimator. Outside that special case, efficiency is asymptotic, not finite-sample.

The bound generalizes in several directions: the multivariate version replaces scalars with positive-definite matrices in the Loewner order; the biased-estimator version multiplies by the squared derivative of the estimator's mean; the Hammersley-Chapman-Robbins bound replaces the score with a finite difference and removes the regularity requirement; the van Trees inequality averages over a prior; the efficient-information version projects out nuisance directions. All of these are one Cauchy-Schwarz inequality applied to a slightly different inner-product structure.

Formal Setup

Let $X_1, \ldots, X_n$ be i.i.d. from $p(x \mid \theta)$ with $\theta \in \Theta \subseteq \mathbb R^d$. Let $\ell_n(\theta) = \sum_{i=1}^n \log p(X_i \mid \theta)$ denote the log-likelihood and $S_n(\theta) = \nabla_\theta \ell_n(\theta)$ the score.

Definition

Score function

The score function is the gradient of the log-density:

$$s(x; \theta) = \nabla_\theta \log p(x \mid \theta).$$

Under the regularity conditions below (the support of $p$ does not depend on $\theta$ and differentiation passes through the integral), $\mathbb E_\theta[s(X; \theta)] = 0$ for all $\theta$. This is the first Bartlett identity.

Definition

Fisher information

The Fisher information matrix is the covariance of the score:

$$I(\theta) = \mathrm{Cov}_\theta[s(X; \theta)] = \mathbb E_\theta[s(X; \theta)\, s(X; \theta)^\top].$$

Under twice-differentiability and regularity, $I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)]$ (the second Bartlett identity; see maximum likelihood estimation for the proof). For an i.i.d. sample of size $n$, the total Fisher information is $n I(\theta)$.
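As a quick sanity check on these identities, here is a minimal NumPy sketch (not from the original text; the Poisson family, seed, and sample size are arbitrary illustration choices) that verifies the two Bartlett identities by simulation:

```python
import numpy as np

# Numerical check of the Bartlett identities for X ~ Poisson(lam):
# score s(x; lam) = x/lam - 1, so E[s] = 0 and Var(s) = I(lam) = 1/lam.
rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = x / lam - 1.0                     # first derivative of log p(x | lam)
neg_hess = x / lam**2                     # minus second derivative of log p(x | lam)

print("E[score]     ~", score.mean())     # ~ 0      (first Bartlett identity)
print("Var(score)   ~", score.var())      # ~ 1/lam  (Fisher information)
print("E[-d2 log p] ~", neg_hess.mean())  # ~ 1/lam  (second Bartlett identity)
print("1/lam         =", 1.0 / lam)
```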

Definition

Regularity conditions

A parametric family $\{p(\cdot \mid \theta)\}$ is Cramér-regular at $\theta_0$ if and only if (1) the support of $p$ does not depend on $\theta$ in a neighborhood of $\theta_0$, (2) $\log p(x \mid \theta)$ is twice continuously differentiable in $\theta$ for almost every $x$, (3) differentiation of $\int p(x \mid \theta)\, dx = 1$ with respect to $\theta$ can be performed under the integral sign (e.g. by dominated convergence), and (4) the Fisher information $I(\theta_0)$ is finite and positive definite. The Cramér-Rao bound assumes these conditions.

Definition

Efficient estimator

An unbiased estimator $\hat\theta$ of a scalar parameter $\theta$ is finite-sample efficient if and only if it attains the Cramér-Rao lower bound with equality at every $n$ and every $\theta$: $\mathrm{Var}_\theta(\hat\theta) = 1/(nI(\theta))$. An estimator $\hat\theta_n$ is asymptotically efficient if and only if $\sqrt n\, (\hat\theta_n - \theta) \Rightarrow \mathcal N(0, I(\theta)^{-1})$. The MLE is asymptotically efficient under standard regularity (see asymptotic statistics); finite-sample efficiency is much rarer and characterizes one-parameter exponential families.

The Cramér-Rao Bound

Theorem

Cramér-Rao Lower Bound (Scalar)

Statement

For any unbiased estimator $\hat\theta$ of $\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \frac{1}{n I(\theta)}.$$

Equality at a fixed $\theta$ holds if and only if there exists a function $a(\theta) \neq 0$ such that

$$\sum_{i=1}^n s(X_i; \theta) = a(\theta)\, (\hat\theta - \theta) \quad \mathbb P_\theta\text{-a.s.}$$

Equality at every $\theta$ in an open neighborhood characterizes one-parameter exponential families.

Intuition

The score is the direction in which the log-likelihood rises fastest. An unbiased estimator must be sensitive to the parameter in the same direction (perturbing $\theta$ shifts $\mathbb E[\hat\theta]$ by exactly the same amount). Cauchy-Schwarz says this collinearity has a cost: the unbiased estimator's variance must absorb at least the inverse of the score's variance. High Fisher information means the score fluctuates strongly with the data and the likelihood is sharply peaked around the truth, so the data is informative and small variance is achievable.

Proof Sketch

Apply Cauchy-Schwarz to $\mathrm{Cov}(\hat\theta, S_n)$ with $S_n = \sum_{i=1}^n s(X_i; \theta)$:

$$[\mathrm{Cov}(\hat\theta, S_n)]^2 \leq \mathrm{Var}(\hat\theta) \cdot \mathrm{Var}(S_n).$$

Step 1. Differentiate $\mathbb E_\theta[\hat\theta] = \theta$ in $\theta$ and exchange differentiation with integration (valid by Cramér regularity). Writing $p_n(x \mid \theta) = \prod_i p(x_i \mid \theta)$ for the joint density of the sample,

$$1 = \frac{d}{d\theta} \int \hat\theta(x)\, p_n(x \mid \theta)\, dx = \int \hat\theta(x)\, S_n(x; \theta)\, p_n(x \mid \theta)\, dx = \mathbb E[\hat\theta\, S_n] = \mathrm{Cov}(\hat\theta, S_n)$$

(the last equality because $\mathbb E[S_n] = 0$).

Step 2. $\mathrm{Var}(S_n) = n I(\theta)$ since the per-sample scores are i.i.d. with variance $I(\theta)$.

Step 3. Substituting into Cauchy-Schwarz: $1 \leq \mathrm{Var}(\hat\theta) \cdot n I(\theta)$.

Equality in Cauchy-Schwarz requires $\hat\theta - \mathbb E[\hat\theta] = c \cdot S_n$ a.s. for some non-random $c$, which translates to the collinearity condition stated.
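A minimal Monte Carlo sketch of the two ingredients, using the normal-mean model in which the sample mean attains the bound; the constants and seed are arbitrary illustration choices:

```python
import numpy as np

# Monte Carlo sketch of the proof ingredients for N(mu, sigma^2) with sigma known:
# Cov(theta_hat, S_n) = 1, and Var(theta_hat) * n * I(theta) >= 1 (equality here,
# because the sample mean is the finite-sample efficient estimator in this family).
rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 1.5, 50, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
theta_hat = X.mean(axis=1)                   # unbiased estimator of mu
S_n = ((X - mu) / sigma**2).sum(axis=1)      # total score evaluated at the true mu
I = 1.0 / sigma**2                           # per-observation Fisher information

print("Cov(theta_hat, S_n)    ~", np.cov(theta_hat, S_n)[0, 1])  # ~ 1
print("Var(theta_hat) * n * I ~", theta_hat.var() * n * I)       # ~ 1, bound attained
```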

Why It Matters

This is the central inequality of classical estimation theory. Combined with the asymptotic efficiency of the MLE (see asymptotic statistics), it identifies $1/(nI(\theta))$ as the universal asymptotic floor for unbiased estimation under regularity. Every alternative estimator has to be benchmarked against it: if $\mathrm{Var}(\hat\theta_{\mathrm{alt}}) > 1/(nI(\theta))$ asymptotically, the MLE wins; if it equals it, both are asymptotically efficient. The bound also identifies the parameters that are hard to estimate (small $I(\theta)$) and the ones that are easy (large $I(\theta)$).

Failure Mode

Biased estimators escape the bound. The James-Stein estimator beats the sample mean in MSE for $d \geq 3$; ridge regression, LASSO, and posterior mean estimators all introduce bias in exchange for variance reduction. The Cramér-Rao bound governs only the unbiased class.

Non-regular families. For $X \sim U(0, \theta)$ the support depends on $\theta$, differentiation cannot be passed under the integral sign, and the bound does not apply. The MLE $\hat\theta = X_{(n)}$ has variance $\theta^2 / [n(n+2)] \cdot (1 + o(1))$, which is $\Theta(\theta^2 / n^2)$, faster than the $1/n$ parametric rate. The bound is silent here; use the Hammersley-Chapman-Robbins bound instead.

Bound not attained at finite $n$. Outside one-parameter exponential families, no unbiased estimator attains the bound at every $n$. For example, estimating $\sigma$ from $X_1, \ldots, X_n \sim \mathcal N(0, \sigma^2)$: the unbiased estimator $\hat\sigma = c_n \sqrt{\sum X_i^2}$ has variance strictly larger than $1/(nI(\sigma)) = \sigma^2/(2n)$ at finite $n$, with the relative gap shrinking at rate $1/n$.

Multivariate Form

Theorem

Multivariate Cramér-Rao Bound

Statement

For any unbiased estimator $\hat\theta$ of $\theta \in \mathbb R^d$,

$$\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$$

in the Loewner (positive semidefinite) ordering: $\mathrm{Cov}(\hat\theta) - \frac{1}{n} I(\theta)^{-1}$ is positive semidefinite. Equivalently, for every $v \in \mathbb R^d$,

$$\mathrm{Var}_\theta(v^\top \hat\theta) \geq \frac{1}{n}\, v^\top I(\theta)^{-1} v.$$

Intuition

The matrix bound is the simultaneous lower bound on the variance of every linear combination of the parameter estimates. Parameters that are jointly poorly identified (small eigenvalues of $I(\theta)$ in their direction) have correspondingly large minimum-variance bounds. The inverse Fisher matrix $I(\theta)^{-1}$ is the covariance template that the optimal unbiased estimator must equal or exceed in every direction.

Proof Sketch

Apply the matrix Cauchy-Schwarz inequality to the score vector $S_n$ and the estimator $\hat\theta$. Differentiating $\mathbb E_\theta[\hat\theta] = \theta$ component-wise gives $\mathrm{Cov}(\hat\theta, S_n) = I_d$ (the identity matrix). The total score has covariance $\mathrm{Cov}(S_n) = n I(\theta)$. Matrix Cauchy-Schwarz then gives $\mathrm{Cov}(\hat\theta) \succeq \mathrm{Cov}(\hat\theta, S_n)\, [\mathrm{Cov}(S_n)]^{-1} \mathrm{Cov}(S_n, \hat\theta) = (n I(\theta))^{-1}$. Specializing to a fixed direction $v$, $\mathrm{Var}(v^\top \hat\theta) = v^\top \mathrm{Cov}(\hat\theta)\, v \geq v^\top I(\theta)^{-1} v / n$. Note this is not $1/(n\, v^\top I(\theta)\, v)$: the matrix is inverted before the contraction with $v$, and inverting after the contraction gives a smaller quantity that is not a valid bound in general. The two coincide only when $v$ is an eigenvector of $I(\theta)$; in general they differ by the Schur-complement penalty for the orthogonal nuisance directions.
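A small numerical illustration of the directional bound, using a hypothetical 2×2 Fisher matrix (the numbers are arbitrary); it contrasts the correct quantity $v^\top I^{-1} v / n$ with the smaller, incorrect $1/(n\, v^\top I v)$:

```python
import numpy as np

# Directional Cramér-Rao bound for an assumed (hypothetical) per-observation Fisher
# matrix with correlated parameters. The correct bound is v' I^{-1} v / n, which is
# generally larger than the naive 1 / (n v' I v).
n = 100
I = np.array([[2.0, 1.5],
              [1.5, 4.0]])          # assumed per-observation Fisher information
I_inv = np.linalg.inv(I)
v = np.array([1.0, 0.0])            # direction: variance of the first coordinate

correct = v @ I_inv @ v / n         # Cramér-Rao bound in direction v
naive   = 1.0 / (n * (v @ I @ v))   # wrong quantity: inverse taken after contraction

print("v' I^{-1} v / n =", correct)   # larger: pays the Schur-complement penalty
print("1 / (n v' I v)  =", naive)

# For the first coordinate, the correct bound equals 1/(n * Schur complement).
schur = I[0, 0] - I[0, 1] * I[1, 0] / I[1, 1]
print("1 / (n * Schur) =", 1.0 / (n * schur))
```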

Why It Matters

Real models almost always have multiple parameters. The matrix bound governs how trade-offs between parameters propagate. A model whose Fisher matrix has a small eigenvalue is weakly identified in the corresponding direction: any unbiased estimator must accept large variance in that direction. This connects directly to ill-conditioned regression and to the natural gradient (which preconditions descent directions by $I^{-1}$ to remove this distortion).

Failure Mode

The bound applies to unbiased estimators only. The multivariate biased-estimator version (next theorem) replaces $I^{-1}$ with $\mathcal J I^{-1} \mathcal J^\top$, where $\mathcal J = \nabla_\theta \mathbb E[\hat\theta]$ is the Jacobian of the estimator's mean. When $I(\theta)$ is singular (rank-deficient, i.e. the model is non-identifiable in some direction), the bound is vacuous in that direction; use a generalized inverse $I^+$ and restrict to identifiable contrasts.

Biased Estimators: The Chain-Rule Bound

Theorem

Cramér-Rao Bound for Biased Estimators

Statement

For any (possibly biased) estimator $T(X)$ with mean $\psi(\theta) = \mathbb E_\theta[T(X)]$,

$$\mathrm{Cov}_\theta(T) \succeq \frac{1}{n}\, \mathcal J(\theta)\, I(\theta)^{-1}\, \mathcal J(\theta)^\top$$

where $\mathcal J(\theta) = \nabla_\theta \psi(\theta)$ is the Jacobian of $\psi$. In the scalar case this reduces to

$$\mathrm{Var}_\theta(T) \geq \frac{[\psi'(\theta)]^2}{n I(\theta)}.$$

Intuition

A biased estimator's mean shifts at rate $\psi'(\theta)$ as $\theta$ changes (not at rate $1$ as for unbiased estimators). The Cauchy-Schwarz argument now gives a covariance of $\psi'(\theta)$ between $T$ and the score, so the squared-covariance side of Cauchy-Schwarz becomes $[\psi'(\theta)]^2$ instead of $1$. The bound scales accordingly.

Proof Sketch

Repeat the scalar proof with $T$ replacing $\hat\theta$. Differentiating $\mathbb E_\theta[T] = \psi(\theta)$ and exchanging differentiation with integration gives $\mathrm{Cov}(T, S_n) = \psi'(\theta)$. Cauchy-Schwarz then gives $[\psi'(\theta)]^2 \leq \mathrm{Var}(T) \cdot n I(\theta)$.

Why It Matters

This is the bound that practitioners actually need. Most useful estimators are biased: ridge, LASSO, posterior means, shrinkage, kernel-smoothed estimates. The chain-rule form gives the variance lower bound for any such estimator, parameterized by how its mean depends on $\theta$. It also justifies the delta method: an unbiased estimator of $g(\theta)$ has variance lower-bounded by $[g'(\theta)]^2 / (n I(\theta))$, which matches the asymptotic variance of $g(\hat\theta_{\mathrm{MLE}})$ exactly under standard regularity.

Failure Mode

The chain-rule bound does not directly bound MSE, only variance. To bound MSE, add the squared bias: $\mathrm{MSE}(T) = \mathrm{Var}(T) + [\psi(\theta) - \theta]^2 \geq [\psi'(\theta)]^2 / (n I(\theta)) + [\psi(\theta) - \theta]^2$. The James-Stein estimator beats the sample mean precisely because its bias is small while its variance reduction is large; the Cramér-Rao bound in either form does not preclude this.
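A minimal sketch of the chain-rule bound in action, using a deliberately biased shrinkage estimator $T = c\bar X$ in the known-variance normal model (the shrinkage factor, constants, and seed are arbitrary choices):

```python
import numpy as np

# Chain-rule bound for a biased estimator T = c * Xbar with X_i ~ N(mu, sigma^2),
# sigma known. Here psi(mu) = c*mu, so the bound is [psi'(mu)]^2/(n I) = c^2 sigma^2/n,
# attained because T is a linear function of the efficient sample mean.
rng = np.random.default_rng(2)
mu, sigma, n, c, reps = 1.0, 2.0, 30, 0.8, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
T = c * X.mean(axis=1)

bound_var = c**2 * sigma**2 / n
print("Var(T)             ~", T.var(), " vs chain-rule bound", bound_var)  # equal here
print("MSE(T)             ~", ((T - mu)**2).mean())
print("var bound + bias^2  =", bound_var + ((c - 1) * mu)**2)
print("unbiased CR bound   =", sigma**2 / n)  # MSE of Xbar; T can beat it for small mu
```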

Sharper Bounds: Bhattacharyya and Hammersley-Chapman-Robbins

The Cramér-Rao bound uses only the first derivative of the log-likelihood. Two sharper bounds use higher-order information or no derivatives at all.

Theorem

Bhattacharyya Higher-Order Bound

Statement

Let $S_n^{(j)}(\theta) = \dfrac{1}{p_n(X_{1:n} \mid \theta)} \dfrac{\partial^j p_n(X_{1:n} \mid \theta)}{\partial \theta^j}$ for $j = 1, \ldots, k$, where $p_n$ is the joint density of the sample ($S_n^{(1)}$ is the total score). Define the Gram matrix $\mathcal B_n(\theta)$ with entries $[\mathcal B_n]_{ij} = \mathbb E[S_n^{(i)} S_n^{(j)}]$ (the Bhattacharyya information matrix). Then for any unbiased $\hat\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq e_1^\top \mathcal B_n(\theta)^{-1} e_1$$

where $e_1 = (1, 0, \ldots, 0)^\top \in \mathbb R^k$ (only the first derivative of $\mathbb E_\theta[\hat\theta] = \theta$ is nonzero). The $k=1$ case recovers the standard Cramér-Rao bound $1/(nI(\theta))$; $k \geq 2$ gives strictly tighter bounds when the higher-order information matrix is non-degenerate.

Intuition

The standard Cramér-Rao bound projects $\hat\theta$ onto the first score $S_n^{(1)}$ alone. The Bhattacharyya bound projects onto the larger linear span $\{S_n^{(1)}, S_n^{(2)}, \ldots, S_n^{(k)}\}$ of normalized higher-order likelihood derivatives, capturing more of the unbiased estimator's structure. The bound is sharper because the projection onto a larger subspace has equal-or-larger norm.

Proof Sketch

For each $j$, differentiate $\mathbb E[\hat\theta] = \theta$ exactly $j$ times under the integral sign. The $j=1$ derivative gives $\mathbb E[\hat\theta\, S_n^{(1)}] = 1$; higher-order derivatives give $\mathbb E[\hat\theta\, S_n^{(j)}] = 0$ for $j \geq 2$ (the second and higher derivatives of $\theta$ with respect to $\theta$ vanish). Apply Cauchy-Schwarz to the projection of $\hat\theta - \theta$ onto $\mathrm{span}(S_n^{(1)}, \ldots, S_n^{(k)})$ using the Gram inverse $\mathcal B_n^{-1}$.

Why It Matters

Bhattacharyya bounds quantify how much sharpness is left on the table by the standard Cramér-Rao bound. In some non-exponential models, the gap is large at finite $n$. The bound also generalizes naturally to the multivariate case (replace $e_1$ by an identity-block selection matrix).

Failure Mode

For one-parameter exponential families, all higher-order Bhattacharyya bounds collapse to the Cramér-Rao bound (the higher-order derivatives are linearly dependent on the first), so no sharpening is possible. Bhattacharyya helps in non-exponential families where the bound is not attained by the Cramér-Rao argument.

Theorem

Hammersley-Chapman-Robbins Bound (no regularity)

Statement

For any unbiased estimator $\hat\theta$ of $\theta$,

$$\mathrm{Var}_\theta(\hat\theta) \geq \sup_{\delta \neq 0} \frac{\delta^2}{\chi^2(p_{\theta + \delta} \,\|\, p_\theta)},$$

where the supremum runs over all $\delta \neq 0$ for which $p_{\theta+\delta}$ is absolutely continuous with respect to $p_\theta$, and $\chi^2(q \,\|\, p) = \mathbb E_p[(q/p - 1)^2] = \int (q - p)^2 / p \, dx$ is the chi-squared divergence. As $\delta \to 0$ in a Cramér-regular family, the bound recovers the Cramér-Rao bound (since $\chi^2(p_{\theta + \delta} \,\|\, p_\theta) = \delta^2 I(\theta) + o(\delta^2)$).

Intuition

Replace the score (a derivative) with a finite difference. The bound asks: how much can the parameter shift by $\delta$ before the data distribution becomes too distinguishable? An unbiased estimator that tracks $\theta$ exactly must also track shifts $\theta \mapsto \theta + \delta$, so its variance must be at least $\delta^2$ divided by the chi-squared distance between the two distributions. The supremum over $\delta$ tightens the bound; as $\delta \to 0$ it recovers the local Cramér-Rao bound, and for some families a finite $\delta$ gives a strictly sharper bound at finite $n$.

Proof Sketch

Define the likelihood ratio $L(x) = p(x \mid \theta + \delta) / p(x \mid \theta) - 1$. Since $\hat\theta$ is unbiased, $\mathbb E_\theta[\hat\theta L(X)] = \mathbb E_{\theta + \delta}[\hat\theta] - \mathbb E_\theta[\hat\theta] = (\theta + \delta) - \theta = \delta$. Apply Cauchy-Schwarz:

$$\delta^2 = [\mathrm{Cov}_\theta(\hat\theta, L)]^2 \leq \mathrm{Var}_\theta(\hat\theta) \cdot \mathrm{Var}_\theta(L) = \mathrm{Var}_\theta(\hat\theta) \cdot \chi^2(p_{\theta + \delta} \,\|\, p_\theta).$$

Rearrange and take the supremum over $\delta$.
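A small numerical check of the HCR bound in a regular family where everything is available in closed form; the Bernoulli parameter, sample size, and the grid of $\delta$ values are arbitrary choices:

```python
import numpy as np

# HCR bound for Bernoulli(p): for one observation chi^2(Ber(p+d) || Ber(p)) = d^2/(p(1-p)),
# so d^2/chi^2 = p(1-p) for every admissible d, matching 1/I(p) exactly. For n i.i.d.
# observations the product-measure chi^2 is (1 + d^2/(p(1-p)))^n - 1, and the supremum
# over d is approached as d -> 0, recovering the Cramér-Rao bound p(1-p)/n.
p, n = 0.3, 25
deltas = np.linspace(1e-3, 0.25, 1000)          # chi^2 depends only on d^2, so d > 0 suffices

chi2_n = (1.0 + deltas**2 / (p * (1 - p)))**n - 1.0
hcr = np.max(deltas**2 / chi2_n)

print("HCR bound (n obs)         ~", hcr)
print("Cramér-Rao bound p(1-p)/n  =", p * (1 - p) / n)
```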

Why It Matters

This is the bound to use when Cramér regularity fails. For $X \sim U(0, \theta)$, the chi-squared divergence between $U(0, \theta + \delta)$ and $U(0, \theta)$ is finite for $\delta < 0$ (and infinite for $\delta > 0$), and computing the supremum gives a non-trivial lower bound on the variance of any unbiased estimator. The HCR bound also handles discrete parameters, irregular boundaries, and any setting where the score is undefined or unbounded.

Failure Mode

The chi-squared divergence can be infinite (e.g. when the support of $p_{\theta+\delta}$ is not contained in that of $p_\theta$), in which case the bound for that $\delta$ is zero (vacuous). Choose $\delta$ small enough that $\chi^2$ is finite. For complicated parametric families, the HCR bound is hard to compute analytically; the Cramér-Rao bound is preferred when regularity holds.

The Bayesian Extension: van Trees Inequality

Theorem

van Trees (Bayesian Cramér-Rao) Inequality

Statement

For any estimator $\hat\theta(X)$ of $\theta$ (no unbiasedness required),

$$\mathbb E_\pi \mathbb E_\theta[(\hat\theta - \theta)^2] \geq \frac{1}{\mathbb E_\pi[n I(\theta)] + I(\pi)}$$

where $I(\pi) = \int (\pi'(\theta) / \pi(\theta))^2\, \pi(\theta)\, d\theta$ is the Fisher information of the prior (its Fisher information as a location family).

Intuition

The Bayesian Cramér-Rao bound combines two information sources: the data information $\mathbb E_\pi[n I(\theta)]$ (averaged over the prior) and the prior information $I(\pi)$. As the prior becomes flat (informationless), $I(\pi) \to 0$ and the bound recovers a Bayes-averaged Cramér-Rao bound. As the data dominates, the prior information becomes negligible and the asymptotic Bayes risk matches the asymptotic frequentist Cramér-Rao bound.

Proof Sketch

Apply Cauchy-Schwarz to the joint score $S(x, \theta) = \partial_\theta \log[\pi(\theta)\, p(x \mid \theta)] = \pi'(\theta)/\pi(\theta) + S_n(\theta)$, whose second moment under the joint distribution of $(X, \theta)$ is $\mathbb E_\pi[n I(\theta)] + I(\pi)$ (the cross term vanishes because the score has mean zero given $\theta$). The covariance $\mathrm{Cov}(\hat\theta - \theta, S)$ equals $1$ by an integration-by-parts argument that uses the prior vanishing at the boundary. Cauchy-Schwarz then gives the bound.
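A minimal Monte Carlo sketch in the conjugate normal-location model, where the van Trees bound is attained exactly by the posterior mean; the prior scale, noise level, and seed are arbitrary choices:

```python
import numpy as np

# van Trees bound for theta ~ N(0, tau^2) and X_i | theta ~ N(theta, sigma^2):
# E_pi[n I(theta)] = n/sigma^2 and I(pi) = 1/tau^2, so the bound is 1/(n/sigma^2 + 1/tau^2).
# In this conjugate model the posterior mean attains it (its Bayes risk equals the
# posterior variance).
rng = np.random.default_rng(3)
tau, sigma, n, reps = 1.0, 2.0, 20, 200_000

theta = rng.normal(0.0, tau, size=reps)
X = rng.normal(theta[:, None], sigma, size=(reps, n))

w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)   # posterior-mean shrinkage weight
theta_hat = w * X.mean(axis=1)                     # posterior mean under the N(0, tau^2) prior

bayes_risk = ((theta_hat - theta)**2).mean()
van_trees = 1.0 / (n / sigma**2 + 1.0 / tau**2)
print("Bayes risk of posterior mean ~", bayes_risk)
print("van Trees lower bound         =", van_trees)
```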

Why It Matters

The van Trees inequality is the standard tool for proving minimax lower bounds. Combined with a careful choice of prior (often supported on a small ball around the parameter of interest), it yields explicit lower bounds on the worst-case Bayes risk, which dualizes to a lower bound on the minimax risk. This is one of the two main routes to minimax lower bounds, the other being Le Cam's two-point method.

Failure Mode

The bound requires the prior to vanish at the boundary (so the integration by parts has no boundary term). Improper priors (uniform on the whole real line) violate this; a careful localization argument is needed. For multidimensional $\theta$, the matrix version replaces sums with the matrix sum $\mathbb E_\pi[n I(\theta)] + I(\pi)$ in the appropriate Loewner sense.

Achievability: When Does an Efficient Unbiased Estimator Exist?

The Cramér-Rao bound is achieved by an unbiased estimator at every $\theta$ if and only if the family is a one-parameter exponential family and the estimand is the mean-value parameter of the canonical sufficient statistic. The proof: equality in Cauchy-Schwarz forces $\hat\theta - \theta = c(\theta) \cdot S_n$ for some function $c$, which integrates to the exponential-family form $\log p(x \mid \eta) = \eta T(x) - A(\eta) + h(x)$ with estimand $\theta = A'(\eta)$, estimator $\hat\theta = \bar T = \frac{1}{n}\sum_i T(X_i)$, and the (chain-rule) Cramér-Rao bound met as $\mathrm{Var}(\bar T) = A''(\eta)/n = [A''(\eta)]^2 / (n I(\eta))$.

For other families, the best possible unbiased estimator is the uniformly minimum variance unbiased estimator (UMVUE), which is found by Rao-Blackwellizing any unbiased estimator with respect to a complete sufficient statistic (Rao-Blackwellization). The UMVUE is unique a.s. when a complete sufficient statistic exists (Lehmann-Scheffé). It need not attain the Cramér-Rao bound at finite $n$; the gap $\mathrm{Var}(\mathrm{UMVUE}) - 1/(nI)$ is non-negative and is typically of order $1/n^2$ in regular models (a relative gap of order $1/n$).

In multi-parameter settings with nuisance parameters, the relevant lower bound is the efficient information $I^* = I_{\mu\mu} - I_{\mu\eta} I_{\eta\eta}^{-1} I_{\eta\mu}$ for a parameter of interest $\mu$ in the presence of a nuisance $\eta$ (Bickel-Klaassen-Ritov-Wellner, Efficient and Adaptive Estimation, 1993). The MLE for $\mu$ is asymptotically efficient with respect to $(I^*)^{-1}$, not $(I_{\mu\mu})^{-1}$. This is a strict information loss whenever $I_{\mu\eta} \neq 0$.
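A short numerical illustration of the efficient information, using simple linear regression with the intercept treated as a nuisance parameter (the design, noise level, and seed are arbitrary choices):

```python
import numpy as np

# Efficient information for the slope in y = alpha + beta*x + noise, intercept as nuisance.
# Full-sample Fisher matrix (sigma known), parameter order (alpha, beta):
#   I = (1/sigma^2) [[n, sum(x)], [sum(x), sum(x^2)]].
# The Schur complement I* for beta is sum((x - xbar)^2)/sigma^2, smaller than
# I_bb = sum(x^2)/sigma^2 whenever xbar != 0, so the achievable variance (I*)^{-1} is larger.
rng = np.random.default_rng(4)
n, sigma = 200, 1.0
x = rng.normal(3.0, 1.0, size=n)        # non-centered design, so I_{alpha,beta} != 0

I = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]]) / sigma**2
I_bb = I[1, 1]
I_star = I[1, 1] - I[1, 0] * I[0, 1] / I[0, 0]      # Schur complement

print("naive bound   1/I_bb =", 1.0 / I_bb)
print("correct bound 1/I*   =", 1.0 / I_star)
print("matches sigma^2/Sxx  =", sigma**2 / ((x - x.mean())**2).sum())
```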

Canonical Examples

Example

Normal mean with known variance

Let $X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. The score is $s(x; \mu) = (x - \mu)/\sigma^2$, so $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \geq \sigma^2/n$. The sample mean $\bar X$ has $\mathrm{Var}(\bar X) = \sigma^2/n$, attaining the bound exactly. This is a one-parameter exponential family with $T(x) = x$, so finite-sample efficiency is expected.

Example

Normal mean and variance jointly: Loewner ordering

Let $X_1, \ldots, X_n \sim \mathcal N(\mu, \sigma^2)$ with both unknown. The score vector is $s = \big((x-\mu)/\sigma^2,\; -1/(2\sigma^2) + (x-\mu)^2/(2\sigma^4)\big)$. Direct computation gives the Fisher information matrix

$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The matrix Cramér-Rao bound is $\mathrm{Cov}(\hat\mu, \hat{\sigma^2}) \succeq \mathrm{diag}(\sigma^2/n,\; 2\sigma^4/n)$. The unbiased estimators are $\bar X$ for $\mu$ (variance $\sigma^2/n$, attains the bound) and $S^2 = \frac{1}{n-1}\sum_i (X_i - \bar X)^2$ for $\sigma^2$ (variance $2\sigma^4/(n-1) > 2\sigma^4/n$, strictly above the bound at finite $n$, with the gap $2\sigma^4/[n(n-1)]$ closing at rate $1/n^2$).
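A quick Monte Carlo check of both claims (the constants and seed are arbitrary choices):

```python
import numpy as np

# N(mu, sigma^2) with both parameters unknown: Xbar attains its bound sigma^2/n,
# while the unbiased S^2 has variance 2 sigma^4/(n-1), strictly above 2 sigma^4/n.
rng = np.random.default_rng(5)
mu, sigma, n, reps = 0.0, 1.0, 10, 500_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)                 # unbiased sample variance

print("Var(Xbar) ~", xbar.var(), " bound:", sigma**2 / n)
print("Var(S^2)  ~", s2.var(),   " bound:", 2 * sigma**4 / n,
      " exact:", 2 * sigma**4 / (n - 1))
```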

Example

Poisson rate

Let $X_1, \ldots, X_n \sim \mathrm{Poisson}(\lambda)$. The score is $s(x; \lambda) = x/\lambda - 1$, so $I(\lambda) = 1/\lambda$. The Cramér-Rao bound is $\mathrm{Var}(\hat\lambda) \geq \lambda/n$, and the sample mean $\bar X$ has $\mathrm{Var}(\bar X) = \lambda/n$. Attained, since this is a one-parameter exponential family with $T(x) = x$.

Example

Cauchy location: efficient unbiased estimator does not exist at finite n

Let $X_1, \ldots, X_n \sim \mathrm{Cauchy}(\theta, 1)$ with density $p(x \mid \theta) = 1/[\pi(1 + (x - \theta)^2)]$. The score is $s(x; \theta) = 2(x - \theta)/[1 + (x - \theta)^2]$ and $I(\theta) = 1/2$. The Cramér-Rao bound is $\mathrm{Var}(\hat\theta) \geq 2/n$. No closed-form unbiased estimator achieves this bound at finite $n$ (the Cauchy is not an exponential family). The MLE $\hat\theta_n$ achieves it asymptotically: $\sqrt n(\hat\theta_n - \theta) \Rightarrow \mathcal N(0, 2)$. Note the sample mean has infinite variance for Cauchy data and is not even a consistent estimator; only the MLE (or a robust estimator like the sample median) works.
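A Monte Carlo sketch of asymptotic efficiency for the Cauchy location MLE, computed here by numerical optimization with SciPy (the bracketing around the sample median and all constants are arbitrary implementation choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# The Cauchy location MLE is asymptotically efficient: Var(theta_hat) approaches the
# Cramér-Rao bound 2/n, which no unbiased estimator attains at finite n.
# The sample median is shown for comparison.
rng = np.random.default_rng(6)
theta0, n, reps = 0.0, 200, 2000

def cauchy_mle(x):
    nll = lambda t: np.log1p((x - t)**2).sum()   # negative log-likelihood up to constants
    return minimize_scalar(nll, bracket=(np.median(x) - 1, np.median(x) + 1)).x

samples = theta0 + rng.standard_cauchy((reps, n))
mle = np.array([cauchy_mle(x) for x in samples])
med = np.median(samples, axis=1)

print("Var(MLE)    ~", mle.var(), " Cramér-Rao bound 2/n =", 2 / n)
print("Var(median) ~", med.var(), " asymptotic pi^2/(4n) =", np.pi**2 / (4 * n))
```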

Example

Uniform U(0, theta): Cramér-Rao does not apply, HCR does

Let $X_1, \ldots, X_n \sim U(0, \theta)$. The support depends on $\theta$, so Cramér regularity fails and the Cramér-Rao bound is silent. Use the Hammersley-Chapman-Robbins bound: for $\delta < 0$, $\chi^2(U(0, \theta + \delta) \,\|\, U(0, \theta)) = -\delta/(\theta + \delta)$ for one observation, so for an i.i.d. sample of size $n$ the chi-squared divergence is $(\theta/(\theta + \delta))^n - 1$. Optimizing $\delta^2 / [(\theta/(\theta + \delta))^n - 1]$ over $\delta$ gives a bound of order $\theta^2/n^2$, matching the actual $\Theta(\theta^2/n^2)$ variance of the MLE $\hat\theta = X_{(n)}$.
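A numerical version of this optimization (the grid, $\theta$, and $n$ are arbitrary choices):

```python
import numpy as np

# HCR bound for U(0, theta) with n observations: the n-fold product chi^2 between
# U(0, theta+d) and U(0, theta), d < 0, is (theta/(theta+d))^n - 1. The optimized bound
# is of order theta^2/n^2, the same rate as Var(X_(n)) = n theta^2 / ((n+1)^2 (n+2)).
theta, n = 1.0, 50
d = np.linspace(-0.5 * theta, -1e-6, 100_000)      # only d < 0 keeps chi^2 finite

chi2_n = (theta / (theta + d))**n - 1.0
hcr = np.max(d**2 / chi2_n)

var_mle = n * theta**2 / ((n + 1)**2 * (n + 2))    # exact variance of X_(n)
print("HCR lower bound         ~", hcr)
print("Var(X_(n)) (MLE)         =", var_mle)
print("theta^2 / n^2 reference  =", theta**2 / n**2)
```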

Example

Linear regression beta hat is efficient

Let $Y \mid X \sim \mathcal N(X\beta, \sigma^2 I_n)$ with $X \in \mathbb R^{n \times d}$ fixed and full-rank. The Fisher information for $\beta$ (with $\sigma^2$ known) is $I(\beta) = X^\top X / \sigma^2$. The Cramér-Rao bound is $\mathrm{Cov}(\hat\beta) \succeq \sigma^2 (X^\top X)^{-1}$. The OLS estimator $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y$ has covariance exactly $\sigma^2 (X^\top X)^{-1}$, attaining the bound. This is the Gauss-Markov theorem in its Cramér-Rao form: OLS is efficient under Gaussian noise (and BLUE under any noise with finite variance).
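A Monte Carlo check that OLS attains the matrix bound (the design, noise level, and seed are arbitrary choices):

```python
import numpy as np

# OLS under Gaussian noise with a fixed design attains the matrix Cramér-Rao bound
# sigma^2 (X'X)^{-1}; the empirical covariance over replications should match it.
rng = np.random.default_rng(7)
n, d, sigma = 100, 3, 0.5
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
bound = sigma**2 * np.linalg.inv(X.T @ X)

reps = 50_000
Y = X @ beta + sigma * rng.normal(size=(reps, n))     # reps independent response vectors
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y.T).T      # OLS for every replication

print("empirical Cov(beta_hat):\n", np.cov(beta_hat.T))
print("Cramér-Rao bound sigma^2 (X'X)^{-1}:\n", bound)
```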

Common Confusions

Watch Out

Cramér-Rao does not bound MSE for biased estimators

The bound $\mathrm{Var}(\hat\theta) \geq 1/(nI)$ governs variance, not MSE. MSE = variance + bias². A biased estimator can have lower MSE than $1/(nI)$ when the bias is small and the variance is much smaller. James-Stein, ridge regression, LASSO, posterior means, and shrinkage estimators all do this systematically. The bound is vacuous as a statement about MSE for the biased class.

Watch Out

Asymptotic efficiency is weaker than finite-sample efficiency

The MLE is asymptotically efficient under standard regularity: $\sqrt n (\hat\theta_n - \theta) \Rightarrow \mathcal N(0, I(\theta)^{-1})$. This says the limiting variance equals the Cramér-Rao bound. It does not say the MLE attains the bound at any finite $n$ (it generally does not, except in one-parameter exponential families). Finite-sample variance is typically $\mathrm{Var}(\hat\theta_n) = 1/(nI) + O(1/n^2)$, with the second-order term computable via the Bartlett correction or Edgeworth expansion.

Watch Out

The bound applies to unbiased estimators of theta, not of g(theta)

The Cramér-Rao bound for the parameter $\theta$ is $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. For an unbiased estimator $T$ of a transformed parameter $g(\theta)$, the relevant bound is $\mathrm{Var}(T) \geq [g'(\theta)]^2 / (nI(\theta))$, which is the chain-rule version. Conflating the two is a common error: e.g. estimating $\sigma^2$ vs $\sigma$ requires applying the chain rule with $g(\sigma^2) = \sqrt{\sigma^2}$, $g'(\sigma^2) = 1/(2\sqrt{\sigma^2})$.

Watch Out

The bound is local at theta, not uniform in theta

The Cramér-Rao bound is a function of $\theta$: $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. An estimator can attain the bound at one value of $\theta$ and exceed it at others; uniform attainability characterizes one-parameter exponential families. For minimax-optimal inference (worst-case over $\theta$), use the van Trees inequality or Le Cam's method instead.

Watch Out

With nuisance parameters, the right denominator is efficient information

For a parameter of interest $\mu$ with a nuisance parameter $\eta$, the asymptotic variance of any regular estimator of $\mu$ is bounded below by $(I^*)^{-1}$, where $I^* = I_{\mu\mu} - I_{\mu\eta} I_{\eta\eta}^{-1} I_{\eta\mu}$ is the Schur complement (the efficient information). Using $(I_{\mu\mu})^{-1}$ (the marginal Fisher information) is wrong whenever $I_{\mu\eta} \neq 0$; it under-states the achievable variance. The MLE for $\mu$ is asymptotically efficient with respect to $(I^*)^{-1}$.

Watch Out

Singular Fisher matrix means the model is non-identifiable in some direction

A singular Fisher information matrix $I(\theta)$ has zero eigenvalues, indicating directions in parameter space along which the likelihood is locally flat. No data set, however large, can distinguish parameters along these directions; the Cramér-Rao bound is infinite (vacuous) in those directions. Either reparameterize to identify only the estimable contrasts, or impose regularization (a prior, a sparsity constraint) to pin down the model.

Exercises

ExerciseCore

Problem

Compute the Fisher information for $X \sim \mathrm{Bernoulli}(p)$ and state the Cramér-Rao bound for estimating $p$ from $n$ i.i.d. observations. Verify that $\bar X = \frac{1}{n}\sum_i X_i$ is efficient.

ExerciseCore

Problem

Let $X_1, \ldots, X_n$ be i.i.d. from a one-parameter exponential family in canonical form: $p(x \mid \eta) = h(x) \exp(\eta T(x) - A(\eta))$. Let $\hat\theta = \bar T = \frac{1}{n}\sum_i T(X_i)$ estimate $\mu(\eta) = A'(\eta) = \mathbb E_\eta[T(X)]$.

(a) Show $\hat\theta$ is unbiased for $\mu(\eta)$. (b) Compute $I(\eta)$ for one observation. (c) Show $\mathrm{Var}(\hat\theta) = A''(\eta)/n$ and that this attains the chain-rule Cramér-Rao bound for estimating $\mu(\eta)$.

ExerciseAdvanced

Problem

Show that the Cramér-Rao bound for estimating $\sigma^2$ from $X_1, \ldots, X_n \sim \mathcal N(0, \sigma^2)$ is $2\sigma^4/n$. The unbiased estimator $\hat\sigma^2 = \frac{1}{n}\sum_i X_i^2$ has variance $2\sigma^4/n$, exactly attaining the bound. Now consider estimating $\sigma$ (not $\sigma^2$). Compute the chain-rule Cramér-Rao bound for $\sigma$, and show that the unbiased estimator $c_n \sqrt{\sum_i X_i^2}$ (with $c_n$ chosen for unbiasedness) has variance strictly larger than this bound at finite $n$.

ExerciseAdvanced

Problem

Consider the location family $X_1, \ldots, X_n \sim p(x - \theta)$ for some symmetric, smooth density $p$. Show that the Fisher information is $I(\theta) = \int (p'(z))^2 / p(z)\, dz$, independent of $\theta$. Compute $I(\theta)$ for $p$ = Gaussian, Laplace, and Cauchy. Comment on which has the highest Fisher information per observation.

ExerciseResearch

Problem

Use the van Trees inequality to derive a lower bound on the minimax risk $\inf_{\hat\theta} \sup_{\theta \in [0, 1]} \mathbb E_\theta[(\hat\theta - \theta)^2]$ for estimating $\theta$ from $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(\theta)$ with $\theta \in [0, 1]$. Choose a smooth prior supported on a small ball around any fixed $\theta_0 \in (0, 1)$, apply van Trees, then take the supremum over $\theta_0$.


References

Canonical:

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury. Chapter 7.3 (Cramér-Rao inequality and efficiency).
  • Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Chapter 2 (unbiasedness, Cramér-Rao, UMVUE), Chapter 6 (multivariate Cramér-Rao).
  • Schervish, M. J. (1995). Theory of Statistics. Springer. Section 2.3 (information inequalities, Bhattacharyya and HCR bounds).
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Original treatment of the inequality (Section 32.3).
  • Rao, C. R. (1945). "Information and the accuracy attainable in the estimation of statistical parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. The other half of the eponymous bound.

Current:

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapter 8 (efficient estimators, asymptotic Cramér-Rao bound), Chapter 25 (semiparametric efficient information).
  • Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press. Chapter 3 (efficient information with nuisance parameters; the modern reference for II^*).
  • Keener, R. W. (2010). Theoretical Statistics. Springer. Chapter 3 (unbiased estimation and efficiency), Chapter 4 (UMVUE via Lehmann-Scheffé).
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Chapter 11.10 (Fisher information and the Cramér-Rao bound, KL connection).
  • Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer. Chapter 2 (van Trees inequality and Le Cam's two-point method for minimax lower bounds).

Critique and refinements:

  • Hammersley, J. M. (1950). "On estimating restricted parameters." Journal of the Royal Statistical Society B, 12, 192-240. The HCR bound.
  • Chapman, D. G., & Robbins, H. (1951). "Minimum variance estimation without regularity assumptions." Annals of Mathematical Statistics, 22, 581-586. The other half of the eponymous bound.
  • Bhattacharyya, A. (1946). "On some analogues of the amount of information and their use in statistical estimation." Sankhyā, 8, 1-14. Higher-order information bounds.
  • van Trees, H. L. (1968). Detection, Estimation, and Modulation Theory, Part I. Wiley. The Bayesian Cramér-Rao inequality (Chapter 2).
  • Stein, C. (1956). "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution." Proceedings of the Third Berkeley Symposium, 1, 197-206. Shows that biased shrinkage estimators beat the Cramér-Rao-efficient sample mean in MSE for $d \geq 3$. See shrinkage estimation.
  • Brown, L. D., & Gajek, L. (1990). "Information inequalities for the Bayes risk." Annals of Statistics, 18, 1578-1594. Sharper Bayesian information inequalities than van Trees in some settings.


Last reviewed: April 19, 2026
