

Gauss-Markov Theorem

Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is the "BLUE" result, and understanding when its assumptions fail is just as important as the theorem itself.


Why This Matters

Gauss–Markov: OLS is the orthogonal projection onto col(X) and the smallest-variance linear unbiased estimator

When you run a linear regression using ordinary least squares (OLS), you are making an implicit claim: this is the best way to estimate the coefficients. The Gauss-Markov theorem tells you exactly when that claim is justified, and when it is not.

Understanding Gauss-Markov is essential because it tells you the default choice (OLS) and the conditions under which you should deviate from it.

Quick Version

  • What it proves. Among all linear unbiased estimators, OLS has the smallest variance.
  • What “best” means. Best means variance-optimal, not universally optimal across biased or nonlinear estimators.
  • What assumptions matter. Zero-mean, homoscedastic, uncorrelated errors, with a fixed full-rank design matrix.
  • What it does not require. Gaussian noise is not needed for the theorem itself; normality only enters later for exact finite-sample inference.

Mental Model

You want to estimate the coefficients $\beta$ in a linear model. There are many possible linear estimators: weighted least squares, ridge regression (which is biased), and arbitrary linear combinations of the data. Among all estimators that are both linear and unbiased, OLS gives you the one with the smallest variance.

This is a strong optimality guarantee, but it comes with conditions. Break the conditions and the guarantee evaporates.

Formal Setup

Definition

Linear Regression Model

The linear regression model is:

$$y = X\beta + \epsilon$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and full column rank), $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\epsilon \in \mathbb{R}^n$ is the error vector.

Definition

Gauss-Markov Assumptions

The Gauss-Markov assumptions on the error vector $\epsilon$ are:

  1. Zero mean: $\mathbb{E}[\epsilon] = 0$
  2. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$
  3. Uncorrelated errors: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$

In matrix form: $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I$.

Note: Gaussian (normal) errors are not required.

Definition

OLS Estimator

The ordinary least squares estimator is:

$$\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$$

This minimizes the sum of squared residuals $\|y - X\beta\|^2$.
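As a quick numerical check (a minimal NumPy sketch on simulated data; all variable names are illustrative), the normal-equations formula agrees with a numerically stabler least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))          # fixed, full-column-rank design
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, prefer a QR/SVD-based solver for numerical stability.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes minimize the same residual sum of squares; they differ only in conditioning, which matters when $X^T X$ is nearly singular.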

Definition

Linear Unbiased Estimator

An estimator of the form $\tilde{\beta} = Cy$ is linear (a linear function of $y$); it is unbiased if $\mathbb{E}[\tilde{\beta}] = \beta$ for all $\beta$, which requires $CX = I_p$.

The Theorem

Theorem

Gauss-Markov Theorem

Statement

Under the Gauss-Markov assumptions, the OLS estimator $\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$ is BLUE, the Best Linear Unbiased Estimator. That is, for any other linear unbiased estimator $\tilde{\beta} = Cy$, the difference of covariance matrices is positive semidefinite in the Loewner ordering:

$$\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{\text{OLS}}) \succeq 0.$$

This is the substantive content of the theorem. Two corollaries follow directly: (1) coordinate-wise variance dominance, $\text{Var}(\tilde{\beta}_j) \geq \text{Var}(\hat{\beta}_{\text{OLS},j})$ for all $j$ (take $a = e_j$ in $a^\top \Sigma a \geq 0$, where $\Sigma$ is the covariance difference); (2) variance dominance for any linear contrast, $\text{Var}(a^\top \tilde{\beta}) \geq \text{Var}(a^\top \hat{\beta}_{\text{OLS}})$ for every $a \in \mathbb{R}^p$. The Loewner statement is strictly stronger than the coordinate-wise inequalities: a matrix with nonnegative diagonal entries need not be PSD, so coordinate dominance alone does not recover contrast dominance.

Intuition

OLS is the most efficient use of the data among all linear unbiased methods. Any other linear unbiased estimator must have equal or larger variance for every coefficient. You cannot do better without either (a) introducing bias, (b) using a nonlinear estimator, or (c) exploiting additional knowledge about the errors.

Proof Sketch

Let $\tilde{\beta} = Cy$ be any linear unbiased estimator. Write $C = (X^T X)^{-1} X^T + D$ for some matrix $D$. Unbiasedness ($CX = I$) forces $DX = 0$. Compute the variance:

$$\text{Var}(\tilde{\beta}) = \sigma^2 CC^T = \sigma^2 (X^T X)^{-1} + \sigma^2 DD^T$$

Since $DD^T \succeq 0$, the difference $\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 DD^T$ is positive semidefinite in the Loewner ordering. The cross term vanishes because $DX = 0$.
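The proof sketch can be checked numerically: any matrix $D$ whose rows lie in the orthogonal complement of $\text{col}(X)$ satisfies $DX = 0$, and the resulting estimator's covariance exceeds the OLS covariance by a PSD matrix. A minimal NumPy sketch (simulated design; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T                    # OLS weight matrix: beta_hat = C_ols @ y

# Build D with D X = 0 by projecting arbitrary rows onto col(X)-perp.
P = X @ XtX_inv @ X.T                    # hat matrix: projection onto col(X)
D = rng.normal(size=(p, n)) @ (np.eye(n) - P)

C = C_ols + D                            # another linear unbiased estimator
var_ols = sigma2 * C_ols @ C_ols.T       # = sigma^2 (X^T X)^{-1}
var_alt = sigma2 * C @ C.T               # = var_ols + sigma^2 D D^T

# The difference is PSD: every eigenvalue is (numerically) nonnegative.
eigs = np.linalg.eigvalsh(var_alt - var_ols)
```

The projection trick parameterizes the entire class of linear unbiased estimators, which is exactly the decomposition used in the proof.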

Why It Matters

This theorem justifies OLS as the default method for linear regression. If the Gauss-Markov assumptions hold, there is no reason to use any other linear unbiased estimator. This is why OLS is taught first and used most often.

Failure Mode

The theorem fails when: (1) errors are heteroscedastic ($\text{Var}(\epsilon_i) \neq \sigma^2$), (2) errors are correlated ($\text{Cov}(\epsilon_i, \epsilon_j) \neq 0$), or (3) you are willing to accept bias in exchange for lower variance (regularization). In all three cases, OLS is no longer optimal.

What BLUE Means

B = Best (minimum variance)

L = Linear (estimator is a linear function of $y$)

U = Unbiased ($\mathbb{E}[\hat{\beta}] = \beta$)

E = Estimator

The restriction to linear estimators is important. There may exist nonlinear unbiased estimators with lower variance. And biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance.

When Assumptions Fail

Heteroscedasticity

If $\text{Var}(\epsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Use weighted least squares (WLS):

$$\hat{\beta}_{\text{WLS}} = (X^T W X)^{-1} X^T W y$$

where $W = \text{diag}(1/\sigma_1^2, \ldots, 1/\sigma_n^2)$. WLS is BLUE under heteroscedastic errors.
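A small Monte Carlo sketch (NumPy, simulated heteroscedastic data with known weights; a toy setup, not a recipe) shows WLS beating OLS in sampling variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])
sigma_i = 0.2 * x                          # error sd grows with the regressor

w = 1.0 / sigma_i**2                       # WLS weights: inverse variances
XtW = X.T * w                              # X^T W with W = diag(w)

# Repeatedly redraw the errors and record both estimators.
ols_est = np.empty((2000, 2))
wls_est = np.empty((2000, 2))
for r in range(2000):
    y = X @ beta + rng.normal(scale=sigma_i)
    ols_est[r] = np.linalg.lstsq(X, y, rcond=None)[0]
    wls_est[r] = np.linalg.solve(XtW @ X, XtW @ y)

ols_var = ols_est.var(axis=0)
wls_var = wls_est.var(axis=0)              # WLS is BLUE here, so smaller
```

Both estimators stay unbiased; only their variances differ, which is the precise sense in which OLS loses "best" while keeping "linear unbiased."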

Heteroscedasticity-Robust Standard Errors

WLS requires knowing the variance pattern $\sigma_i^2$. In practice you rarely do. The applied workflow is to keep using OLS (which remains unbiased and consistent) but replace the textbook covariance estimate with a heteroscedasticity-consistent (HC) sandwich estimator. White (1980) showed that

$$\widehat{\text{Var}}_{\text{HC}}(\hat{\beta}_{\text{OLS}}) = (X^T X)^{-1} \Big(\sum_i \hat{\epsilon}_i^2 x_i x_i^T\Big) (X^T X)^{-1}$$

is consistent for the asymptotic variance of OLS under arbitrary heteroscedasticity. This is the HC0 estimator. Finite-sample variants (HC1 multiplies by $n/(n-p)$; HC2 and HC3 reweight the residuals by the hat-matrix leverages $h_{ii}$, with HC3 the most conservative) reduce small-sample bias; MacKinnon and White (1985) and Long and Ervin (2000) recommend HC3 as the default in samples below a few hundred. These let you report valid standard errors without committing to a specific WLS weighting.
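The sandwich formula above can be sketched in a few lines of NumPy. The `hc_se` helper below is hypothetical (in real work use a statistics library's robust-covariance options), but it implements HC0, HC1, and HC3 as defined in the text:

```python
import numpy as np

def hc_se(X, y, kind="HC1"):
    """OLS coefficients with heteroscedasticity-consistent (sandwich) SEs."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    if kind == "HC0":                      # White (1980): raw squared residuals
        u2 = resid**2
    elif kind == "HC1":                    # degrees-of-freedom correction n/(n-p)
        u2 = resid**2 * n / (n - p)
    elif kind == "HC3":                    # leverage-based, most conservative
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat-matrix diagonal h_ii
        u2 = resid**2 / (1.0 - h) ** 2
    else:
        raise ValueError(kind)
    meat = (X * u2[:, None]).T @ X         # sum_i u_i^2 x_i x_i^T
    cov = XtX_inv @ meat @ XtX_inv         # bread * meat * bread
    return beta_hat, np.sqrt(np.diag(cov))
```

Note that the point estimates are untouched; only the reported uncertainty changes, which is the whole appeal of the approach.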

Correlated Errors

If $\text{Var}(\epsilon) = \sigma^2 \Omega$ where $\Omega \neq I$, use generalized least squares (GLS):

$$\hat{\beta}_{\text{GLS}} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$$

GLS is BLUE under the general covariance structure; OLS ignores the correlations and is inefficient. Aitken's theorem (1935) gives the proof: pre-multiply the model by $\Omega^{-1/2}$ to get $\Omega^{-1/2} y = \Omega^{-1/2} X \beta + \Omega^{-1/2} \epsilon$, in which the transformed errors satisfy the original Gauss-Markov assumptions, so OLS on the transformed model is BLUE. That estimator is exactly $\hat{\beta}_{\text{GLS}}$.
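Aitken's whitening argument can be sketched numerically. Any square root of $\Omega$ works, so the sketch below uses a Cholesky factor $L$ (with $\Omega = LL^T$) in place of $\Omega^{-1/2}$; the error covariance is a hypothetical AR(1)-style example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -2.0])

# Hypothetical AR(1)-style error covariance: Omega_ij = rho^|i-j|.
rho = 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

L = np.linalg.cholesky(Omega)            # Omega = L L^T, so L^{-1} whitens
y = X @ beta + L @ rng.normal(size=n)    # errors with covariance Omega

# Aitken: OLS on the whitened model (L^{-1} y, L^{-1} X) is GLS.
Xs = np.linalg.solve(L, X)
ys = np.linalg.solve(L, y)
beta_gls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Matches the direct GLS formula.
Oi = np.linalg.inv(Omega)
beta_direct = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```

The Cholesky route is also how GLS is computed in practice: it avoids forming $\Omega^{-1}$ explicitly.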

Cluster-Robust Standard Errors

When observations cluster (panel data, students within schools, repeated measures), errors are correlated within cluster but uncorrelated across clusters. The cluster-robust sandwich estimator (Liang and Zeger, 1986) groups the meat of the sandwich by cluster:

$$\widehat{\text{Var}}_{\text{CR}}(\hat{\beta}_{\text{OLS}}) = (X^T X)^{-1} \Big(\sum_g X_g^T \hat{\epsilon}_g \hat{\epsilon}_g^T X_g\Big) (X^T X)^{-1}$$

where $g$ indexes clusters. This is the standard choice of standard error in panel econometrics; it requires many clusters to be valid asymptotically (Cameron and Miller, 2015).
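The per-cluster grouping of the meat can be sketched directly from the formula. The `cluster_se` helper below is hypothetical (production code should use a library's clustered-covariance option), written under the assumption that `groups` labels each row's cluster:

```python
import numpy as np

def cluster_se(X, y, groups):
    """OLS coefficients with cluster-robust (Liang-Zeger) standard errors."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(groups):
        m = groups == g
        sg = X[m].T @ resid[m]             # cluster score sum X_g^T e_hat_g
        meat += np.outer(sg, sg)           # X_g^T e_g e_g^T X_g
    cov = XtX_inv @ meat @ XtX_inv
    return beta_hat, np.sqrt(np.diag(cov))
```

With one observation per cluster this collapses to HC0, which is a useful sanity check on any implementation.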

Connection to Cramér-Rao

The Cramér-Rao bound gives a lower bound on the variance of any unbiased estimator. Under the Gauss-Markov assumptions augmented with Gaussian errors $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the Fisher information matrix for $\beta$ is $\mathcal{I}(\beta) = X^T X / \sigma^2$, so the CR bound for any unbiased estimator of $\beta$ is

$$\text{Var}(\hat{\beta}) \succeq \mathcal{I}(\beta)^{-1} = \sigma^2 (X^T X)^{-1}.$$

This is exactly the OLS variance. Under Gaussian errors OLS is therefore not just BLUE. It is MVUE (minimum variance unbiased among all unbiased estimators, linear or nonlinear). Without Gaussianity, OLS remains BLUE but need not attain the CR bound; nonlinear unbiased estimators can have lower variance.

Biased Estimators

Ridge regression $\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$ is biased, so Gauss-Markov does not apply. But ridge can have lower mean squared error than OLS, especially when $X^T X$ is nearly singular (high multicollinearity).
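A small simulation sketch (NumPy, with a deliberately near-collinear design and an arbitrarily chosen $\lambda$) illustrates the bias-variance trade: ridge's bias is swamped by the variance explosion OLS suffers under collinearity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 2
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])  # nearly collinear
beta = np.array([1.0, 1.0])
lam = 1.0                                                # arbitrary ridge penalty

ols_err = ridge_err = 0.0
for _ in range(500):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    ols_err += np.sum((b_ols - beta) ** 2)
    ridge_err += np.sum((b_ridge - beta) ** 2)
```

This does not contradict Gauss-Markov: ridge left the competition the moment it accepted bias.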

Common Confusions

Watch Out

Gauss-Markov does NOT require Gaussian errors

The theorem says nothing about the distribution of $\epsilon$ beyond its first two moments (zero mean, constant variance, uncorrelated). The errors can follow any distribution. Gaussian errors are needed for the t-test and F-test to have exact finite-sample distributions, but not for Gauss-Markov itself.

Watch Out

BLUE does not mean best among ALL estimators

OLS is best among linear unbiased estimators. Biased estimators (ridge, lasso) or nonlinear estimators can do better in terms of mean squared error. The theorem is about a restricted competition.

Watch Out

Unbiasedness is not always desirable

The bias-variance tradeoff shows that a small amount of bias can dramatically reduce variance, lowering overall MSE. Gauss-Markov optimizes for zero bias, which is the right goal for inference but not always for prediction.

Summary

  • OLS is BLUE: best linear unbiased estimator under Gauss-Markov assumptions
  • Assumptions: $\mathbb{E}[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2 I$; no Gaussianity needed
  • Heteroscedasticity breaks optimality; use WLS or heteroscedasticity-robust standard errors
  • Correlated errors break optimality; use GLS or cluster-robust standard errors
  • Biased estimators (ridge) are outside the scope of the theorem but can beat OLS in MSE
  • BLUE is about a restricted competition: linear and unbiased only

Exercises

ExerciseCore

Problem

State the three Gauss-Markov assumptions. For each, give an example of a real dataset where that assumption might be violated.

ExerciseAdvanced

Problem

Prove that if $\tilde{\beta} = Cy$ is linear and unbiased for $\beta$, then $CX = I_p$. Start from the definition of unbiasedness.

ExerciseResearch

Problem

Consider the James-Stein estimator, which shrinks OLS estimates toward zero. It is biased but has strictly lower total MSE than OLS when $p \geq 3$. How does this not contradict Gauss-Markov?

References

Canonical:

  • Gauss (1821) and Markov (1912). The original results
  • Aitken (1935), On Least Squares and Linear Combinations of Observations, Proceedings of the Royal Society of Edinburgh. GLS as OLS on the $\Omega^{-1/2}$-transformed model
  • Greene, Econometric Analysis (2018), Chapter 4
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 3.2
  • Casella & Berger, Statistical Inference (2002), Section 11.3. Cramér-Rao and MVUE for the linear model

Robust standard errors:

  • White (1980), A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica
  • MacKinnon & White (1985), Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics. HC1, HC2, HC3
  • Long & Ervin (2000), Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model, The American Statistician. Recommends HC3 by default
  • Liang & Zeger (1986), Longitudinal Data Analysis Using Generalized Linear Models, Biometrika. Cluster-robust sandwich
  • Cameron & Miller (2015), A Practitioner's Guide to Cluster-Robust Inference, Journal of Human Resources

Current:

  • Hayashi, Econometrics (2000), Chapter 1. Clear modern treatment
  • Wooldridge, Econometric Analysis of Cross Section and Panel Data (2010). Modern panel and cluster-SE treatment
  • Stock & Watson, Introduction to Econometrics (2020). Applied perspective on robust SEs

Last reviewed: April 27, 2026
