

Gauss-Markov Theorem

Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is the "BLUE" result, and understanding when its assumptions fail is just as important as the theorem itself.


Why This Matters

Gauss–Markov: OLS is the orthogonal projection onto col(X) and the smallest-variance linear unbiased estimator

When you run a linear regression using ordinary least squares (OLS), you are making an implicit claim: this is the best way to estimate the coefficients. The Gauss-Markov theorem tells you exactly when that claim is justified, and when it is not.

Understanding Gauss-Markov is essential because it tells you the default choice (OLS) and the conditions under which you should deviate from it.

Quick Version

  • What it proves. Among all linear unbiased estimators, OLS has the smallest variance.
  • What “best” means. Best means variance-optimal, not universally optimal across biased or nonlinear estimators.
  • What assumptions matter. Zero-mean, homoscedastic, uncorrelated errors, with a fixed full-rank design matrix.
  • What it does not require. Gaussian noise is not needed for the theorem itself; normality only enters later for exact finite-sample inference.

Mental Model

You want to estimate the coefficients $\beta$ in a linear model. There are many possible linear estimators: weighted least squares, ridge regression (which is biased), and arbitrary linear combinations of the data. Among all estimators that are both linear and unbiased, OLS gives you the one with the smallest variance.

This is a strong optimality guarantee, but it comes with conditions. Break the conditions and the guarantee evaporates.

Formal Setup

Definition

Linear Regression Model

The linear regression model is:

$$y = X\beta + \epsilon$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and full column rank), $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\epsilon \in \mathbb{R}^n$ is the error vector.

Definition

Gauss-Markov Assumptions

The Gauss-Markov assumptions on the error vector $\epsilon$ are:

  1. Zero mean: $\mathbb{E}[\epsilon] = 0$
  2. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$
  3. Uncorrelated errors: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$

In matrix form: $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I$.

Note: Gaussian (normal) errors are not required.

Definition

OLS Estimator

The ordinary least squares estimator is:

$$\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$$

This minimizes the sum of squared residuals $\|y - X\beta\|^2$.
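As a quick numerical check (a minimal NumPy sketch on simulated data; all variable names are illustrative), the normal-equations formula agrees with a numerically stabler least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))          # fixed, full-column-rank design
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, prefer a QR/SVD-based solver for numerical stability.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes minimize the same residual sum of squares; they differ only in conditioning, which matters when $X^T X$ is nearly singular.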

Definition

Linear Unbiased Estimator

An estimator of the form $\tilde{\beta} = Cy$ is linear (a linear function of $y$); it is unbiased if $\mathbb{E}[\tilde{\beta}] = \beta$ for all $\beta$, which requires $CX = I_p$.

The Theorem

Theorem

Gauss-Markov Theorem

Statement

Under the Gauss-Markov assumptions, the OLS estimator $\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$ is BLUE, the Best Linear Unbiased Estimator. That is, for any other linear unbiased estimator $\tilde{\beta} = Cy$, the difference of covariance matrices is positive semidefinite in the Loewner ordering:

$$\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{\text{OLS}}) \succeq 0.$$

This is the substantive content of the theorem. Two corollaries follow directly: (1) coordinate-wise variance dominance, $\text{Var}(\tilde{\beta}_j) \geq \text{Var}(\hat{\beta}_{\text{OLS},j})$ for all $j$ (take $a = e_j$ in $a^\top \Sigma a \geq 0$, where $\Sigma$ is the covariance difference); (2) variance dominance for any linear contrast, $\text{Var}(a^\top \tilde{\beta}) \geq \text{Var}(a^\top \hat{\beta}_{\text{OLS}})$ for every $a \in \mathbb{R}^p$. The Loewner statement is strictly stronger than the coordinate-wise inequalities: a matrix with nonnegative diagonal entries need not be PSD, so coordinate dominance alone does not recover contrast dominance.

Intuition

OLS is the most efficient use of the data among all linear unbiased methods. Any other linear unbiased estimator must have equal or larger variance for every coefficient. You cannot do better without either (a) introducing bias, (b) using a nonlinear estimator, or (c) exploiting additional knowledge about the errors.

Proof Sketch

Let $\tilde{\beta} = Cy$ be any linear unbiased estimator. Write $C = (X^T X)^{-1} X^T + D$ for some matrix $D$. Unbiasedness ($CX = I$) forces $DX = 0$. Compute the variance:

$$\text{Var}(\tilde{\beta}) = \sigma^2 CC^T = \sigma^2 (X^T X)^{-1} + \sigma^2 DD^T$$

Since $DD^T \succeq 0$, the difference $\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 DD^T$ is positive semidefinite in the Loewner ordering. The cross term vanishes because $DX = 0$.
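The proof sketch can be checked numerically: any matrix $D$ whose rows lie in the orthogonal complement of $\text{col}(X)$ satisfies $DX = 0$, and the resulting estimator's covariance exceeds the OLS covariance by a PSD matrix. A minimal NumPy sketch (simulated design; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T                    # OLS weight matrix: beta_hat = C_ols @ y

# Build D with D X = 0 by projecting arbitrary rows onto col(X)-perp.
P = X @ XtX_inv @ X.T                    # hat matrix: projection onto col(X)
D = rng.normal(size=(p, n)) @ (np.eye(n) - P)

C = C_ols + D                            # another linear unbiased estimator
var_ols = sigma2 * C_ols @ C_ols.T       # = sigma^2 (X^T X)^{-1}
var_alt = sigma2 * C @ C.T               # = var_ols + sigma^2 D D^T

# The difference is PSD: every eigenvalue is (numerically) nonnegative.
eigs = np.linalg.eigvalsh(var_alt - var_ols)
```

The projection trick parameterizes the entire class of linear unbiased estimators, which is exactly the decomposition used in the proof.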

Why It Matters

This theorem justifies OLS as the default method for linear regression. If the Gauss-Markov assumptions hold, there is no reason to use any other linear unbiased estimator. This is why OLS is taught first and used most often.

Failure Mode

The theorem fails when: (1) errors are heteroscedastic ($\text{Var}(\epsilon_i) \neq \sigma^2$), (2) errors are correlated ($\text{Cov}(\epsilon_i, \epsilon_j) \neq 0$), or (3) you are willing to accept bias in exchange for lower variance (regularization). In all three cases, OLS is no longer optimal.

What BLUE Means

B = Best (minimum variance)

L = Linear (estimator is a linear function of $y$)

U = Unbiased ($\mathbb{E}[\hat{\beta}] = \beta$)

E = Estimator

The restriction to linear estimators is important. There may exist nonlinear unbiased estimators with lower variance. And biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance.

When Assumptions Fail

Heteroscedasticity

If $\text{Var}(\epsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Use weighted least squares (WLS):

$$\hat{\beta}_{\text{WLS}} = (X^T W X)^{-1} X^T W y$$

where $W = \text{diag}(1/\sigma_1^2, \ldots, 1/\sigma_n^2)$. WLS is BLUE under heteroscedastic errors.
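A small Monte Carlo sketch (NumPy, simulated heteroscedastic data with known weights; a toy setup, not a recipe) shows WLS beating OLS in sampling variance:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])
sigma_i = 0.2 * x                          # error sd grows with the regressor

w = 1.0 / sigma_i**2                       # WLS weights: inverse variances
XtW = X.T * w                              # X^T W with W = diag(w)

# Repeatedly redraw the errors and record both estimators.
ols_est = np.empty((2000, 2))
wls_est = np.empty((2000, 2))
for r in range(2000):
    y = X @ beta + rng.normal(scale=sigma_i)
    ols_est[r] = np.linalg.lstsq(X, y, rcond=None)[0]
    wls_est[r] = np.linalg.solve(XtW @ X, XtW @ y)

ols_var = ols_est.var(axis=0)
wls_var = wls_est.var(axis=0)              # WLS is BLUE here, so smaller
```

Both estimators stay unbiased; only their variances differ, which is the precise sense in which OLS loses "best" while keeping "linear unbiased."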

Heteroscedasticity-Robust Standard Errors

WLS requires knowing the variance pattern $\sigma_i^2$. In practice you rarely do. The applied workflow is to keep using OLS (which remains unbiased and consistent) but replace the textbook covariance estimate with a heteroscedasticity-consistent (HC) sandwich estimator. White (1980) showed that

$$\widehat{\text{Var}}_{\text{HC}}(\hat{\beta}_{\text{OLS}}) = (X^T X)^{-1} \Big(\sum_i \hat{\epsilon}_i^2 x_i x_i^T\Big) (X^T X)^{-1}$$

is consistent for the asymptotic variance of OLS under arbitrary heteroscedasticity. This is the HC0 estimator. Finite-sample variants (HC1 multiplies by $n/(n-p)$; HC2 and HC3 reweight the residuals by the hat-matrix leverages $h_{ii}$, with HC3 the most conservative) reduce small-sample bias; MacKinnon and White (1985) and Long and Ervin (2000) recommend HC3 as the default in samples below a few hundred. These let you report valid standard errors without committing to a specific WLS weighting.
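The sandwich formula above can be sketched in a few lines of NumPy. The `hc_se` helper below is hypothetical (in real work use a statistics library's robust-covariance options), but it implements HC0, HC1, and HC3 as defined in the text:

```python
import numpy as np

def hc_se(X, y, kind="HC1"):
    """OLS coefficients with heteroscedasticity-consistent (sandwich) SEs."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    if kind == "HC0":                      # White (1980): raw squared residuals
        u2 = resid**2
    elif kind == "HC1":                    # degrees-of-freedom correction n/(n-p)
        u2 = resid**2 * n / (n - p)
    elif kind == "HC3":                    # leverage-based, most conservative
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat-matrix diagonal h_ii
        u2 = resid**2 / (1.0 - h) ** 2
    else:
        raise ValueError(kind)
    meat = (X * u2[:, None]).T @ X         # sum_i u_i^2 x_i x_i^T
    cov = XtX_inv @ meat @ XtX_inv         # bread * meat * bread
    return beta_hat, np.sqrt(np.diag(cov))
```

Note that the point estimates are untouched; only the reported uncertainty changes, which is the whole appeal of the approach.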

Correlated Errors

If $\text{Var}(\epsilon) = \sigma^2 \Omega$ where $\Omega \neq I$, use generalized least squares (GLS):

$$\hat{\beta}_{\text{GLS}} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$$

GLS is BLUE under the general covariance structure; OLS ignores the correlations and is inefficient. Aitken's theorem (1935) gives the proof: pre-multiply the model by $\Omega^{-1/2}$ to get $\Omega^{-1/2} y = \Omega^{-1/2} X \beta + \Omega^{-1/2} \epsilon$, in which the transformed errors satisfy the original Gauss-Markov assumptions, so OLS on the transformed model is BLUE. That estimator is exactly $\hat{\beta}_{\text{GLS}}$.
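Aitken's whitening argument can be sketched numerically. Any square root of $\Omega$ works, so the sketch below uses a Cholesky factor $L$ (with $\Omega = LL^T$) in place of $\Omega^{-1/2}$; the error covariance is a hypothetical AR(1)-style example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -2.0])

# Hypothetical AR(1)-style error covariance: Omega_ij = rho^|i-j|.
rho = 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

L = np.linalg.cholesky(Omega)            # Omega = L L^T, so L^{-1} whitens
y = X @ beta + L @ rng.normal(size=n)    # errors with covariance Omega

# Aitken: OLS on the whitened model (L^{-1} y, L^{-1} X) is GLS.
Xs = np.linalg.solve(L, X)
ys = np.linalg.solve(L, y)
beta_gls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Matches the direct GLS formula.
Oi = np.linalg.inv(Omega)
beta_direct = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```

The Cholesky route is also how GLS is computed in practice: it avoids forming $\Omega^{-1}$ explicitly.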

Cluster-Robust Standard Errors

When observations cluster (panel data, students within schools, repeated measures), errors are correlated within cluster but uncorrelated across clusters. The cluster-robust sandwich estimator (Liang and Zeger, 1986) groups the meat of the sandwich by cluster:

$$\widehat{\text{Var}}_{\text{CR}}(\hat{\beta}_{\text{OLS}}) = (X^T X)^{-1} \Big(\sum_g X_g^T \hat{\epsilon}_g \hat{\epsilon}_g^T X_g\Big) (X^T X)^{-1}$$

where $g$ indexes clusters. This is the standard choice of standard error in panel econometrics; it requires many clusters to be valid asymptotically (Cameron and Miller, 2015).
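The per-cluster grouping of the meat can be sketched directly from the formula. The `cluster_se` helper below is hypothetical (production code should use a library's clustered-covariance option), written under the assumption that `groups` labels each row's cluster:

```python
import numpy as np

def cluster_se(X, y, groups):
    """OLS coefficients with cluster-robust (Liang-Zeger) standard errors."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(groups):
        m = groups == g
        sg = X[m].T @ resid[m]             # cluster score sum X_g^T e_hat_g
        meat += np.outer(sg, sg)           # X_g^T e_g e_g^T X_g
    cov = XtX_inv @ meat @ XtX_inv
    return beta_hat, np.sqrt(np.diag(cov))
```

With one observation per cluster this collapses to HC0, which is a useful sanity check on any implementation.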

Connection to Cramér-Rao

The Cramér-Rao bound gives a lower bound on the variance of any unbiased estimator. Under the Gauss-Markov assumptions augmented with Gaussian errors $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the Fisher information matrix for $\beta$ is $\mathcal{I}(\beta) = X^T X / \sigma^2$, so the CR bound for any unbiased estimator of $\beta$ is

$$\text{Var}(\hat{\beta}) \succeq \mathcal{I}(\beta)^{-1} = \sigma^2 (X^T X)^{-1}.$$

This is exactly the OLS variance. Under Gaussian errors OLS is therefore not just BLUE. It is MVUE (minimum variance unbiased among all unbiased estimators, linear or nonlinear). Without Gaussianity, OLS remains BLUE but need not attain the CR bound; nonlinear unbiased estimators can have lower variance.

Biased Estimators

Ridge regression $\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$ is biased, so Gauss-Markov does not apply. But ridge can have lower mean squared error than OLS, especially when $X^T X$ is nearly singular (high multicollinearity).
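A small simulation sketch (NumPy, with a deliberately near-collinear design and an arbitrarily chosen $\lambda$) illustrates the bias-variance trade: ridge's bias is swamped by the variance explosion OLS suffers under collinearity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 2
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])  # nearly collinear
beta = np.array([1.0, 1.0])
lam = 1.0                                                # arbitrary ridge penalty

ols_err = ridge_err = 0.0
for _ in range(500):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    ols_err += np.sum((b_ols - beta) ** 2)
    ridge_err += np.sum((b_ridge - beta) ** 2)
```

This does not contradict Gauss-Markov: ridge left the competition the moment it accepted bias.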

Common Confusions

Watch Out

Gauss-Markov does NOT require Gaussian errors

The theorem says nothing about the distribution of $\epsilon$ beyond its first two moments (zero mean, constant variance, uncorrelated). The errors can follow any distribution. Gaussian errors are needed for the t-test and F-test to have exact finite-sample distributions, but not for Gauss-Markov itself.

Watch Out

BLUE does not mean best among ALL estimators

OLS is best among linear unbiased estimators. Biased estimators (ridge, lasso) or nonlinear estimators can do better in terms of mean squared error. The theorem is about a restricted competition.

Watch Out

Unbiasedness is not always desirable

The bias-variance tradeoff shows that a small amount of bias can dramatically reduce variance, lowering overall MSE. Gauss-Markov optimizes for zero bias, which is the right goal for inference but not always for prediction.

Summary

  • OLS is BLUE: best linear unbiased estimator under Gauss-Markov assumptions
  • Assumptions: $\mathbb{E}[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2 I$; no Gaussianity needed
  • Heteroscedasticity breaks optimality; use WLS or heteroscedasticity-robust standard errors
  • Correlated errors break optimality; use GLS or cluster-robust standard errors
  • Biased estimators (ridge) are outside the scope of the theorem but can beat OLS in MSE
  • BLUE is about a restricted competition: linear and unbiased only

Exercises

ExerciseCore

Problem

State the three Gauss-Markov assumptions. For each, give an example of a real dataset where that assumption might be violated.

ExerciseAdvanced

Problem

Prove that if $\tilde{\beta} = Cy$ is linear and unbiased for $\beta$, then $CX = I_p$. Start from the definition of unbiasedness.

ExerciseResearch

Problem

Consider the James-Stein estimator, which shrinks OLS estimates toward zero. It is biased but has strictly lower total MSE than OLS when $p \geq 3$. How does this not contradict Gauss-Markov?

References

Canonical:

  • Gauss (1821) and Markov (1912). The original results
  • Aitken (1935), On Least Squares and Linear Combinations of Observations, Proceedings of the Royal Society of Edinburgh. GLS as OLS on the $\Omega^{-1/2}$-transformed model
  • Greene, Econometric Analysis (2018), Chapter 4
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 3.2
  • Casella & Berger, Statistical Inference (2002), Section 11.3. Cramér-Rao and MVUE for the linear model

Robust standard errors:

  • White (1980), A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica
  • MacKinnon & White (1985), Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics. HC1, HC2, HC3
  • Long & Ervin (2000), Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model, The American Statistician. Recommends HC3 by default
  • Liang & Zeger (1986), Longitudinal Data Analysis Using Generalized Linear Models, Biometrika. Cluster-robust sandwich
  • Cameron & Miller (2015), A Practitioner's Guide to Cluster-Robust Inference, Journal of Human Resources

Current:

  • Hayashi, Econometrics (2000), Chapter 1. Clear modern treatment
  • Wooldridge, Econometric Analysis of Cross Section and Panel Data (2010). Modern panel and cluster-SE treatment
  • Stock & Watson, Introduction to Econometrics (2020). Applied perspective on robust SEs

Last reviewed: April 27, 2026
