ML Methods
Gauss-Markov Theorem
Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is the BLUE ("best linear unbiased estimator") result, and understanding when its assumptions fail is just as important as the result itself.
Why This Matters
Geometrically, OLS is the orthogonal projection of $y$ onto $\operatorname{col}(X)$; the Gauss-Markov theorem adds that this projection is also the smallest-variance linear unbiased estimator.
When you run a linear regression using ordinary least squares (OLS), you are making an implicit claim: this is the best way to estimate the coefficients. The Gauss-Markov theorem tells you exactly when that claim is justified, and when it is not.
Understanding Gauss-Markov is essential because it tells you the default choice (OLS) and the conditions under which you should deviate from it.
Quick Version
- What it proves. Among all linear unbiased estimators, OLS has the smallest variance.
- What “best” means. Best means variance-optimal, not universally optimal across biased or nonlinear estimators.
- What assumptions matter. Zero-mean, homoscedastic, uncorrelated errors, with a fixed full-rank design matrix.
- What it does not require. Gaussian noise is not needed for the theorem itself; normality only enters later for exact finite-sample inference.
Mental Model
You want to estimate the coefficients in a linear model. There are many possible linear estimators: weighted least squares, ridge regression (which is biased), and arbitrary linear combinations of the data. Among all estimators that are both linear and unbiased, OLS gives you the one with the smallest variance.
This is a strong optimality guarantee, but it comes with conditions. Break the conditions and the guarantee evaporates.
Formal Setup
Linear Regression Model
The linear regression model is
$$y = X\beta + \varepsilon,$$
where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank), $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\varepsilon \in \mathbb{R}^n$ is the error vector.
Gauss-Markov Assumptions
The Gauss-Markov assumptions on the error vector $\varepsilon$ are:
- Zero mean: $\mathbb{E}[\varepsilon_i] = 0$ for all $i$
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$
- Uncorrelated errors: $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
In matrix form: $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$.
Note: Gaussian (normal) errors are not required.
OLS Estimator
The ordinary least squares estimator is
$$\hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y.$$
This minimizes the sum of squared residuals: $\|y - X\beta\|^2 = \sum_{i=1}^n (y_i - x_i^\top \beta)^2$.
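A minimal numpy sketch of the estimator above, on simulated data (the dimensions and coefficients are illustrative). It checks the normal-equations solution against `lstsq` and verifies the defining geometric property: the residual vector is orthogonal to $\operatorname{col}(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# OLS via the normal equations: (X'X) beta_hat = X'y.
# lstsq is the numerically preferred route; both agree here.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are orthogonal to col(X): X'(y - X beta_hat) = 0 up to roundoff.
residuals = y - X @ beta_hat
print(np.max(np.abs(X.T @ residuals)))
```

In practice, prefer `np.linalg.lstsq` (or a QR factorization) over explicitly inverting $X^\top X$; the normal-equations form is shown only because it matches the formula in the text.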
Linear Unbiased Estimator
An estimator $\tilde\beta = Ay$ is linear (it is a linear function of $y$) and unbiased if $\mathbb{E}[\tilde\beta] = \beta$ for all $\beta$. The unbiasedness condition requires $AX = I_p$.
The Theorem
Gauss-Markov Theorem
Statement
Under the Gauss-Markov assumptions, the OLS estimator $\hat\beta_{\text{OLS}}$ is BLUE, the Best Linear Unbiased Estimator. That is, for any other linear unbiased estimator $\tilde\beta$, the difference of covariance matrices is positive semidefinite in the Loewner ordering:
$$\operatorname{Var}(\tilde\beta) - \operatorname{Var}(\hat\beta_{\text{OLS}}) \succeq 0.$$
This is the substantive content of the theorem. Two corollaries follow directly: (1) coordinate-wise variance dominance, $\operatorname{Var}(\tilde\beta_j) \ge \operatorname{Var}(\hat\beta_{\text{OLS},j})$ for all $j$ (take $c = e_j$ in (2)); (2) variance dominance for any linear contrast, $\operatorname{Var}(c^\top \tilde\beta) \ge \operatorname{Var}(c^\top \hat\beta_{\text{OLS}})$ for every $c \in \mathbb{R}^p$. The Loewner statement is strictly stronger than the coordinate-wise inequalities: a matrix with nonnegative diagonal entries need not be PSD, so coordinate dominance alone does not recover contrast dominance.
Intuition
OLS is the most efficient use of the data among all linear unbiased methods. Any other linear unbiased estimator must have equal or larger variance for every coefficient. You cannot do better without either (a) introducing bias, (b) using a nonlinear estimator, or (c) exploiting additional knowledge about the errors.
Proof Sketch
Let $\tilde\beta = Cy$ be any linear unbiased estimator. Write $C = (X^\top X)^{-1} X^\top + D$ for some $p \times n$ matrix $D$. Unbiasedness ($CX = I_p$) forces $DX = 0$. Compute the variance:
$$\operatorname{Var}(\tilde\beta) = \sigma^2 C C^\top = \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top.$$
Since $DD^\top \succeq 0$, the difference $\operatorname{Var}(\tilde\beta) - \operatorname{Var}(\hat\beta_{\text{OLS}}) = \sigma^2 DD^\top$ is positive semidefinite in the Loewner ordering. The cross term vanishes because $DX = 0$.
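The proof sketch can be checked numerically. The sketch below (illustrative dimensions, random design) builds a competing linear unbiased estimator $C = A + D$ with $DX = 0$ by projecting a random matrix off $\operatorname{col}(X)$, then confirms the variance gap is exactly $\sigma^2 DD^\top$ and hence PSD.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                      # OLS weight matrix
H = X @ A                              # hat matrix: projection onto col(X)

# Any D with D X = 0 yields another linear unbiased estimator C = A + D.
D = rng.normal(size=(p, n)) @ (np.eye(n) - H)
C = A + D

sigma2 = 1.0
var_ols = sigma2 * XtX_inv
var_alt = sigma2 * C @ C.T
# The gap equals sigma^2 D D', so its eigenvalues are >= 0 up to roundoff.
print(np.linalg.eigvalsh(var_alt - var_ols).min())
```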
Why It Matters
This theorem justifies OLS as the default method for linear regression. If the Gauss-Markov assumptions hold, there is no reason to use any other linear unbiased estimator. This is why OLS is taught first and used most often.
Failure Mode
The theorem fails when: (1) errors are heteroscedastic ($\operatorname{Var}(\varepsilon_i) = \sigma_i^2$ varies with $i$), (2) errors are correlated ($\operatorname{Cov}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$), or (3) you are willing to accept bias in exchange for lower variance (regularization). In all three cases, OLS is no longer optimal.
What BLUE Means
B = Best (minimum variance)
L = Linear (estimator is a linear function of $y$)
U = Unbiased ($\mathbb{E}[\hat\beta] = \beta$)
E = Estimator
The restriction to linear estimators is important. There may exist nonlinear unbiased estimators with lower variance. And biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance.
When Assumptions Fail
Heteroscedasticity
If $\operatorname{Var}(\varepsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Use weighted least squares (WLS):
$$\hat\beta_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y,$$
where $W = \operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2)$. WLS is BLUE under heteroscedastic errors.
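A sketch of WLS on simulated heteroscedastic data, assuming the variance pattern $\sigma_i^2$ is known (the data-generating choices are illustrative). It also checks the equivalent view: WLS is OLS on the model with each row rescaled by $1/\sigma_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(1, 5, size=n)])
beta = np.array([1.0, 0.5])
sigma_i = 0.2 * X[:, 1]                 # error sd grows with the regressor
y = X @ beta + rng.normal(size=n) * sigma_i

# WLS: weight each observation by 1/sigma_i^2.
W = np.diag(1.0 / sigma_i**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent view: OLS after dividing each row of (X, y) by sigma_i.
Xs, ys = X / sigma_i[:, None], y / sigma_i
beta_rescaled, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(beta_wls)
```

The rescaled-model view is exactly the mechanism behind Aitken's theorem discussed below for general covariance structures: transform until the errors satisfy Gauss-Markov, then run OLS.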
Heteroscedasticity-Robust Standard Errors
WLS requires knowing the variance pattern $\sigma_i^2$. In practice you rarely do. The applied workflow is to keep using OLS (which remains unbiased and consistent) but replace the textbook covariance estimate with a heteroscedasticity-consistent (HC) sandwich estimator. White (1980) showed that
$$\widehat{\operatorname{Var}}_{\text{HC0}}(\hat\beta) = (X^\top X)^{-1} \left( \sum_{i=1}^n \hat\varepsilon_i^2\, x_i x_i^\top \right) (X^\top X)^{-1}$$
is consistent for the asymptotic variance of OLS under arbitrary heteroscedasticity. This is the HC0 estimator. Finite-sample variants (HC1 multiplies by $n/(n-p)$; HC2 and HC3 reweight the squared residuals by the hat-matrix leverages $h_{ii}$, dividing by $1 - h_{ii}$ and $(1 - h_{ii})^2$ respectively, with HC3 the most conservative) reduce small-sample bias; MacKinnon and White (1985) and Long and Ervin (2000) recommend HC3 as the default in samples below a few hundred. These let you report valid standard errors without committing to a specific WLS weighting.
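The HC family can be sketched in a few lines of numpy. The helper `hc_covariance` below is a hypothetical name implementing the sandwich formulas above (in practice you would use a library such as statsmodels):

```python
import numpy as np

def hc_covariance(X, y, kind="HC3"):
    """White-style sandwich covariance for OLS under heteroscedasticity."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat                          # OLS residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages h_ii = x_i' (X'X)^-1 x_i
    if kind == "HC0":
        w = e**2
    elif kind == "HC1":
        w = e**2 * n / (n - p)
    elif kind == "HC2":
        w = e**2 / (1 - h)
    elif kind == "HC3":
        w = e**2 / (1 - h)**2
    meat = (X * w[:, None]).T @ X                 # sum_i w_i x_i x_i'
    return XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))
se_hc3 = np.sqrt(np.diag(hc_covariance(X, y, "HC3")))
print(se_hc3)
```

Note that the diagonal entries satisfy HC3 $\ge$ HC2 $\ge$ HC0, reflecting the increasingly conservative leverage corrections.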
Correlated Errors
If $\operatorname{Cov}(\varepsilon) = \sigma^2 \Omega$ where $\Omega \neq I_n$, use generalized least squares (GLS):
$$\hat\beta_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y.$$
GLS is BLUE under the general covariance structure; OLS ignores the correlations and is inefficient. Aitken's theorem (1935) is the proof: pre-multiply the model by $\Omega^{-1/2}$ to get $\Omega^{-1/2} y = \Omega^{-1/2} X \beta + \Omega^{-1/2} \varepsilon$, in which the transformed errors satisfy the original Gauss-Markov assumptions, and OLS on the transformed model is BLUE. That OLS estimator is exactly $\hat\beta_{\text{GLS}}$.
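Aitken's argument translates directly into code. The sketch below (AR(1)-style $\Omega$ chosen purely for illustration) whitens with the Cholesky factor $\Omega = LL^\top$, which plays the role of $\Omega^{1/2}$, and checks that OLS on the whitened model reproduces the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# AR(1)-style correlation matrix as an example Omega.
rho = 0.7
Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Omega)           # Omega = L L'

# Simulate correlated errors and the response.
y = X @ np.array([1.0, -1.0]) + L @ rng.normal(size=n)

# Direct GLS formula.
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Aitken's view: whiten with L^{-1}, then run plain OLS.
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_whitened, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(beta_gls)
```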
Cluster-Robust Standard Errors
When observations cluster (panel data, students within schools, repeated measures), errors are correlated within cluster but uncorrelated across clusters. The cluster-robust sandwich estimator (Liang and Zeger, 1986) groups the meat of the sandwich by cluster:
$$\widehat{\operatorname{Var}}(\hat\beta) = (X^\top X)^{-1} \left( \sum_g X_g^\top \hat\varepsilon_g \hat\varepsilon_g^\top X_g \right) (X^\top X)^{-1},$$
where $g$ indexes clusters, $X_g$ stacks the rows of $X$ in cluster $g$, and $\hat\varepsilon_g$ collects the corresponding residuals. This is the standard choice of standard error in panel econometrics; it requires many clusters to be valid asymptotically (Cameron and Miller, 2015).
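A sketch of the cluster-grouped meat on simulated grouped data (cluster sizes and the random-effect error structure are illustrative; `cluster_robust_cov` is a hypothetical helper name):

```python
import numpy as np

def cluster_robust_cov(X, y, clusters):
    """Liang-Zeger cluster-robust sandwich covariance for OLS."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(clusters):
        idx = clusters == g
        s = X[idx].T @ e[idx]          # per-cluster score sum X_g' e_g
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(5)
G, m = 40, 10                           # 40 clusters of 10 observations
n = G * m
clusters = np.repeat(np.arange(G), m)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# A cluster-level random effect induces within-cluster error correlation.
u = np.repeat(rng.normal(size=G), m)
y = X @ np.array([1.0, 0.5]) + u + rng.normal(size=n)
se = np.sqrt(np.diag(cluster_robust_cov(X, y, clusters)))
print(se)
```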
Connection to Cramér-Rao
The Cramér-Rao bound gives a lower bound on the variance of any unbiased estimator. Under the Gauss-Markov assumptions augmented with Gaussian errors $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, the Fisher information matrix for $\beta$ is $\mathcal{I}(\beta) = X^\top X / \sigma^2$, so the CR bound for any unbiased estimator $\tilde\beta$ of $\beta$ is
$$\operatorname{Var}(\tilde\beta) \succeq \mathcal{I}(\beta)^{-1} = \sigma^2 (X^\top X)^{-1}.$$
This is exactly the OLS variance. Under Gaussian errors OLS is therefore not just BLUE. It is MVUE (minimum variance unbiased among all unbiased estimators, linear or nonlinear). Without Gaussianity, OLS remains BLUE but need not attain the CR bound; nonlinear unbiased estimators can have lower variance.
Biased Estimators
Ridge regression is biased, so Gauss-Markov does not apply. But ridge can have lower mean squared error than OLS, especially when $X^\top X$ is nearly singular (high multicollinearity).
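A small simulation makes the tradeoff concrete. With two nearly collinear columns, OLS has enormous variance along the unstable direction, while ridge (penalty $\lambda$ chosen purely for illustration) accepts a little bias for a large variance reduction.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 2
z = rng.normal(size=n)
# Two nearly collinear columns -> X'X close to singular.
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])
beta = np.array([1.0, 1.0])

lam = 1.0  # illustrative ridge penalty
reps = 500
mse_ols = mse_ridge = 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse_ols += np.sum((b_ols - beta) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta) ** 2) / reps
print(mse_ridge < mse_ols)  # the biased estimator wins on MSE here
```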
Common Confusions
Gauss-Markov does NOT require Gaussian errors
The theorem says nothing about the distribution of $\varepsilon$ beyond its first two moments (zero mean, constant variance, uncorrelated). The errors can follow any distribution. Gaussian errors are needed for the F-test and t-test to have exact distributions, but not for Gauss-Markov itself.
BLUE does not mean best among ALL estimators
OLS is best among linear unbiased estimators. Biased estimators (ridge, lasso) or nonlinear estimators can do better in terms of mean squared error. The theorem is about a restricted competition.
Unbiasedness is not always desirable
The bias-variance tradeoff shows that a small amount of bias can dramatically reduce variance, lowering overall MSE. Gauss-Markov optimizes for zero bias, which is the right goal for inference but not always for prediction.
Summary
- OLS is BLUE: best linear unbiased estimator under Gauss-Markov assumptions
- Assumptions: $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$; no Gaussianity needed
- Heteroscedasticity breaks optimality; use WLS
- Correlated errors break optimality; use GLS
- Biased estimators (ridge) are outside the scope of the theorem but can beat OLS in MSE
- BLUE is about a restricted competition: linear and unbiased only
Exercises
Problem
State the three Gauss-Markov assumptions. For each, give an example of a real dataset where that assumption might be violated.
Problem
Prove that if $\tilde\beta = Ay$ is linear and unbiased for $\beta$, then $AX = I_p$. Start from the definition of unbiasedness.
Problem
Consider the James-Stein estimator, which shrinks OLS estimates toward zero. It is biased but has strictly lower total MSE than OLS when $p \geq 3$. How does this not contradict Gauss-Markov?
References
Canonical:
- Gauss (1821) and Markov (1912). The original results
- Aitken (1935), On Least Squares and Linear Combinations of Observations, Proceedings of the Royal Society of Edinburgh. GLS as OLS on the $\Omega^{-1/2}$-transformed model
- Greene, Econometric Analysis (2018), Chapter 4
- Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 3.2
- Casella & Berger, Statistical Inference (2002), Section 11.3. Cramér-Rao and MVUE for the linear model
Robust standard errors:
- White (1980), A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica
- MacKinnon & White (1985), Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics. HC1, HC2, HC3
- Long & Ervin (2000), Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model, The American Statistician. Recommends HC3 by default
- Liang & Zeger (1986), Longitudinal Data Analysis Using Generalized Linear Models, Biometrika. Cluster-robust sandwich
- Cameron & Miller (2015), A Practitioner's Guide to Cluster-Robust Inference, Journal of Human Resources
Current:
- Hayashi, Econometrics (2000), Chapter 1. Clear modern treatment
- Wooldridge, Econometric Analysis of Cross Section and Panel Data (2010). Modern panel and cluster-SE treatment
- Stock & Watson, Introduction to Econometrics (2020). Applied perspective on robust SEs
Last reviewed: April 27, 2026