
ML Methods

Linear Regression

Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.

Core · Tier 1 · Stable · Core spine · ~50 min

Why This Matters

theorem visual

XOR is not linearly separable

Each of the three candidate separating lines on the left misclassifies at least two of the four points. A 2-hidden-unit ReLU network on the right places both positive-class points (0,1) and (1,0) inside the strip 0.5 < x_1 + x_2 < 1.5 and both negative points outside it.

[Figure, left panel "no linear separator works": the four XOR corner points with three candidate lines, x_1 = 0.5 (2 wrong), x_2 = 0.5 (2 wrong), and x_1 + x_2 = 1 (3 wrong). Right panel "2-hidden-unit ReLU bends a strip": the same points with the strip bounded by x_1 + x_2 = 0.5 and x_1 + x_2 = 1.5. Legend: y = 1 (positive class), y = 0 (negative class).]

The OLS fit on this dataset, with intercept, is \hat{\beta} = (1/2,\, 0,\, 0)^\top — the model predicts the mean 1/2 at every point. Without intercept, \hat{w} = (1/3,\, 1/3)^\top, predicting 0,\, 1/3,\, 1/3,\, 2/3 on the four corners. Both confirm: no linear separator exists.

Linear regression is the most fundamental supervised learning method. Every idea in regression (projection, residuals, bias-variance, regularization) generalizes directly to more complex models. If you understand linear regression at the level of linear algebra — not just "fit a line" — you have the skeleton key for half of statistical learning.

The right way to think about OLS is not as a trend line drawn through points on a 2D scatterplot. The right picture is higher-dimensional: the columns of the design matrix X span a subspace of \mathbb{R}^n, the response vector y lives somewhere in that ambient space, and the fitted response \hat y is the orthogonal projection of y onto \operatorname{col}(X). Everything else follows from that projection picture: the normal equations, the hat matrix, residual orthogonality, leverage, and the variance formula.

That is why linear regression remains foundational even when nobody actually cares about straight lines. The same ideas reappear in generalized linear models, kernel methods, ridge regression, Gaussian process regression, and modern overparameterized interpolation theory.

Quick Version

  • Optimization view. OLS picks the weight vector minimizing squared residuals.
  • Geometry view. The fitted response \hat y is the orthogonal projection of y onto the column space of X.
  • Algebra view. The normal equations X^\top X \hat w = X^\top y are the first-order condition for that projection problem.
  • Statistics view. Under the Gauss-Markov assumptions, OLS is the best linear unbiased estimator, and under Gaussian noise it is also the maximum likelihood estimator.

Mental Model

You have data points in a high-dimensional space. The columns of your design matrix X span a subspace. OLS finds the point in that subspace closest to your target vector y. This is an orthogonal projection. The residuals are the component of y orthogonal to the column space of X.

Formal Setup

We observe n input-output pairs. Stack the inputs into a design matrix X \in \mathbb{R}^{n \times d} and responses into y \in \mathbb{R}^n. We seek a weight vector w \in \mathbb{R}^d minimizing the sum of squared residuals.

Definition

Ordinary Least Squares

The OLS estimator minimizes the squared loss:

\hat{w}_{\text{OLS}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2

Setting the gradient to zero yields the normal equations X^\top X \hat{w} = X^\top y. When X^\top X is invertible, the closed form is

\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y.

The derivation is below.
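
A minimal NumPy sketch of the closed form (synthetic data; the variable names and dimensions are illustrative choices, not from the text): it solves the normal equations directly, compares against numpy.linalg.lstsq, and checks the residual orthogonality X^\top e = 0 established in the derivation below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) w = X^T y.  Prefer solve() over forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver (numerically safer for ill-conditioned X).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))                        # True: same estimator
print(np.allclose(X.T @ (y - X @ w_normal), 0, atol=1e-8))   # residual orthogonality X^T e = 0
```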

Deriving the Normal Equations

The OLS objective L(w) = \|y - Xw\|_2^2 is a quadratic function of w. The derivation is short enough to walk through entirely.

Step 1: Expand the squared norm. Using \|v\|_2^2 = v^\top v,

L(w) = (y - Xw)^\top (y - Xw) = y^\top y - 2\, w^\top X^\top y + w^\top X^\top X\, w.

The cross term -2\, w^\top X^\top y uses the fact that y^\top X w = (X w)^\top y = w^\top X^\top y, so the two cross terms combine into a single term with coefficient -2.

Step 2: Compute the gradient. Standard matrix-calculus identities give \nabla_w (a^\top w) = a and \nabla_w (w^\top A w) = 2 A w when A is symmetric (and X^\top X is symmetric by construction). So

\nabla_w L(w) = -2\, X^\top y + 2\, X^\top X\, w.

Step 3: Set the gradient to zero. A first-order optimum satisfies \nabla_w L(\hat w) = 0:

X^\top X\, \hat w = X^\top y. \qquad \text{(Normal equations.)}

The objective is convex in w (its Hessian 2 X^\top X is positive semidefinite), so any first-order optimum is a global minimum.

Step 4: Verify and invert (when full rank). If X has full column rank, X^\top X is positive definite (not just semidefinite) and therefore invertible. Multiplying both sides by (X^\top X)^{-1} gives the closed form

\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y.

The matrix X^+ \equiv (X^\top X)^{-1} X^\top is the Moore-Penrose pseudoinverse of X in the full-rank case; the closed form is \hat{w}_{\text{OLS}} = X^+ y. When X is rank-deficient, X^\top X is singular and the same pseudoinverse formula extends via the SVD X = U \Sigma V^\top to X^+ = V \Sigma^+ U^\top, where \Sigma^+ inverts the nonzero singular values and zeros out the rest. This is the minimum-norm OLS solution.

Geometric reading. X^\top \hat e = 0 — the residual is orthogonal to every column of X. That is the entire content of OLS in one line: the residual is perpendicular to the column space. The normal equations are not a separate algebraic fact; they are the orthogonality condition rewritten in matrix form. See the next section for the projection picture.

Definition

Hat Matrix

The hat matrix (or projection matrix) is:

H = X(X^\top X)^{-1} X^\top

It projects y onto the column space of X: the fitted values are \hat{y} = Hy. The matrix "puts the hat on y."

Key properties: H is symmetric and idempotent (H^2 = H), \text{tr}(H) = d, and its eigenvalues are all 0 or 1.

Definition

Residuals

The residual vector is:

e = y - \hat{y} = (I - H)y

By the normal equations, X^\top e = 0. Residuals are orthogonal to every column of X. This is the geometric content of OLS.

Geometric Interpretation: Projection onto the Column Space

OLS picks ŷ ∈ col(X) closest to y. The residual e = y − ŷ is orthogonal to col(X) — that is the entire content of the normal equations Xᵀe = 0.

[Figure "OLS as projection": the amber plane is col(X), a d-dimensional subspace of ℝⁿ; y sits in ℝⁿ, ŷ is the closest point to y inside the plane, and the residual e = y − ŷ is drawn perpendicular to the plane. Annotations: hat-matrix form ŷ = X(XᵀX)⁻¹Xᵀy = Hy; normal equations Xᵀe = 0, e ⊥ every column of X; Pythagorean/ANOVA identity ‖y‖² = ‖ŷ‖² + ‖e‖².]

The OLS solution has a direct geometric meaning. The column space of X, denoted \text{col}(X), is a d-dimensional subspace of \mathbb{R}^n. The fitted vector \hat{y} = Xw must lie in \text{col}(X).

OLS finds the point in \text{col}(X) closest to y in Euclidean distance. By the projection theorem in inner product spaces, this is the orthogonal projection of y onto \text{col}(X). The residual e = y - \hat{y} is perpendicular to \text{col}(X), which is exactly the statement X^\top e = 0.

The hat matrix H = X(X^\top X)^{-1}X^\top is the orthogonal projection matrix onto \text{col}(X). Its key properties follow directly from the geometry:

  • Idempotent: H^2 = H. Projecting twice is the same as projecting once.
  • Symmetric: H = H^\top. Orthogonal projections are self-adjoint.
  • Eigenvalues 0 or 1: Vectors in \text{col}(X) satisfy Hv = v; vectors in \text{col}(X)^\perp satisfy Hv = 0.
  • Trace: \text{tr}(H) = d, the dimension of the projected subspace.

The complementary projector I - H projects onto \text{col}(X)^\perp, the orthogonal complement. Residuals satisfy e = (I - H)y, so e \perp \hat{y} and the Pythagorean theorem gives \|y\|^2 = \|\hat{y}\|^2 + \|e\|^2, which is the ANOVA decomposition of total sum of squares.

The matrix H is positive semidefinite with eigenvalues in \{0, 1\}; the diagonal entry h_{ii} (the leverage of observation i) satisfies 0 \leq h_{ii} \leq 1 and measures how much the i-th observation "controls" its own fitted value. An observation with h_{ii} = 1 would exactly determine its own prediction regardless of other observations.
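
A short numerical check of these properties, under the assumption of a random full-rank design (NumPy; the dimensions are illustrative): it builds H, verifies symmetry, idempotence, the trace identity, the leverage bounds, and the Pythagorean/ANOVA identity from the previous paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.normal(size=(n, d))

# H = X (X^T X)^{-1} X^T, built without forming an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, H.T))          # symmetric
print(np.allclose(H @ H, H))        # idempotent: projecting twice = projecting once
print(np.isclose(np.trace(H), d))   # trace equals the subspace dimension d

leverages = np.diag(H)
print(leverages.min(), leverages.max())   # all leverages lie in [0, 1]

# Pythagorean / ANOVA identity on a synthetic response.
y = X @ rng.normal(size=d) + rng.normal(size=n)
y_hat = H @ y
e = y - y_hat
print(np.isclose(y @ y, y_hat @ y_hat + e @ e))   # ||y||^2 = ||y_hat||^2 + ||e||^2
```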

When X^\top X is not invertible (rank-deficient design matrix), the Moore-Penrose pseudoinverse X^+ replaces (X^\top X)^{-1}X^\top, and H = XX^+ is still the orthogonal projection onto the column space of X; its dimension is now \operatorname{rank}(X) < d.

Ridge Regression as Regularized OLS

Definition

Ridge Regression

Ridge regression adds an \ell_2 penalty:

\hat{w}_{\text{ridge}} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2

The closed-form solution is:

\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y

The addition of \lambda I ensures invertibility and shrinks the estimate toward zero, trading bias for reduced variance.
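
A small sketch of the shrinkage behavior (synthetic data; the dimensions and \lambda grid are illustrative): the ridge closed form is evaluated for increasing \lambda, the norm of the estimate decreases monotonically toward zero, and \lambda \to 0 recovers OLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge(X, y, lam)
    print(f"lambda={lam:7.1f}   ||w_ridge|| = {np.linalg.norm(w):.4f}")
# The printed norms shrink as lambda grows; lambda = 0 reproduces the OLS estimate.
```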

Main Theorems

Theorem

Gauss-Markov Theorem

Statement

Under the linear model y = Xw + \varepsilon where \mathbb{E}[\varepsilon] = 0 and \text{Var}(\varepsilon) = \sigma^2 I, the OLS estimator \hat{w}_{\text{OLS}} is the Best Linear Unbiased Estimator (BLUE). That is, among all unbiased estimators that are linear in y, OLS has the smallest variance (in the matrix sense):

\text{Var}(\tilde{w}) - \text{Var}(\hat{w}_{\text{OLS}}) \succeq 0

for any other linear unbiased estimator \tilde{w}.

Intuition

OLS is not just an unbiased linear estimator. It is the best one. You cannot reduce variance by using a different linear combination of y without introducing bias. If you want lower variance, you must either accept bias (ridge regression) or use nonlinear methods.

Proof Sketch

Let \tilde{w} = Cy be any linear unbiased estimator. Unbiasedness requires CX = I. Write C = (X^\top X)^{-1}X^\top + D; then DX = CX - I = 0. Then

\operatorname{Var}(\tilde{w}) = \sigma^2 C C^\top = \sigma^2(X^\top X)^{-1} + \sigma^2 DD^\top \succeq \sigma^2(X^\top X)^{-1} = \operatorname{Var}(\hat{w}_{\text{OLS}}),

where the cross terms vanish because (X^\top X)^{-1}X^\top D^\top = (X^\top X)^{-1}(DX)^\top = 0, and DD^\top is positive semidefinite. Equality holds iff D = 0, i.e. \tilde{w} = \hat{w}_{\text{OLS}}.
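
The proof sketch can also be checked by simulation. The snippet below is a sketch under stated assumptions (fixed synthetic design, Gaussian noise); the construction D = G(I - H) is one convenient way, not the only one, to satisfy DX = 0. It builds a competing linear unbiased estimator and compares sampled variances against OLS.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 40, 3, 1.0
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])

A = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T: the OLS map
H = X @ A
G = 0.3 * rng.normal(size=(d, n))
D = G @ (np.eye(n) - H)                # D X = 0 by construction
C = A + D                              # a competing linear unbiased estimator (C X = I)

ols_draws, alt_draws = [], []
for _ in range(20000):
    y = X @ w_true + sigma * rng.normal(size=n)
    ols_draws.append(A @ y)
    alt_draws.append(C @ y)

print(np.mean(ols_draws, axis=0), np.mean(alt_draws, axis=0))   # both ~ w_true: unbiased
print(np.var(ols_draws, axis=0).sum(), np.var(alt_draws, axis=0).sum())
# the competitor's total variance exceeds OLS's, as the theorem predicts
```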

Why It Matters

The Gauss-Markov theorem tells you exactly what you give up by regularizing. Ridge regression is biased, so it falls outside the Gauss-Markov scope. But the bias-variance tradeoff can still make it better in terms of mean squared error. The theorem defines the frontier of what unbiased linear estimation can achieve.

Failure Mode

If errors are heteroscedastic or correlated (so the assumption \operatorname{Var}(\varepsilon) = \sigma^2 I fails), OLS is no longer BLUE. Generalized least squares (GLS) reclaims optimality by accounting for the error covariance structure.

Connection to Maximum Likelihood

Under the Gaussian noise model y = Xw + \varepsilon with \varepsilon \sim \mathcal{N}(0, \sigma^2 I), the log-likelihood is:

\log p(y \mid X, w, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - Xw\|_2^2

Maximizing over w is equivalent to minimizing \|y - Xw\|_2^2, which is exactly OLS. So OLS = MLE under Gaussian noise. This also means ridge regression corresponds to MAP estimation with a Gaussian prior on w, and the regularization strength \lambda is the inverse prior variance up to a \sigma^2 factor.
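
A quick numerical confirmation, assuming Gaussian noise and using scipy.optimize for the maximization (an implementation choice, not part of the text): the joint MLE in (w, \sigma^2) matches the OLS weights and \hat{\sigma}^2_{\text{MLE}} = \|y - X\hat{w}\|^2 / n up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 80, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, 0.0, -1.0]) + 0.7 * rng.normal(size=n)

def neg_log_likelihood(params):
    # Parameterize sigma^2 through its log so the optimizer stays unconstrained.
    w, log_s2 = params[:d], params[d]
    s2 = np.exp(log_s2)
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * resid @ resid / s2

res = minimize(neg_log_likelihood, x0=np.zeros(d + 1), method="BFGS")

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_mle = np.sum((y - X @ w_ols) ** 2) / n

print(np.allclose(res.x[:d], w_ols, atol=1e-3))              # MLE weights match OLS
print(np.isclose(np.exp(res.x[d]), sigma2_mle, atol=1e-3))   # MLE sigma^2 = ||e||^2 / n
```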

Bias-Variance Tradeoff: A Numeric Walkthrough

The OLS estimator is unbiased under the Gauss-Markov assumptions, but its variance can be uncomfortably large when the design is poorly conditioned or when n is close to d. Ridge regression trades a controlled amount of bias for a substantial variance reduction. The numbers below make this concrete.

Setup. n = 50 observations, d = 40 features, true coefficients w^* with \|w^*\|^2 = 4, Gaussian noise \varepsilon \sim \mathcal{N}(0, \sigma^2 I) with \sigma = 1. Use a design with orthogonal columns scaled so that X^\top X = n I, which keeps the formulas clean (real designs with n close to d behave qualitatively the same way).

Closed forms for the ridge estimator. With this design,

\hat{w}_\lambda = (n + \lambda)^{-1}\bigl(n\, w^* + X^\top \varepsilon\bigr), \quad \mathbb{E}[\hat{w}_\lambda] = \frac{n}{n + \lambda}\, w^*.

  • Bias squared: \|w^* - \mathbb{E}[\hat{w}_\lambda]\|^2 = \bigl(\lambda/(n + \lambda)\bigr)^2 \|w^*\|^2.
  • Variance: \mathbb{E}\|\hat{w}_\lambda - \mathbb{E}[\hat{w}_\lambda]\|^2 = n\sigma^2 d / (n + \lambda)^2.
  • Mean squared error: bias² + variance.

Optimal \lambda^* from \mathrm{d}\,\mathrm{MSE}/\mathrm{d}\lambda = 0: \lambda^* = \sigma^2 d / \|w^*\|^2 = 40 / 4 = 10.

Numeric ledger. Plugging the closed forms with n = 50, d = 40, \sigma^2 = 1, \|w^*\|^2 = 4:

| λ   | Bias² | Variance | MSE   | Note                                            |
|-----|-------|----------|-------|-------------------------------------------------|
| 0   | 0.000 | 0.800    | 0.800 | OLS — unbiased, all variance                    |
| 1   | 0.002 | 0.769    | 0.770 | tiny bias, slightly lower MSE                   |
| 5   | 0.033 | 0.661    | 0.694 | within 5% of optimal                            |
| 10  | 0.111 | 0.556    | 0.667 | λ* = σ²d/‖w*‖² — minimum MSE                    |
| 25  | 0.444 | 0.356    | 0.800 | matches OLS MSE; bias dominates                 |
| 50  | 1.000 | 0.200    | 1.200 | over-shrunk; variance reduction no longer pays  |
| 100 | 1.778 | 0.089    | 1.867 | strong shrinkage; MSE much worse than OLS       |

Reading the table. OLS sits at \lambda = 0 with zero bias and MSE 0.800. The optimal ridge shrinkage at \lambda^* = 10 accepts a bias-squared cost of 0.111 to drop variance from 0.800 to 0.556, netting a 17% MSE improvement. Beyond \lambda^* the bias grows quadratically and ridge eventually becomes worse than OLS. The \lambda = 25 row is the equal-MSE point on the bias-heavy side; past that, you are paying for shrinkage you no longer need.

The same shape — a U-curve in MSE with its minimum at some \lambda^* > 0 — appears in non-orthonormal designs and is the bias-variance tradeoff in action. The orthonormal case is special because \lambda^* has a closed form; in practice you locate it by cross-validation, but the structural picture is the same.
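
A Monte Carlo sketch of the ledger above. The construction of X and the particular w^* are illustrative choices consistent with the stated setup (any w^* with \|w^*\|^2 = 4 gives the same numbers); the simulated MSEs should land within roughly a percent of the closed-form column.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 50, 40, 1.0
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns
X = np.sqrt(n) * Q                              # scaled so that X^T X = n I
w_star = np.full(d, 2.0 / np.sqrt(d))           # ||w*||^2 = 4

for lam in [0, 1, 5, 10, 25, 50, 100]:
    errs = []
    for _ in range(2000):
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat = X.T @ y / (n + lam)             # ridge closed form when X^T X = n I
        errs.append(np.sum((w_hat - w_star) ** 2))
    bias2 = (lam / (n + lam)) ** 2 * 4.0
    var = n * sigma**2 * d / (n + lam) ** 2
    print(f"lam={lam:3d}  MC MSE={np.mean(errs):.3f}  formula={bias2 + var:.3f}")
```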

Canonical Examples

Example

Simple linear regression in 2D

With a single feature and intercept, X = [\mathbf{1}, x] where x is the feature vector. The normal equations give the familiar slope and intercept:

\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The hat matrix H has diagonal entries h_{ii} called leverages. Points with high leverage (far from \bar{x}) have outsized influence on the fit.
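
A short sketch (illustrative data) confirming that the scalar slope/intercept formulas agree with the matrix solution, and that the highest-leverage point is the one farthest from \bar{x}.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.7 * x + rng.normal(scale=2.0, size=50)

# Textbook formulas.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Matrix solution on the design [1, x].
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose([beta0, beta1], coef))      # True

# Leverage is largest for the observation farthest from x-bar.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.diag(H).argmax() == np.abs(x - x.mean()).argmax())   # True
```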

Example

Polynomial regression as linear regression

Fitting a polynomial y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots is still linear regression. The design matrix has columns [1, x, x^2, \ldots]. The model is linear in the parameters, not the features. This is why the term "linear" in linear regression refers to linearity in w, not in x.
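
The same point in code (the degree and data are illustrative): the cubic fit is an ordinary least-squares solve on the expanded design matrix [1, x, x^2, x^3].

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-2, 2, 60)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.3 * rng.normal(size=x.size)

degree = 3
X = np.vander(x, N=degree + 1, increasing=True)   # columns [1, x, x^2, x^3]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # roughly [1.0, -2.0, 0.0, 0.5]: linear in the coefficients, not in x
```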

Example

OLS on XOR by hand: every linear fit predicts the mean

The four-point XOR dataset is the canonical case of a linearly inseparable classification problem. Working through OLS on it by hand makes the failure mode concrete and sets up the bridge to nonlinear models. Take

X_{\text{aug}} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}

with the leading column of ones absorbing the intercept. Compute the Gram matrix and the right-hand side directly:

X_{\text{aug}}^\top X_{\text{aug}} = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}, \quad X_{\text{aug}}^\top \mathbf{y} = \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}.

Solving (X_{\text{aug}}^\top X_{\text{aug}})\, \hat{\beta} = X_{\text{aug}}^\top \mathbf{y} row by row gives \hat{\beta} = (1/2,\, 0,\, 0)^\top. The intercept absorbs the mean of \mathbf{y} and both slopes are exactly zero. Predictions are the constant \hat{y}_i = 1/2 on every input.

Drop the intercept and refit with X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}. Then X^\top X = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, X^\top \mathbf{y} = (1, 1)^\top, (X^\top X)^{-1} = \tfrac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, and \hat{w} = (1/3,\, 1/3)^\top. Predictions are 0,\, 1/3,\, 1/3,\, 2/3: better than the constant fit but still wrong on three of four points.

Why this happens has a one-line proof. Both feature columns have mean 1/2, and both have zero sample covariance with \mathbf{y}: \widehat{\mathrm{Cov}}(x_1, y) = \widehat{\mathrm{Cov}}(x_2, y) = 0. The OLS slopes (with an intercept) solve \widehat{\mathrm{Cov}}(x)\, \hat{\beta}_{1:2} = \widehat{\mathrm{Cov}}(x, y), so a zero right-hand side forces both slopes to zero. Any labeling of \mathbf{y} that keeps both covariances at zero gets the same constant fit; XOR is the cleanest case where that happens. The diagram at the top of this page shows three candidate linear separators; each gets at least two of the four points wrong. Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), §3.2 walks through the same geometry. For the resolution, see feedforward networks and backpropagation, which uses this same dataset to motivate the 2-hidden-unit ReLU fix.
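
The hand computation can be reproduced in a few lines (NumPy; no assumptions beyond the dataset itself):

```python
import numpy as np

X_aug = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# With intercept: normal equations give beta = (1/2, 0, 0) and constant predictions 1/2.
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(beta, X_aug @ beta)

# Without intercept: w = (1/3, 1/3) and predictions 0, 1/3, 1/3, 2/3.
X = X_aug[:, 1:]
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w, X @ w)
```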

Common Confusions

Watch Out

OLS minimizes squared residuals, not perpendicular distances

A frequent misconception is that OLS minimizes the perpendicular (orthogonal) distance from each point to the regression line. It does not. OLS minimizes the vertical distances (residuals in y). Minimizing perpendicular distances gives total least squares (or orthogonal regression), which is a different estimator. The distinction matters when both x and y have measurement error.

Watch Out

Invertibility of X^T X is not guaranteed

The normal equations require X^\top X to be invertible, which happens when X has full column rank (\operatorname{rank}(X) = d). This fails when features are linearly dependent or when n < d. Ridge regression fixes this: the regularized normal-equation matrix X^\top X + \lambda I is always invertible for \lambda > 0.
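
A minimal sketch of the failure and two standard escapes, using a duplicated column as the illustrative cause of rank deficiency: the normal-equation matrix is singular, while ridge and the SVD pseudoinverse both return sensible answers.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])            # two identical columns -> rank 1
y = x1 + 0.1 * rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))    # 1, not 2: the normal equations are singular

lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)   # ridge: always solvable
w_minnorm = np.linalg.pinv(X) @ y        # minimum-norm OLS via the SVD pseudoinverse
print(w_ridge, w_minnorm)                # both split the weight ~evenly across the duplicates
```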

Summary

  • The normal equations X^\top X w = X^\top y are the first-order optimality conditions for least squares
  • OLS is an orthogonal projection of y onto the column space of X
  • The hat matrix H = X(X^\top X)^{-1}X^\top maps y to fitted values
  • Gauss-Markov: OLS is BLUE under homoscedastic uncorrelated errors
  • OLS = MLE under Gaussian noise; ridge = MAP with Gaussian prior
  • Residuals are orthogonal to every predictor: X^\top e = 0

Exercises

Exercise · Core

Problem

Show that the residual vector e = y - X\hat{w} satisfies X^\top e = 0. What does this mean geometrically?

Exercise · Core

Problem

For ridge regression with penalty \lambda, show that the solution can be written as \hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. What happens as \lambda \to 0 and \lambda \to \infty?

Exercise · Advanced

Problem

Prove that if \varepsilon \sim \mathcal{N}(0, \sigma^2 I) and y = Xw + \varepsilon, then the MLE for w is exactly \hat{w}_{\text{OLS}}. What is the MLE for \sigma^2?


References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3 (Linear Methods for Regression)
  • Seber & Lee, Linear Regression Analysis (2003), Chapters 3-4 (projection and hat matrix properties)
  • Wasserman, All of Statistics (2004), Chapter 13
  • Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 5 (orthogonal projections and QR decomposition for OLS)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11 (linear models, Bayesian interpretation)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3 (Linear Models for Regression)
  • Strang, Introduction to Linear Algebra (5th ed., 2016), Chapter 4 (orthogonality and projections)


Last reviewed: May 4, 2026
