
ML Methods

Linear Regression

Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.

Core · Tier 1 · Stable · Core spine · ~50 min

Why This Matters

theorem visual

XOR is not linearly separable

Each of the three candidate separating lines on the left misclassifies at least two of the four points. A 2-hidden-unit ReLU network on the right places both positive-class points (0,1) and (1,0) inside the strip 0.5 < x_1 + x_2 < 1.5 and both negative points outside it.

[Figure, left panel "no linear separator works": the four XOR corner points with three candidate lines, x_1 = 0.5 (2 wrong), x_2 = 0.5 (2 wrong), and x_1 + x_2 = 1 (3 wrong). Right panel "2-hidden-unit ReLU bends a strip": the same points with the strip bounded by x_1 + x_2 = 0.5 and x_1 + x_2 = 1.5. Legend: y = 1 (positive class), y = 0 (negative class).]

The OLS fit on this dataset, with intercept, is \hat{\beta} = (1/2,\, 0,\, 0)^\top — the model predicts the mean 1/2 at every point. Without intercept, \hat{w} = (1/3,\, 1/3)^\top, predicting 0,\, 1/3,\, 1/3,\, 2/3 on the four corners. Both confirm: no linear separator exists.

Linear regression is the most fundamental supervised learning method. Every idea in regression (projection, residuals, bias-variance, regularization) generalizes directly to more complex models. If you understand linear regression at the level of linear algebra — not just "fit a line" — you have the skeleton key for half of statistical learning.

The right way to think about OLS is not as a trend line drawn through points on a 2D scatterplot. The right picture is higher-dimensional: the columns of the design matrix X span a subspace of \mathbb{R}^n, the response vector y lives somewhere in that ambient space, and the fitted response \hat y is the orthogonal projection of y onto \operatorname{col}(X). Everything else follows from that projection picture: the normal equations, the hat matrix, residual orthogonality, leverage, and the variance formula.

That is why linear regression remains foundational even when nobody actually cares about straight lines. The same ideas reappear in generalized linear models, kernel methods, ridge regression, Gaussian process regression, and modern overparameterized interpolation theory.

Quick Version

  • Optimization view. OLS picks the weight vector minimizing squared residuals.
  • Geometry view. The fitted response \hat y is the orthogonal projection of y onto the column space of X.
  • Algebra view. The normal equations X^\top X \hat w = X^\top y are the first-order condition for that projection problem.
  • Statistics view. Under the Gauss-Markov assumptions, OLS is the best linear unbiased estimator, and under Gaussian noise it is also the maximum likelihood estimator.

Mental Model

You have data points in a high-dimensional space. The columns of your design matrix X span a subspace. OLS finds the point in that subspace closest to your target vector y. This is an orthogonal projection. The residuals are the component of y orthogonal to the column space of X.

Formal Setup

We observe n input-output pairs. Stack the inputs into a design matrix X \in \mathbb{R}^{n \times d} and responses into y \in \mathbb{R}^n. We seek a weight vector w \in \mathbb{R}^d minimizing the sum of squared residuals.

Definition

Ordinary Least Squares

The OLS estimator minimizes the squared loss:

\hat{w}_{\text{OLS}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2

Setting the gradient to zero yields the normal equations X^\top X \hat{w} = X^\top y. When X^\top X is invertible, the closed form is

\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y.

The derivation is below.
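
A minimal NumPy sketch of the closed form (synthetic data; the variable names and dimensions are illustrative choices, not from the text): it solves the normal equations directly, compares against numpy.linalg.lstsq, and checks the residual orthogonality X^\top e = 0 established in the derivation below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) w = X^T y.  Prefer solve() over forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver (numerically safer for ill-conditioned X).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))                        # True: same estimator
print(np.allclose(X.T @ (y - X @ w_normal), 0, atol=1e-8))   # residual orthogonality X^T e = 0
```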

Deriving the Normal Equations

The OLS objective L(w) = \|y - Xw\|_2^2 is a quadratic function of w. The derivation is short enough to walk through entirely.

Step 1: Expand the squared norm. Using \|v\|_2^2 = v^\top v,

L(w) = (y - Xw)^\top (y - Xw) = y^\top y - 2\, w^\top X^\top y + w^\top X^\top X\, w.

The cross term -2\, w^\top X^\top y uses the fact that y^\top X w = (X w)^\top y = w^\top X^\top y, so the two cross terms combine into a single term with coefficient -2.

Step 2: Compute the gradient. Standard matrix-calculus identities give \nabla_w (a^\top w) = a and \nabla_w (w^\top A w) = 2 A w when A is symmetric (and X^\top X is symmetric by construction). So

\nabla_w L(w) = -2\, X^\top y + 2\, X^\top X\, w.

Step 3: Set the gradient to zero. A first-order optimum satisfies \nabla_w L(\hat w) = 0:

X^\top X\, \hat w = X^\top y. \qquad \text{(Normal equations.)}

The objective is convex in w (its Hessian 2 X^\top X is positive semidefinite), so any first-order optimum is a global minimum.

Step 4: Verify and invert (when full rank). If X has full column rank, X^\top X is positive definite (not just semidefinite) and therefore invertible. Multiplying both sides by (X^\top X)^{-1} gives the closed form

\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y.

The matrix X^+ \equiv (X^\top X)^{-1} X^\top is the Moore-Penrose pseudoinverse of X in the full-rank case; the closed form is \hat{w}_{\text{OLS}} = X^+ y. When X is rank-deficient, X^\top X is singular and the same pseudoinverse formula extends via the SVD X = U \Sigma V^\top to X^+ = V \Sigma^+ U^\top, where \Sigma^+ inverts the nonzero singular values and zeros out the rest. This is the minimum-norm OLS solution.

Geometric reading. X^\top \hat e = 0 — the residual is orthogonal to every column of X. That is the entire content of OLS in one line: the residual is perpendicular to the column space. The normal equations are not a separate algebraic fact; they are the orthogonality condition rewritten in matrix form. See the next section for the projection picture.

Definition

Hat Matrix

The hat matrix (or projection matrix) is:

H = X(X^\top X)^{-1} X^\top

It projects y onto the column space of X: the fitted values are \hat{y} = Hy. The matrix "puts the hat on y."

Key properties: H is symmetric and idempotent (H^2 = H), \text{tr}(H) = d, and its eigenvalues are all 0 or 1.

Definition

Residuals

The residual vector is:

e = y - \hat{y} = (I - H)y

By the normal equations, X^\top e = 0. Residuals are orthogonal to every column of X. This is the geometric content of OLS.

Geometric Interpretation: Projection onto the Column Space

OLS picks ŷ ∈ col(X) closest to y. The residual e = y − ŷ is orthogonal to col(X) — that is the entire content of the normal equations Xᵀe = 0.

[Figure "OLS as projection": the amber plane is col(X), a d-dimensional subspace of ℝⁿ; y sits in ℝⁿ, ŷ is the closest point to y inside the plane, and the residual e = y − ŷ is drawn perpendicular to the plane. Annotations: hat-matrix form ŷ = X(XᵀX)⁻¹Xᵀy = Hy; normal equations Xᵀe = 0, e ⊥ every column of X; Pythagorean/ANOVA identity ‖y‖² = ‖ŷ‖² + ‖e‖².]

The OLS solution has a direct geometric meaning. The column space of X, denoted \text{col}(X), is a d-dimensional subspace of \mathbb{R}^n. The fitted vector \hat{y} = Xw must lie in \text{col}(X).

OLS finds the point in \text{col}(X) closest to y in Euclidean distance. By the projection theorem in inner product spaces, this is the orthogonal projection of y onto \text{col}(X). The residual e = y - \hat{y} is perpendicular to \text{col}(X), which is exactly the statement X^\top e = 0.

The hat matrix H = X(X^\top X)^{-1}X^\top is the orthogonal projection matrix onto \text{col}(X). Its key properties follow directly from the geometry:

  • Idempotent: H^2 = H. Projecting twice is the same as projecting once.
  • Symmetric: H = H^\top. Orthogonal projections are self-adjoint.
  • Eigenvalues 0 or 1: Vectors in \text{col}(X) satisfy Hv = v; vectors in \text{col}(X)^\perp satisfy Hv = 0.
  • Trace: \text{tr}(H) = d, the dimension of the projected subspace.

The complementary projector I - H projects onto \text{col}(X)^\perp, the orthogonal complement. Residuals satisfy e = (I - H)y, so e \perp \hat{y} and the Pythagorean theorem gives \|y\|^2 = \|\hat{y}\|^2 + \|e\|^2, which is the ANOVA decomposition of total sum of squares.

The matrix H is positive semidefinite with eigenvalues in \{0, 1\}; the diagonal entry h_{ii} (the leverage of observation i) satisfies 0 \leq h_{ii} \leq 1 and measures how much the i-th observation "controls" its own fitted value. An observation with h_{ii} = 1 would exactly determine its own prediction regardless of other observations.
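
A short numerical check of these properties, under the assumption of a random full-rank design (NumPy; the dimensions are illustrative): it builds H, verifies symmetry, idempotence, the trace identity, the leverage bounds, and the Pythagorean/ANOVA identity from the previous paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.normal(size=(n, d))

# H = X (X^T X)^{-1} X^T, built without forming an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, H.T))          # symmetric
print(np.allclose(H @ H, H))        # idempotent: projecting twice = projecting once
print(np.isclose(np.trace(H), d))   # trace equals the subspace dimension d

leverages = np.diag(H)
print(leverages.min(), leverages.max())   # all leverages lie in [0, 1]

# Pythagorean / ANOVA identity on a synthetic response.
y = X @ rng.normal(size=d) + rng.normal(size=n)
y_hat = H @ y
e = y - y_hat
print(np.isclose(y @ y, y_hat @ y_hat + e @ e))   # ||y||^2 = ||y_hat||^2 + ||e||^2
```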

When X^\top X is not invertible (rank-deficient design matrix), the Moore-Penrose pseudoinverse X^+ replaces (X^\top X)^{-1}X^\top, and H = XX^+ is still the orthogonal projection onto the column space of X; its dimension is now \operatorname{rank}(X) < d.

Ridge Regression as Regularized OLS

Definition

Ridge Regression

Ridge regression adds an \ell_2 penalty:

\hat{w}_{\text{ridge}} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2

The closed-form solution is:

\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y

The addition of \lambda I ensures invertibility and shrinks the estimate toward zero, trading bias for reduced variance.
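
A small sketch of the shrinkage behavior (synthetic data; the dimensions and \lambda grid are illustrative): the ridge closed form is evaluated for increasing \lambda, the norm of the estimate decreases monotonically toward zero, and \lambda \to 0 recovers OLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge(X, y, lam)
    print(f"lambda={lam:7.1f}   ||w_ridge|| = {np.linalg.norm(w):.4f}")
# The printed norms shrink as lambda grows; lambda = 0 reproduces the OLS estimate.
```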

Main Theorems

Theorem

Gauss-Markov Theorem

Statement

Under the linear model y = Xw + \varepsilon where \mathbb{E}[\varepsilon] = 0 and \text{Var}(\varepsilon) = \sigma^2 I, the OLS estimator \hat{w}_{\text{OLS}} is the Best Linear Unbiased Estimator (BLUE). That is, among all unbiased estimators that are linear in y, OLS has the smallest variance (in the matrix sense):

\text{Var}(\tilde{w}) - \text{Var}(\hat{w}_{\text{OLS}}) \succeq 0

for any other linear unbiased estimator \tilde{w}.

Intuition

OLS is not just an unbiased linear estimator. It is the best one. You cannot reduce variance by using a different linear combination of y without introducing bias. If you want lower variance, you must either accept bias (ridge regression) or use nonlinear methods.

Proof Sketch

Let \tilde{w} = Cy be any linear unbiased estimator. Unbiasedness requires CX = I. Write C = (X^\top X)^{-1}X^\top + D; then DX = CX - I = 0. Then

\operatorname{Var}(\tilde{w}) = \sigma^2 C C^\top = \sigma^2(X^\top X)^{-1} + \sigma^2 DD^\top \succeq \sigma^2(X^\top X)^{-1} = \operatorname{Var}(\hat{w}_{\text{OLS}}),

where the cross terms vanish because (X^\top X)^{-1}X^\top D^\top = (X^\top X)^{-1}(DX)^\top = 0, and DD^\top is positive semidefinite. Equality holds iff D = 0, i.e. \tilde{w} = \hat{w}_{\text{OLS}}.
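
The proof sketch can also be checked by simulation. The snippet below is a sketch under stated assumptions (fixed synthetic design, Gaussian noise); the construction D = G(I - H) is one convenient way, not the only one, to satisfy DX = 0. It builds a competing linear unbiased estimator and compares sampled variances against OLS.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 40, 3, 1.0
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])

A = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T: the OLS map
H = X @ A
G = 0.3 * rng.normal(size=(d, n))
D = G @ (np.eye(n) - H)                # D X = 0 by construction
C = A + D                              # a competing linear unbiased estimator (C X = I)

ols_draws, alt_draws = [], []
for _ in range(20000):
    y = X @ w_true + sigma * rng.normal(size=n)
    ols_draws.append(A @ y)
    alt_draws.append(C @ y)

print(np.mean(ols_draws, axis=0), np.mean(alt_draws, axis=0))   # both ~ w_true: unbiased
print(np.var(ols_draws, axis=0).sum(), np.var(alt_draws, axis=0).sum())
# the competitor's total variance exceeds OLS's, as the theorem predicts
```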

Why It Matters

The Gauss-Markov theorem tells you exactly what you give up by regularizing. Ridge regression is biased, so it falls outside the Gauss-Markov scope. But the bias-variance tradeoff can still make it better in terms of mean squared error. The theorem defines the frontier of what unbiased linear estimation can achieve.

Failure Mode

If errors are heteroscedastic or correlated (so the assumption \operatorname{Var}(\varepsilon) = \sigma^2 I fails), OLS is no longer BLUE. Generalized least squares (GLS) reclaims optimality by accounting for the error covariance structure.

Connection to Maximum Likelihood

Under the Gaussian noise model y = Xw + \varepsilon with \varepsilon \sim \mathcal{N}(0, \sigma^2 I), the log-likelihood is:

\log p(y \mid X, w, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - Xw\|_2^2

Maximizing over w is equivalent to minimizing \|y - Xw\|_2^2, which is exactly OLS. So OLS = MLE under Gaussian noise. This also means ridge regression corresponds to MAP estimation with a Gaussian prior on w, and the regularization strength \lambda is the inverse prior variance up to a \sigma^2 factor.
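
A quick numerical confirmation, assuming Gaussian noise and using scipy.optimize for the maximization (an implementation choice, not part of the text): the joint MLE in (w, \sigma^2) matches the OLS weights and \hat{\sigma}^2_{\text{MLE}} = \|y - X\hat{w}\|^2 / n up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 80, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, 0.0, -1.0]) + 0.7 * rng.normal(size=n)

def neg_log_likelihood(params):
    # Parameterize sigma^2 through its log so the optimizer stays unconstrained.
    w, log_s2 = params[:d], params[d]
    s2 = np.exp(log_s2)
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * resid @ resid / s2

res = minimize(neg_log_likelihood, x0=np.zeros(d + 1), method="BFGS")

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_mle = np.sum((y - X @ w_ols) ** 2) / n

print(np.allclose(res.x[:d], w_ols, atol=1e-3))              # MLE weights match OLS
print(np.isclose(np.exp(res.x[d]), sigma2_mle, atol=1e-3))   # MLE sigma^2 = ||e||^2 / n
```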

Bias-Variance Tradeoff: A Numeric Walkthrough

The OLS estimator is unbiased under the Gauss-Markov assumptions, but its variance can be uncomfortably large when the design is poorly conditioned or when n is close to d. Ridge regression trades a controlled amount of bias for a substantial variance reduction. The numbers below make this concrete.

Setup. n = 50 observations, d = 40 features, true coefficients w^* with \|w^*\|^2 = 4, Gaussian noise \varepsilon \sim \mathcal{N}(0, \sigma^2 I) with \sigma = 1. Use a design with orthogonal columns scaled so that X^\top X = n I, which keeps the formulas clean (real designs with n close to d behave qualitatively the same way).

Closed forms for the ridge estimator. With this design,

\hat{w}_\lambda = (n + \lambda)^{-1}\bigl(n\, w^* + X^\top \varepsilon\bigr), \quad \mathbb{E}[\hat{w}_\lambda] = \frac{n}{n + \lambda}\, w^*.

  • Bias squared: \|w^* - \mathbb{E}[\hat{w}_\lambda]\|^2 = \bigl(\lambda/(n + \lambda)\bigr)^2 \|w^*\|^2.
  • Variance: \mathbb{E}\|\hat{w}_\lambda - \mathbb{E}[\hat{w}_\lambda]\|^2 = n\sigma^2 d / (n + \lambda)^2.
  • Mean squared error: bias² + variance.

Optimal \lambda^* from \mathrm{d}\,\mathrm{MSE}/\mathrm{d}\lambda = 0: \lambda^* = \sigma^2 d / \|w^*\|^2 = 40 / 4 = 10.

Numeric ledger. Plugging the closed forms with n = 50, d = 40, \sigma^2 = 1, \|w^*\|^2 = 4:

| λ   | Bias² | Variance | MSE   | Note                                            |
|-----|-------|----------|-------|-------------------------------------------------|
| 0   | 0.000 | 0.800    | 0.800 | OLS — unbiased, all variance                    |
| 1   | 0.002 | 0.769    | 0.770 | tiny bias, slightly lower MSE                   |
| 5   | 0.033 | 0.661    | 0.694 | within 5% of optimal                            |
| 10  | 0.111 | 0.556    | 0.667 | λ* = σ²d/‖w*‖² — minimum MSE                    |
| 25  | 0.444 | 0.356    | 0.800 | matches OLS MSE; bias dominates                 |
| 50  | 1.000 | 0.200    | 1.200 | over-shrunk; variance reduction no longer pays  |
| 100 | 1.778 | 0.089    | 1.867 | strong shrinkage; MSE much worse than OLS       |

Reading the table. OLS sits at \lambda = 0 with zero bias and MSE 0.800. The optimal ridge shrinkage at \lambda^* = 10 accepts a bias-squared cost of 0.111 to drop variance from 0.800 to 0.556, netting a 17% MSE improvement. Beyond \lambda^* the bias grows quadratically and ridge eventually becomes worse than OLS. The \lambda = 25 row is the equal-MSE point on the bias-heavy side; past that, you are paying for shrinkage you no longer need.

The same shape — a U-curve in MSE with its minimum at some \lambda^* > 0 — appears in non-orthonormal designs and is the bias-variance tradeoff in action. The orthonormal case is special because \lambda^* has a closed form; in practice you locate it by cross-validation, but the structural picture is the same.
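
A Monte Carlo sketch of the ledger above. The construction of X and the particular w^* are illustrative choices consistent with the stated setup (any w^* with \|w^*\|^2 = 4 gives the same numbers); the simulated MSEs should land within roughly a percent of the closed-form column.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 50, 40, 1.0
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns
X = np.sqrt(n) * Q                              # scaled so that X^T X = n I
w_star = np.full(d, 2.0 / np.sqrt(d))           # ||w*||^2 = 4

for lam in [0, 1, 5, 10, 25, 50, 100]:
    errs = []
    for _ in range(2000):
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat = X.T @ y / (n + lam)             # ridge closed form when X^T X = n I
        errs.append(np.sum((w_hat - w_star) ** 2))
    bias2 = (lam / (n + lam)) ** 2 * 4.0
    var = n * sigma**2 * d / (n + lam) ** 2
    print(f"lam={lam:3d}  MC MSE={np.mean(errs):.3f}  formula={bias2 + var:.3f}")
```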

Canonical Examples

Example

Simple linear regression in 2D

With a single feature and intercept, X = [\mathbf{1}, x] where x is the feature vector. The normal equations give the familiar slope and intercept:

\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The hat matrix H has diagonal entries h_{ii} called leverages. Points with high leverage (far from \bar{x}) have outsized influence on the fit.
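
A short sketch (illustrative data) confirming that the scalar slope/intercept formulas agree with the matrix solution, and that the highest-leverage point is the one farthest from \bar{x}.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.7 * x + rng.normal(scale=2.0, size=50)

# Textbook formulas.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Matrix solution on the design [1, x].
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose([beta0, beta1], coef))      # True

# Leverage is largest for the observation farthest from x-bar.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.diag(H).argmax() == np.abs(x - x.mean()).argmax())   # True
```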

Example

Polynomial regression as linear regression

Fitting a polynomial y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots is still linear regression. The design matrix has columns [1, x, x^2, \ldots]. The model is linear in the parameters, not the features. This is why the term "linear" in linear regression refers to linearity in w, not in x.
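
The same point in code (the degree and data are illustrative): the cubic fit is an ordinary least-squares solve on the expanded design matrix [1, x, x^2, x^3].

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-2, 2, 60)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.3 * rng.normal(size=x.size)

degree = 3
X = np.vander(x, N=degree + 1, increasing=True)   # columns [1, x, x^2, x^3]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # roughly [1.0, -2.0, 0.0, 0.5]: linear in the coefficients, not in x
```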

Example

OLS on XOR by hand: every linear fit predicts the mean

The four-point XOR dataset is the canonical case of a linearly inseparable classification problem. Working through OLS on it by hand makes the failure mode concrete and sets up the bridge to nonlinear models. Take

X_{\text{aug}} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}

with the leading column of ones absorbing the intercept. Compute the Gram matrix and the right-hand side directly:

X_{\text{aug}}^\top X_{\text{aug}} = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}, \quad X_{\text{aug}}^\top \mathbf{y} = \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}.

Solving (X_{\text{aug}}^\top X_{\text{aug}})\, \hat{\beta} = X_{\text{aug}}^\top \mathbf{y} row by row gives \hat{\beta} = (1/2,\, 0,\, 0)^\top. The intercept absorbs the mean of \mathbf{y} and both slopes are exactly zero. Predictions are the constant \hat{y}_i = 1/2 on every input.

Drop the intercept and refit with X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}. Then X^\top X = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, X^\top \mathbf{y} = (1, 1)^\top, (X^\top X)^{-1} = \tfrac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, and \hat{w} = (1/3,\, 1/3)^\top. Predictions are 0,\, 1/3,\, 1/3,\, 2/3: better than the constant fit but still wrong on three of four points.

Why this happens has a one-line proof. Both feature columns have mean 1/2, and both have zero sample covariance with \mathbf{y}: \widehat{\mathrm{Cov}}(x_1, y) = \widehat{\mathrm{Cov}}(x_2, y) = 0. The OLS slopes (with an intercept) solve \widehat{\mathrm{Cov}}(x)\, \hat{\beta}_{1:2} = \widehat{\mathrm{Cov}}(x, y), so a zero right-hand side forces both slopes to zero. Any labeling of \mathbf{y} that keeps both covariances at zero gets the same constant fit; XOR is the cleanest case where that happens. The diagram at the top of this page shows three candidate linear separators; each gets at least two of the four points wrong. Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), §3.2 walks through the same geometry. For the resolution, see feedforward networks and backpropagation, which uses this same dataset to motivate the 2-hidden-unit ReLU fix.
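
The hand computation can be reproduced in a few lines (NumPy; no assumptions beyond the dataset itself):

```python
import numpy as np

X_aug = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# With intercept: normal equations give beta = (1/2, 0, 0) and constant predictions 1/2.
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(beta, X_aug @ beta)

# Without intercept: w = (1/3, 1/3) and predictions 0, 1/3, 1/3, 2/3.
X = X_aug[:, 1:]
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w, X @ w)
```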

Common Confusions

Watch Out

OLS minimizes squared residuals, not perpendicular distances

A frequent misconception is that OLS minimizes the perpendicular (orthogonal) distance from each point to the regression line. It does not. OLS minimizes the vertical distances (residuals in y). Minimizing perpendicular distances gives total least squares (or orthogonal regression), which is a different estimator. The distinction matters when both x and y have measurement error.

Watch Out

Invertibility of X^T X is not guaranteed

The normal equations require X^\top X to be invertible, which happens when X has full column rank (\operatorname{rank}(X) = d). This fails when features are linearly dependent or when n < d. Ridge regression fixes this: the regularized normal-equation matrix X^\top X + \lambda I is always invertible for \lambda > 0.
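
A minimal sketch of the failure and two standard escapes, using a duplicated column as the illustrative cause of rank deficiency: the normal-equation matrix is singular, while ridge and the SVD pseudoinverse both return sensible answers.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1])            # two identical columns -> rank 1
y = x1 + 0.1 * rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))    # 1, not 2: the normal equations are singular

lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)   # ridge: always solvable
w_minnorm = np.linalg.pinv(X) @ y        # minimum-norm OLS via the SVD pseudoinverse
print(w_ridge, w_minnorm)                # both split the weight ~evenly across the duplicates
```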

Summary

  • The normal equations X^\top X w = X^\top y are the first-order optimality conditions for least squares
  • OLS is an orthogonal projection of y onto the column space of X
  • The hat matrix H = X(X^\top X)^{-1}X^\top maps y to fitted values
  • Gauss-Markov: OLS is BLUE under homoscedastic uncorrelated errors
  • OLS = MLE under Gaussian noise; ridge = MAP with Gaussian prior
  • Residuals are orthogonal to every predictor: X^\top e = 0

Exercises

Exercise · Core

Problem

Show that the residual vector e = y - X\hat{w} satisfies X^\top e = 0. What does this mean geometrically?

Exercise · Core

Problem

For ridge regression with penalty \lambda, show that the solution can be written as \hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. What happens as \lambda \to 0 and \lambda \to \infty?

Exercise · Advanced

Problem

Prove that if \varepsilon \sim \mathcal{N}(0, \sigma^2 I) and y = Xw + \varepsilon, then the MLE for w is exactly \hat{w}_{\text{OLS}}. What is the MLE for \sigma^2?


References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3 (Linear Methods for Regression)
  • Seber & Lee, Linear Regression Analysis (2003), Chapters 3-4 (projection and hat matrix properties)
  • Wasserman, All of Statistics (2004), Chapter 13
  • Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 5 (orthogonal projections and QR decomposition for OLS)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11 (linear models, Bayesian interpretation)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3 (Linear Models for Regression)
  • Strang, Introduction to Linear Algebra (5th ed., 2016), Chapter 4 (orthogonality and projections)


Last reviewed: May 4, 2026
