ML Methods
Linear Regression
Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.
Why This Matters
Linear regression is the most fundamental supervised learning method. Every idea in regression (projection, residuals, bias-variance, regularization) generalizes directly to more complex models. If you understand linear regression at the level of linear algebra — not just "fit a line" — you have the skeleton key for half of statistical learning.
The right way to think about OLS is not as a trend line drawn through points on a 2D scatterplot. The right picture is higher-dimensional: the design matrix $X$ spans a subspace of $\mathbb{R}^n$, the response vector $y$ lives somewhere in that ambient space, and the fitted response $\hat{y}$ is the orthogonal projection of $y$ onto $\mathrm{col}(X)$. Everything else follows from that projection picture: the normal equations, the hat matrix, residual orthogonality, leverage, and the variance formula.
That is why linear regression remains foundational even when nobody actually cares about straight lines. The same ideas reappear in generalized linear models, kernel methods, ridge regression, Gaussian process regression, and modern overparameterized interpolation theory.
Quick Version
- Optimization view. OLS picks the weight vector $\hat{w}$ minimizing the squared residuals $\|y - Xw\|_2^2$.
- Geometry view. The fitted response $\hat{y} = X\hat{w}$ is the orthogonal projection of $y$ onto the column space of $X$.
- Algebra view. The normal equations are the first-order condition for that projection problem.
- Statistics view. Under the Gauss-Markov assumptions, OLS is the best linear unbiased estimator, and under Gaussian noise it is also the maximum likelihood estimator.
Mental Model
You have $n$ data points in a high-dimensional space. The columns of your design matrix $X$ span a subspace. OLS finds the point in that subspace closest to your target vector $y$. This is an orthogonal projection. The residuals are the component of $y$ orthogonal to the column space of $X$.
Formal Setup
We observe $n$ input-output pairs $(x_i, y_i)$. Stack the inputs into a design matrix $X \in \mathbb{R}^{n \times p}$ and the responses into $y \in \mathbb{R}^n$. We seek a weight vector $w \in \mathbb{R}^p$ minimizing the sum of squared residuals.
Ordinary Least Squares
The OLS estimator minimizes the squared loss:

$$\hat{w} = \arg\min_{w} \|y - Xw\|_2^2.$$

Setting the gradient to zero yields the normal equations $X^\top X \hat{w} = X^\top y$. When $X^\top X$ is invertible, the closed form is

$$\hat{w} = (X^\top X)^{-1} X^\top y.$$

The derivation is below.
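Before the derivation, a quick numerical sanity check. A minimal sketch (assuming NumPy; the synthetic data and variable names are illustrative): solve the normal equations directly and compare against NumPy's built-in least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))            # random design, full column rank w.p. 1
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y rather than forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library route: QR/SVD-based least squares, numerically preferable in practice.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
print(w_normal)   # close to w_true
```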
Deriving the Normal Equations
The OLS objective is a quadratic function of . The derivation is short enough to walk through entirely.
Step 1: Expand the squared norm. Using $\|v\|_2^2 = v^\top v$,

$$\|y - Xw\|_2^2 = y^\top y - 2\, w^\top X^\top y + w^\top X^\top X w.$$

The cross term uses the fact that $y^\top X w = w^\top X^\top y$ (a scalar equals its own transpose), so the two off-diagonal blocks combine into a single coefficient.
Step 2: Compute the gradient. Standard matrix-calculus identities give $\nabla_w (w^\top b) = b$ and $\nabla_w (w^\top A w) = 2Aw$ when $A$ is symmetric (and $X^\top X$ is symmetric by construction). So

$$\nabla_w \|y - Xw\|_2^2 = 2\, X^\top X w - 2\, X^\top y.$$

Step 3: Set the gradient to zero. A first-order optimum satisfies $\nabla_w \|y - Xw\|_2^2 = 0$:

$$X^\top X \hat{w} = X^\top y.$$
The objective is convex in $w$ (its Hessian $2\,X^\top X$ is positive semidefinite), so any first-order optimum is a global minimum.
Step 4: Verify and invert (when full rank). If $X$ has full column rank, $X^\top X$ is positive definite (not just semidefinite) and therefore invertible. Multiplying both sides by $(X^\top X)^{-1}$ gives the closed form

$$\hat{w} = (X^\top X)^{-1} X^\top y.$$
The matrix $(X^\top X)^{-1} X^\top$ is the Moore-Penrose pseudoinverse $X^+$ of $X$ in the full-rank case; the closed form is $\hat{w} = X^+ y$. When $X$ is rank-deficient, $X^\top X$ is singular and the same pseudoinverse formula extends via the SVD $X = U \Sigma V^\top$ to $X^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ inverts the nonzero singular values and zeros out the rest. This is the minimum-norm OLS solution.
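A short check of the rank-deficient case, assuming NumPy and a synthetic design with a duplicated direction: the SVD construction above matches `np.linalg.pinv` and returns the minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
a = rng.normal(size=n)
X = np.column_stack([a, 2 * a, rng.normal(size=n)])  # column 2 = 2 * column 1: rank 2, p = 3
y = rng.normal(size=n)

# Pseudoinverse via SVD: invert nonzero singular values, zero out the rest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)
w_minnorm = Vt.T @ (s_inv * (U.T @ y))

# Same answer as NumPy's pinv; this is the minimum-norm least-squares solution.
assert np.allclose(w_minnorm, np.linalg.pinv(X) @ y)
```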
Geometric reading. $X^\top (y - X\hat{w}) = 0$ — the residual is orthogonal to every column of $X$. That is the entire content of OLS in one line: the residual is perpendicular to the column space. The normal equations are not a separate algebraic fact; they are the orthogonality condition rewritten in matrix form. See the next section for the projection picture.
Hat Matrix
The hat matrix (or projection matrix) is:

$$H = X (X^\top X)^{-1} X^\top.$$

It projects $y$ onto the column space of $X$: the fitted values are $\hat{y} = Hy$. The matrix "puts the hat on $y$."

Key properties: $H$ is symmetric and idempotent ($H^2 = H$), $\mathrm{tr}(H) = p$, and its eigenvalues are all 0 or 1.
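These properties are easy to verify numerically. A sketch assuming NumPy and a random full-rank design:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T

assert np.allclose(H, H.T)              # symmetric
assert np.allclose(H @ H, H)            # idempotent: H^2 = H
assert np.isclose(np.trace(H), p)       # trace equals the number of columns
eigvals = np.sort(np.linalg.eigvalsh(H))
assert np.allclose(eigvals[-p:], 1) and np.allclose(eigvals[:-p], 0)
```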
Residuals
The residual vector is:

$$e = y - \hat{y} = (I - H)\, y.$$

By the normal equations, $X^\top e = 0$. Residuals are orthogonal to every column of $X$. This is the geometric content of OLS.
Geometric Interpretation: Projection onto the Column Space
OLS picks $\hat{y} \in \mathrm{col}(X)$ closest to $y$. The residual $e = y - \hat{y}$ is orthogonal to $\mathrm{col}(X)$ — that is the entire content of the normal equations: $X^\top e = 0$.
The OLS solution has a direct geometric meaning. The column space of $X$, denoted $\mathrm{col}(X)$, is a $p$-dimensional subspace of $\mathbb{R}^n$. The fitted vector $\hat{y} = X\hat{w}$ must lie in $\mathrm{col}(X)$.

OLS finds the point in $\mathrm{col}(X)$ closest to $y$ in Euclidean distance. By the projection theorem in inner product spaces, this is the orthogonal projection of $y$ onto $\mathrm{col}(X)$. The residual $e = y - \hat{y}$ is perpendicular to $\mathrm{col}(X)$, which is exactly the statement $X^\top e = 0$.
The hat matrix $H = X (X^\top X)^{-1} X^\top$ is the orthogonal projection matrix onto $\mathrm{col}(X)$. Its key properties follow directly from the geometry:
- Idempotent: $H^2 = H$. Projecting twice is the same as projecting once.
- Symmetric: $H^\top = H$. Orthogonal projections are self-adjoint.
- Eigenvalues 0 or 1: Vectors in $\mathrm{col}(X)$ satisfy $Hv = v$; vectors in $\mathrm{col}(X)^\perp$ satisfy $Hv = 0$.
- Trace: $\mathrm{tr}(H) = p$, the dimension of the projected subspace.
The complementary projector $M = I - H$ projects onto $\mathrm{col}(X)^\perp$, the orthogonal complement. Residuals satisfy $e = My$, so $\hat{y}^\top e = 0$ and the Pythagorean theorem gives $\|y\|_2^2 = \|\hat{y}\|_2^2 + \|e\|_2^2$, which is the ANOVA decomposition of total sum of squares.
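A quick numerical confirmation of the orthogonality and Pythagorean identities, again with NumPy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat, e = H @ y, (np.eye(n) - H) @ y

assert np.allclose(X.T @ e, 0)                    # residual orthogonal to col(X)
assert np.isclose(y @ y, y_hat @ y_hat + e @ e)   # ||y||^2 = ||y_hat||^2 + ||e||^2
```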
The matrix $H$ is positive semidefinite with eigenvalues in $\{0, 1\}$; the diagonal entry $h_{ii}$ (the leverage of observation $i$) satisfies $0 \le h_{ii} \le 1$ and measures how much the $i$-th observation "controls" its own fitted value. An observation with $h_{ii} = 1$ would exactly determine its own prediction regardless of other observations.
When $X^\top X$ is not invertible (rank-deficient design matrix), the Moore-Penrose pseudoinverse $(X^\top X)^+$ replaces $(X^\top X)^{-1}$, and $H = X (X^\top X)^+ X^\top$ still defines an orthogonal projection, but now onto the column space of $X$, which has dimension less than $p$.
Ridge Regression as Regularized OLS
Ridge regression adds an $\ell_2$ penalty:

$$\hat{w}_\lambda = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2.$$

The closed-form solution is:

$$\hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.$$

The addition of $\lambda I$ ensures invertibility for every $\lambda > 0$ and shrinks the estimate toward zero, trading bias for reduced variance.
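A minimal sketch of the ridge closed form, assuming NumPy; the `ridge` helper is illustrative, not a library function:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (X^T X + lam I)^{-1} X^T y via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = X @ np.ones(5) + rng.normal(size=40)

w_ols = ridge(X, y, 0.0)          # lam = 0 recovers OLS (X^T X invertible here)
for lam in (0.1, 1.0, 10.0):
    w = ridge(X, y, lam)
    # Shrinkage: the coefficient norm decreases as lambda grows.
    assert np.linalg.norm(w) < np.linalg.norm(w_ols)
```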
Main Theorems
Gauss-Markov Theorem
Statement
Under the linear model $y = Xw + \varepsilon$ where $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Cov}(\varepsilon) = \sigma^2 I$, the OLS estimator is the Best Linear Unbiased Estimator (BLUE). That is, among all unbiased estimators that are linear in $y$, OLS has the smallest variance (in the matrix sense):

$$\mathrm{Var}(\tilde{w}) \succeq \mathrm{Var}(\hat{w})$$

for any other linear unbiased estimator $\tilde{w}$.
Intuition
OLS is not just an unbiased linear estimator. It is the best one. You cannot reduce variance by using a different linear combination of $y$ without introducing bias. If you want lower variance, you must either accept bias (ridge regression) or use nonlinear methods.
Proof Sketch
Let $\tilde{w} = Cy$ be any linear unbiased estimator. Unbiasedness requires $CX = I$. Write $C = (X^\top X)^{-1} X^\top + D$ where $DX = 0$. Then

$$\mathrm{Var}(\tilde{w}) = \sigma^2 C C^\top = \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top \succeq \sigma^2 (X^\top X)^{-1} = \mathrm{Var}(\hat{w}),$$

since $D D^\top$ is positive semidefinite. Equality holds iff $D = 0$, i.e. $\tilde{w} = \hat{w}$.
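The theorem can also be seen by simulation. The sketch below (assuming NumPy; the wrongly-weighted estimator is an arbitrary illustrative choice) compares OLS against another linear unbiased estimator under homoscedastic noise: both are unbiased, but OLS shows the smaller per-coordinate variance.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 40, 2, 1.0
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0])
W = np.diag(rng.uniform(0.5, 2.0, size=n))     # arbitrary (wrong) weights

A_ols = np.linalg.solve(X.T @ X, X.T)          # OLS: C y with C X = I
A_wls = np.linalg.solve(X.T @ W @ X, X.T @ W)  # another linear unbiased estimator

ests_ols, ests_wls = [], []
for _ in range(20000):
    y = X @ w_true + sigma * rng.normal(size=n)
    ests_ols.append(A_ols @ y)
    ests_wls.append(A_wls @ y)

# Both means are near w_true (unbiased); OLS variances are smaller, as promised.
print(np.mean(ests_ols, axis=0), np.mean(ests_wls, axis=0))
print(np.var(ests_ols, axis=0), np.var(ests_wls, axis=0))
```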
Why It Matters
The Gauss-Markov theorem tells you exactly what you give up by regularizing. Ridge regression is biased, so it falls outside the Gauss-Markov scope. But the bias-variance tradeoff can still make it better in terms of mean squared error. The theorem defines the frontier of what unbiased linear estimation can achieve.
Failure Mode
If errors are heteroscedastic or correlated (so the homoscedasticity assumption fails), OLS is no longer BLUE. Generalized least squares (GLS) reclaims optimality by accounting for the error covariance structure.
Connection to Maximum Likelihood
Under the Gaussian noise model $y = Xw + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the log-likelihood is:

$$\ell(w, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - Xw\|_2^2.$$

Maximizing over $w$ is equivalent to minimizing $\|y - Xw\|_2^2$, which is exactly OLS. So OLS = MLE under Gaussian noise. This also means ridge regression corresponds to MAP estimation with a Gaussian prior on $w$, and the regularization strength $\lambda$ is the inverse prior variance up to a factor of $\sigma^2$.
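The equivalence is easy to confirm numerically: minimizing the Gaussian negative log-likelihood over $w$ recovers the OLS solution. A sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(size=30)

def neg_log_lik(w):
    # Up to constants and the 1/(2 sigma^2) factor, the Gaussian negative
    # log-likelihood in w is just the residual sum of squares.
    r = y - X @ w
    return r @ r

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_mle, w_ols, atol=1e-5)
```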
Bias-Variance Tradeoff: A Numeric Walkthrough
The OLS estimator is unbiased under the Gauss-Markov assumptions, but its variance can be uncomfortably large when the design is poorly conditioned or when $p$ is close to $n$. Ridge regression trades a controlled amount of bias for a substantial variance reduction. The numbers below make this concrete.
Setup. $n = 50$ observations, $p = 10$ features, true coefficients $w$ with $\|w\|_2^2 = 4$, Gaussian noise with $\sigma^2 = 4$ (so $\sigma^2 p = 40$). Use an orthogonal design with $X^\top X = nI$ so the formulas stay clean (real designs with $p$ close to $n$ behave qualitatively the same way; the orthogonal case is a clean special case).
Closed forms for the ridge estimator. With $X^\top X = nI$, the ridge estimator is $\hat{w}_\lambda = \frac{n}{n+\lambda}\, \hat{w}_{\mathrm{OLS}}$, and:

- Bias squared: $\left(\dfrac{\lambda}{n+\lambda}\right)^2 \|w\|_2^2$.
- Variance: $\dfrac{n\, \sigma^2 p}{(n+\lambda)^2}$.
- Mean squared error: bias² + variance.

Optimal $\lambda^*$ from $\frac{d}{d\lambda}\,\mathrm{MSE}(\lambda) = 0$: $\lambda^* = \sigma^2 p / \|w\|_2^2 = 10$.
Numeric ledger. Plugging the closed forms with , , , :
| $\lambda$ | Bias² | Variance | MSE | Note |
|---|---|---|---|---|
| 0 | 0.000 | 0.800 | 0.800 | OLS — unbiased, all variance |
| 1 | 0.002 | 0.769 | 0.770 | tiny bias, slightly lower MSE |
| 5 | 0.033 | 0.661 | 0.694 | within 5% of optimal |
| 10 | 0.111 | 0.556 | 0.667 | $\lambda = \lambda^*$ — minimum MSE |
| 25 | 0.444 | 0.356 | 0.800 | matches OLS MSE; bias dominates |
| 50 | 1.000 | 0.200 | 1.200 | over-shrunk; variance reduction no longer pays |
| 100 | 1.778 | 0.089 | 1.867 | strong shrinkage; MSE much worse than OLS |
Reading the table. OLS sits at $\lambda = 0$ with zero bias and MSE 0.800. The optimal ridge shrinkage at $\lambda^* = 10$ accepts a bias-squared cost of 0.111 to drop variance from 0.800 to 0.556, netting a 17% MSE improvement. Beyond $\lambda^*$ the bias compounds quadratically and ridge eventually becomes strictly worse than OLS. The $\lambda = 25$ row is the equal-MSE point on the bias-heavy side; past that, you are paying for shrinkage you no longer need.
The same shape — a U-curve in MSE with the bottom at $\lambda^*$ — appears in non-orthogonal designs and is the bias-variance tradeoff made concrete. The orthogonal case is special because $\lambda^*$ has a closed form; in practice you locate it by cross-validation, but the structural picture is the same.
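The ledger above can be reproduced in a few lines of plain Python from the closed forms, using the setup values stated here ($n = 50$, $p = 10$, $\sigma^2 = 4$, $\|w\|_2^2 = 4$; any split with $\sigma^2 p = 40$ gives the same table):

```python
n, p = 50, 10
sigma2, w_norm2 = 4.0, 4.0

for lam in (0, 1, 5, 10, 25, 50, 100):
    bias2 = (lam / (n + lam)) ** 2 * w_norm2   # squared bias of ridge
    var = n * sigma2 * p / (n + lam) ** 2      # total variance of ridge
    print(f"lam={lam:>3}  bias2={bias2:.3f}  var={var:.3f}  mse={bias2 + var:.3f}")

print("lambda* =", sigma2 * p / w_norm2)   # 10.0, the MSE minimizer
```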
Canonical Examples
Simple linear regression in 2D
With a single feature and an intercept, the model is $y_i = w_0 + w_1 x_i + \varepsilon_i$, with design matrix $X = [\mathbf{1} \;\; x]$ where $x$ is the feature vector. The normal equations give the familiar slope and intercept:

$$\hat{w}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}.$$

The hat matrix has diagonal entries $h_{ii}$ called leverages. Points with high leverage (far from $\bar{x}$) have outsized influence on the fit.
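In code, the slope, intercept, and leverage formulas look like this (a sketch assuming NumPy; the leverage expression is the standard simple-regression special case $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = 2.0 + 3.0 * x + rng.normal(size=30)

xbar, ybar = x.mean(), y.mean()
slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
intercept = ybar - slope * xbar
print(intercept, slope)   # near 2 and 3

# Leverage grows with distance from xbar; the leverages sum to tr(H) = p = 2.
h = 1 / len(x) + (x - xbar) ** 2 / np.sum((x - xbar) ** 2)
assert np.isclose(h.sum(), 2.0)
```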
Polynomial regression as linear regression
Fitting a polynomial is still linear regression. The design matrix has columns $1, x, x^2, \ldots, x^d$. The model is linear in the parameters, not the features. This is why the term "linear" in linear regression refers to linearity in $w$, not in $x$.
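A small illustration, assuming NumPy: build the polynomial design with `np.vander` and fit by the normal equations.

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = 1 - 2 * x + 3 * x**2 + 0.1 * np.random.default_rng(8).normal(size=50)

# Degree-2 polynomial fit is just OLS on the design [1, x, x^2].
X = np.vander(x, N=3, increasing=True)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # approximately [1, -2, 3]
```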
OLS on XOR by hand: every linear fit predicts the mean
The four-point XOR dataset is the canonical case of a linearly inseparable classification problem. Working through OLS on it by hand makes the failure mode concrete and sets up the bridge to nonlinear models. Take

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix},$$

with the leading column of ones absorbing the intercept. Compute the Gram matrix and the right-hand side directly:

$$X^\top X = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}, \qquad X^\top y = \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}.$$

Solving row by row gives $\hat{w} = (1/2,\, 0,\, 0)$. The intercept absorbs the mean of $y$ and both slopes are exactly zero. Predictions are the constant $1/2$ on every input.

Drop the intercept and refit with $X = [x_1 \;\; x_2]$. Then $X^\top X = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, $X^\top y = (1, 1)^\top$, and $\hat{w} = (1/3,\, 1/3)$. Predictions are $(0,\, 1/3,\, 1/3,\, 2/3)$: better than the constant fit but still wrong on three of four points.
Why this happens has a one-line proof. Both feature columns have mean $1/2$, and both have zero sample covariance with $y$: $\sum_i (x_{ij} - \bar{x}_j)(y_i - \bar{y}) = 0$ for $j = 1, 2$. OLS slope coefficients are solely a function of these covariances. Any labeling that keeps the covariances zero gets the same fit; XOR is the cleanest case where that happens. The diagram at the top of this page shows three candidate linear separators; each gets at least two of the four points wrong. Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), §3.2 walks through the same geometry. For the resolution, see feedforward networks and backpropagation, which uses this same dataset to motivate the 2-hidden-unit ReLU fix.
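The hand computation is easy to check in NumPy (a sketch; nothing here beyond the matrices above):

```python
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)        # [0.5, 0.0, 0.0]: intercept absorbs the mean, slopes vanish
print(X @ w)    # constant 0.5 on all four inputs

# Without the intercept column, both slopes become 1/3.
X0 = X[:, 1:]
w0 = np.linalg.solve(X0.T @ X0, X0.T @ y)
print(w0, X0 @ w0)   # [1/3, 1/3], predictions [0, 1/3, 1/3, 2/3]
```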
Common Confusions
OLS minimizes squared residuals, not perpendicular distances
A frequent misconception is that OLS minimizes the perpendicular (orthogonal) distance from each point to the regression line. It does not. OLS minimizes the vertical distances (residuals in the $y$-direction). Minimizing perpendicular distances gives total least squares (or orthogonal regression), which is a different estimator. The distinction matters when both $x$ and $y$ have measurement error.
Invertibility of X^T X is not guaranteed
The normal equations require $X^\top X$ to be invertible, which happens exactly when $X$ has full column rank ($\mathrm{rank}(X) = p$). This fails when features are linearly dependent or when $p > n$. Ridge regression fixes this: the regularized normal-equation matrix $X^\top X + \lambda I$ is always invertible for $\lambda > 0$.
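A small demonstration, assuming NumPy and a deliberately duplicated feature:

```python
import numpy as np

rng = np.random.default_rng(9)
a = rng.normal(size=10)
X = np.column_stack([a, a])              # duplicated feature: rank 1, p = 2
y = rng.normal(size=10)

print(np.linalg.matrix_rank(X.T @ X))    # 1: the normal equations are singular

# Ridge restores invertibility for any lambda > 0.
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w_ridge)
```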
Summary
- The normal equations $X^\top X \hat{w} = X^\top y$ are the first-order optimality conditions for least squares
- OLS is an orthogonal projection of $y$ onto the column space of $X$
- The hat matrix $H = X (X^\top X)^{-1} X^\top$ maps $y$ to the fitted values $\hat{y} = Hy$
- Gauss-Markov: OLS is BLUE under homoscedastic, uncorrelated errors
- OLS = MLE under Gaussian noise; ridge = MAP with a Gaussian prior
- Residuals are orthogonal to every predictor: $X^\top e = 0$
Exercises
Problem
Show that the residual vector $e = y - X\hat{w}$ satisfies $X^\top e = 0$. What does this mean geometrically?
Problem
For ridge regression with penalty $\lambda \|w\|_2^2$, show that the solution can be written as $\hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$. What happens as $\lambda \to 0$ and as $\lambda \to \infty$?
Problem
Prove that if $y = Xw + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then the MLE for $w$ is exactly the OLS estimator $\hat{w} = (X^\top X)^{-1} X^\top y$. What is the MLE for $\sigma^2$?
References
Canonical:
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 3 (Linear Methods for Regression)
- Seber & Lee, Linear Regression Analysis (2003), Chapters 3-4 (projection and hat matrix properties)
- Wasserman, All of Statistics (2004), Chapter 13
- Golub & Van Loan, Matrix Computations (4th ed., 2013), Chapter 5 (orthogonal projections and QR decomposition for OLS)
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11 (linear models, Bayesian interpretation)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3 (Linear Models for Regression)
- Strang, Introduction to Linear Algebra (5th ed., 2016), Chapter 4 (orthogonality and projections)
Next Topics
The natural next steps from linear regression:
- Ridge regression: what happens when you trade bias for variance
- Logistic regression: extending the linear model to classification
- Bias-variance tradeoff: the general principle behind regularization
- Inner product spaces and orthogonality: the abstract framework that makes the projection interpretation rigorous
- Positive semidefinite matrices: the hat matrix and the covariance matrix are both PSD
Last reviewed: May 4, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Matrix Operations and Properties (layer 0A, tier 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B, tier 1)
- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) (layer 0B, tier 1)
- Naive Bayes (layer 1, tier 2)
Derived topics
- Analysis of Variance (layer 1, tier 1)
- Data Preprocessing and Feature Engineering (layer 1, tier 1)
- Logistic Regression (layer 1, tier 1)
- Ridge Regression (layer 1, tier 1)
- Bayesian Linear Regression (layer 2, tier 1)
+13 more on the derived-topics page.