Principal Component Analysis
Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.
Why This Matters
PCA is the most widely used dimensionality reduction technique in all of data science. It appears everywhere: preprocessing for ML pipelines, visualization (project to 2D), noise reduction, feature extraction, genomics (population structure), finance (factor models), and image compression. Understanding PCA requires understanding the connection between three mathematical objects: the covariance matrix, its eigendecomposition, and the SVD of the data matrix — and knowing exactly how they relate.
Mental Model
You have data points in $\mathbb{R}^d$ and want to find a low-dimensional subspace that captures as much of the variation in the data as possible. PCA finds the directions (principal components) along which the data varies most, ordered by decreasing variance. Projecting onto the top $k$ principal components gives the best rank-$k$ approximation to the centered data.
Setup and Notation
Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with each row $x_i^\top$ a data point. Assume the data is centered: $\sum_{i=1}^n x_i = 0$. If not, subtract the mean first.
Sample Covariance Matrix
The sample covariance matrix is:
$$\Sigma = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d}.$$
This is symmetric positive semi-definite. Its eigenvalues are the variances along the principal directions.
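A minimal numpy sketch of these definitions; the data matrix, seed, and shapes below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cloud with anisotropic spread along the three axes.
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])

Xc = X - X.mean(axis=0)            # center: subtract the column means
Sigma = Xc.T @ Xc / len(Xc)        # sample covariance, 1/n convention

eigvals, eigvecs = np.linalg.eigh(Sigma)             # ascending for symmetric input
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort by decreasing variance

print(eigvals)                     # variances along the principal directions
assert np.all(eigvals >= -1e-10)   # PSD: eigenvalues are (numerically) nonnegative
```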
Principal Components
The principal components are the eigenvectors $v_1, \dots, v_d$ of $\Sigma$, ordered by decreasing eigenvalue $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. The $j$-th principal component direction $v_j$ is the direction of the $j$-th largest variance in the data, subject to being orthogonal to $v_1, \dots, v_{j-1}$.
PCA as Variance Maximization
PCA Maximizes Projected Variance
Statement
The first principal component solves:
$$v_1 = \arg\max_{\|v\|_2 = 1} v^\top \Sigma v.$$
The maximum value is $\lambda_1$, the largest eigenvalue of $\Sigma$. More generally, the $j$-th principal component maximizes projected variance subject to orthogonality to the first $j-1$ components.
Intuition
Among all unit directions in $\mathbb{R}^d$, $v_1$ is the one along which the data has the most spread (variance). The second PC is the direction of most remaining spread after removing the $v_1$ component, and so on. Each eigenvalue $\lambda_j$ tells you how much variance that direction captures.
Proof Sketch
We want $\max_{\|v\|_2 = 1} v^\top \Sigma v$. Form the Lagrangian $\mathcal{L}(v, \mu) = v^\top \Sigma v - \mu (v^\top v - 1)$. Setting $\nabla_v \mathcal{L} = 0$ gives $\Sigma v = \mu v$, an eigenvalue equation. The objective at any eigenvector $v_j$ equals its eigenvalue $\lambda_j$. So the maximum is achieved at $v_1$, with value $\lambda_1$. For subsequent PCs, add orthogonality constraints and use induction.
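A numeric sanity check of this claim on assumed synthetic data: no random unit direction should achieve more projected variance than the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5)) * np.array([4.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(Sigma)
v1, lam1 = eigvecs[:, -1], eigvals[-1]     # top eigenpair

def projected_variance(v):
    return float(v @ Sigma @ v)            # v^T Sigma v for a unit vector v

# No random unit direction beats the top eigenvector.
for _ in range(1000):
    v = rng.normal(size=5)
    v /= np.linalg.norm(v)
    assert projected_variance(v) <= lam1 + 1e-9
print(projected_variance(v1), lam1)        # equal up to round-off
```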
Why It Matters
This shows PCA has a clear statistical interpretation: it finds the directions that explain the most variance in the data. The eigenvalues quantify exactly how much variance each component captures, enabling the scree plot for choosing the number of components.
Failure Mode
PCA maximizes variance, not necessarily "importance." If the signal lives in a low-variance direction (e.g., a rare but meaningful feature), PCA will discard it in favor of high-variance noise. PCA also only captures linear structure; it cannot find nonlinear manifolds.
Worked Derivation: PCA as Reconstruction Error
The variance-maximization statement is the short route. The reconstruction route is the one that often feels harder: it starts from a squared-error projection objective, expands it with trace identities, and ends at the same eigenvector problem.
Use a unit direction $v \in \mathbb{R}^d$ with $\|v\|_2 = 1$. The score vector $z = Xv \in \mathbb{R}^n$ gives one coordinate per example. The reconstruction back in the original feature space is:
$$\hat{X} = X v v^\top.$$
The one-dimensional PCA reconstruction problem is:
$$\min_{\|v\|_2 = 1} \|X - X v v^\top\|_F^2.$$
Using $\|A\|_F^2 = \operatorname{tr}(A^\top A)$,
$$\|X - X v v^\top\|_F^2 = \operatorname{tr}\big((X - X v v^\top)^\top (X - X v v^\top)\big).$$
Expand the product:
$$\operatorname{tr}(X^\top X) - \operatorname{tr}(X^\top X v v^\top) - \operatorname{tr}(v v^\top X^\top X) + \operatorname{tr}(v v^\top X^\top X v v^\top).$$
The trace trick is only bookkeeping. It lets scalar expressions move into a shape that is easy to compare:
$$v^\top X^\top X v = \operatorname{tr}(v^\top X^\top X v) = \operatorname{tr}(X^\top X v v^\top),$$
using cyclicity, $\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$.
The first term is constant in $v$, so it does not affect the minimizer. The two middle terms are equal by cyclicity of trace. The last term simplifies because $v$ has unit norm:
$$\operatorname{tr}(v v^\top X^\top X v v^\top) = \operatorname{tr}(v^\top X^\top X v \, v^\top v) = v^\top X^\top X v.$$
So the variable-dependent part is:
$$\|X - X v v^\top\|_F^2 = \operatorname{tr}(X^\top X) - v^\top X^\top X v.$$
Minimizing reconstruction error is therefore equivalent to:
$$\max_{\|v\|_2 = 1} v^\top X^\top X v.$$
By the Rayleigh-Ritz theorem, the maximizing unit vector is an eigenvector of $X^\top X$ with the largest eigenvalue. This is the first principal component. Because $\Sigma = \frac{1}{n} X^\top X$, the same vector is also the top eigenvector of the sample covariance matrix; scaling by $\frac{1}{n}$ changes eigenvalues, not eigenvectors.
For $k$ components, write $V_k = [v_1, \dots, v_k] \in \mathbb{R}^{d \times k}$ with $V_k^\top V_k = I_k$. The reconstruction is $\hat{X} = X V_k V_k^\top$, and the same calculation gives:
$$\|X - X V_k V_k^\top\|_F^2 = \operatorname{tr}(X^\top X) - \operatorname{tr}(V_k^\top X^\top X V_k).$$
Let $M = X^\top X$, with eigenvalues $\mu_1 \ge \cdots \ge \mu_d \ge 0$ and orthonormal eigenvectors $u_1, \dots, u_d$. This is the exercise often left after the one-vector derivation: choose the best orthonormal $k$-frame. In the eigenbasis of $M$, write $M = \sum_{j=1}^d \mu_j u_j u_j^\top$. Then
$$\operatorname{tr}(V_k^\top M V_k) = \sum_{j=1}^d \mu_j w_j,$$
where $w_j = \|V_k^\top u_j\|_2^2 \in [0, 1]$, so $\sum_j w_j = \operatorname{tr}(V_k V_k^\top) = k$ and $0 \le w_j \le 1$. Because the eigenvalues are sorted, this weighted sum is maximized by placing weight $1$ on the first $k$ eigenvalues and $0$ on the rest. Thus $V_k = [u_1, \dots, u_k]$ maximizes the objective, with value $\sum_{j=1}^k \mu_j$. This is the Ky Fan/Rayleigh-Ritz argument; the same idea can be read as induction on orthogonal complements.
Geometrically, one-component PCA picks the line through the mean with the smallest total squared reconstruction residuals. Top-$k$ PCA picks the $k$-dimensional subspace with the smallest total squared residuals under orthogonal projection.
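A quick numerical check of the identity derived above, on assumed synthetic data: the rank-$k$ reconstruction error equals $\operatorname{tr}(X^\top X)$ minus the sum of the top-$k$ eigenvalues of $X^\top X$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)
M = X.T @ X

mu, U = np.linalg.eigh(M)
mu, U = mu[::-1], U[:, ::-1]               # descending eigenvalues of X^T X

k = 2
Vk = U[:, :k]                              # top-k orthonormal frame
recon_error = np.linalg.norm(X - X @ Vk @ Vk.T, "fro") ** 2

# ||X - X Vk Vk^T||_F^2 = tr(X^T X) - sum of top-k eigenvalues
assert np.isclose(recon_error, np.trace(M) - mu[:k].sum())
```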
PCA via SVD
The SVD of the centered data matrix is $X = U S V^\top$, where $U \in \mathbb{R}^{n \times d}$ has orthonormal columns, $S = \operatorname{diag}(s_1, \dots, s_d)$ with $s_1 \ge \cdots \ge s_d \ge 0$, and $V \in \mathbb{R}^{d \times d}$ is orthogonal (thin SVD, assuming $n \ge d$).
The connection:
$$\Sigma = \frac{1}{n} X^\top X = \frac{1}{n} V S^\top S V^\top = V \left( \frac{S^2}{n} \right) V^\top.$$
So the right singular vectors $v_j$ are the principal component directions, and the squared singular values divided by $n$ are the eigenvalues of $\Sigma$: $\lambda_j = s_j^2 / n$.
In practice, always compute PCA via the SVD of $X$, not by forming $X^\top X$ and computing its eigendecomposition. The SVD is numerically more stable because forming $X^\top X$ squares the condition number: $\kappa(X^\top X) = \kappa(X)^2$.
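A sketch showing the two routes agree on assumed random data (eigenvectors match only up to sign, so the comparison takes absolute values):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
X -= X.mean(axis=0)
n = len(X)

# Route 1: SVD of the centered data matrix (preferred).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_svd = s**2 / n                          # eigenvalues of the covariance

# Route 2: eigendecomposition of the covariance (same answer, less stable).
lam_eig, V = np.linalg.eigh(X.T @ X / n)
lam_eig, V = lam_eig[::-1], V[:, ::-1]

assert np.allclose(lam_svd, lam_eig)
assert np.allclose(np.abs(Vt), np.abs(V.T))  # directions match up to sign
```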
Truncated PCA and Reconstruction
Truncated PCA
The rank-$k$ PCA approximation keeps only the top $k$ principal components. The projection of a data point $x \in \mathbb{R}^d$ onto the $k$-dimensional subspace is:
$$\hat{x} = V_k V_k^\top x,$$
where $V_k = [v_1, \dots, v_k] \in \mathbb{R}^{d \times k}$.
Eckart-Young Theorem (Best Low-Rank Approximation)
Statement
The best rank-$k$ approximation to $X$ in the Frobenius norm is the truncated SVD $X_k$:
$$X_k = U_k S_k V_k^\top = \arg\min_{\operatorname{rank}(A) \le k} \|X - A\|_F.$$
The reconstruction error is:
$$\|X - X_k\|_F^2 = \sum_{j > k} s_j^2,$$
where $s_j$ are the singular values of $X$ (equivalently, $n \sum_{j > k} \lambda_j$ in terms of the covariance eigenvalues).
Intuition
Truncated PCA gives you the closest rank-$k$ matrix to your data. The reconstruction error is exactly the sum of the squared singular values you threw away. This is why you look at the eigenvalue spectrum to decide how many components to keep.
Proof Sketch
By the SVD, $X = \sum_j s_j u_j v_j^\top$. Any rank-$k$ matrix can capture at most $k$ singular directions. The squared Frobenius norm of $X$ is $\|X\|_F^2 = \sum_j s_j^2$ (Parseval). Keeping the $k$ largest singular values minimizes the residual $\sum_{j > k} s_j^2$.
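A one-assertion check of the error formula on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 20))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
Xk = (U[:, :k] * s[:k]) @ Vt[:k]           # truncated SVD, rank k

# Frobenius reconstruction error = sum of discarded squared singular values.
assert np.isclose(np.linalg.norm(X - Xk, "fro") ** 2, (s[k:] ** 2).sum())
```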
Why It Matters
This theorem justifies truncated PCA as optimal dimensionality reduction in the least-squares sense. It also gives a precise formula for reconstruction error in terms of the discarded eigenvalues, which is what the scree plot visualizes.
Failure Mode
Optimality is in the Frobenius norm sense. If your downstream task cares about something other than squared reconstruction error (e.g., classification accuracy), PCA may not give the best low-dimensional representation.
Choosing the Number of Components
Scree plot: Plot eigenvalues versus index. Look for an "elbow" where the eigenvalues drop sharply and then level off. Keep components before the elbow.
Proportion of variance explained: Keep the smallest $k$ such that:
$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} \ge \tau, \quad \text{e.g. } \tau = 0.9.$$
Kaiser's rule: Keep components with $\lambda_j > \bar{\lambda}$ (the average eigenvalue). For PCA on the correlation matrix, this means $\lambda_j > 1$.
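A small helper implementing the proportion-of-variance rule above; the function name and the 0.90 default are illustrative choices, not from the text:

```python
import numpy as np

def choose_k(X, threshold=0.90):
    """Smallest k whose leading components explain >= threshold of total variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    lam = s**2 / len(Xc)                       # eigenvalues of the sample covariance
    explained = np.cumsum(lam) / lam.sum()     # cumulative proportion of variance
    # `explained` is nondecreasing, so searchsorted finds the first k that qualifies.
    return int(np.searchsorted(explained, threshold) + 1)
```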
Connection to Matrix Concentration
When does sample PCA approximate population PCA? If $x_1, \dots, x_n$ are i.i.d. with population covariance $\Sigma^\star$, then the sample covariance $\Sigma \to \Sigma^\star$ as $n \to \infty$. But the rate matters.
Matrix concentration inequalities (Vershynin 2018, Theorem 4.6.1) show that for sub-Gaussian data with population covariance $\Sigma^\star$,
$$\|\Sigma - \Sigma^\star\|_{\mathrm{op}} \lesssim \|\Sigma^\star\|_{\mathrm{op}} \left( \sqrt{\frac{d}{n}} + \frac{d}{n} \right)$$
with high probability. The $\sqrt{d/n}$ term dominates in the classical regime $d \ll n$ and matches the rate quoted in elementary references; the $d/n$ term takes over once $d$ is comparable to $n$ and is the right object to track in modern high-dimensional analyses. The Davis-Kahan theorem then bounds the angle between sample and population eigenvectors. PCA is reliable when $d/n$ is small and the eigenvalue gaps are large. In high-dimensional settings ($d \asymp n$ or $d \gg n$), sample PCA can be highly misleading: the top sample eigenvector may point in a completely wrong direction. This is one instance of the broader phenomenon where classical estimators break down when $d/n$ is not small; see double descent for analogous behavior in supervised learning.
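A simulation sketch of this breakdown, assuming a simple spiked covariance $\Sigma^\star = I + \text{spike} \cdot v^\star v^{\star\top}$ (the spike size and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def top_pc_alignment(n, d, spike=2.0):
    """|<v_hat, v*>| for data with covariance I + spike * v* v*^T."""
    v_star = np.zeros(d); v_star[0] = 1.0
    X = rng.normal(size=(n, d))
    X[:, 0] *= np.sqrt(1.0 + spike)           # inflate variance along v*
    Sigma_hat = X.T @ X / n
    v_hat = np.linalg.eigh(Sigma_hat)[1][:, -1]
    return abs(v_hat @ v_star)

print(top_pc_alignment(n=2000, d=20))   # near 1: classical regime, PCA reliable
print(top_pc_alignment(n=100, d=400))   # typically much smaller: high-dim regime
```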
Optional Extensions: Beyond Classical PCA
Classical PCA is a deterministic eigendecomposition of the sample covariance. Five common variants relax or replace pieces of that recipe and earn their place under different conditions. Each is shipped here as a self-contained mini-section so this page can serve as a single landing page for the family.
Probabilistic PCA (PPCA)
Tipping and Bishop (1999) gave PCA an explicit generative-model interpretation:
$$x = W z + \mu + \varepsilon,$$
with latent $z \sim \mathcal{N}(0, I_k)$, loading matrix $W \in \mathbb{R}^{d \times k}$, and isotropic noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. Marginalizing out $z$:
$$x \sim \mathcal{N}(\mu, \; W W^\top + \sigma^2 I_d).$$
The MLE for $W$ is $W_{\mathrm{ML}} = V_k (\Lambda_k - \sigma^2 I_k)^{1/2} R$, where $V_k$ holds the top-$k$ sample-covariance eigenvectors, $\Lambda_k$ the top-$k$ eigenvalues, and $R$ is an arbitrary orthogonal matrix (rotational ambiguity). The MLE for $\sigma^2$ is the average of the discarded eigenvalues, $\sigma^2_{\mathrm{ML}} = \frac{1}{d-k} \sum_{j=k+1}^d \lambda_j$. As $\sigma^2 \to 0$, the model collapses to classical PCA.
Why it matters. PPCA gives PCA a probabilistic vocabulary: a posterior over latents (useful for missing-data imputation), a marginal likelihood (model selection by Bayesian approaches), and natural EM fitting when data is partially observed. The MLE for is also a clean application of maximum-likelihood estimation to a latent-variable model. PPCA is the bridge from classical PCA to factor analysis, Gaussian-mixture latent-variable models, and Bayesian variants. Use PPCA when you need a generative prior over the data or have missing entries to handle.
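A sketch of the closed-form MLE above, fixing the rotation $R = I$ for concreteness (the function name and conventions are assumptions):

```python
import numpy as np

def ppca_mle(X, k):
    """Closed-form PPCA MLE (Tipping & Bishop 1999), with R = I."""
    Xc = X - X.mean(axis=0)
    lam, V = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    lam, V = lam[::-1], V[:, ::-1]                 # descending eigenpairs
    sigma2 = lam[k:].mean()                        # average discarded eigenvalue
    # W = V_k (Lambda_k - sigma^2 I)^{1/2}; clip to guard against round-off.
    W = V[:, :k] * np.sqrt(np.maximum(lam[:k] - sigma2, 0.0))
    return W, sigma2
```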
Kernel PCA
Schölkopf, Smola, and Müller (1998) extended PCA to nonlinear feature spaces via the kernel trick. Replace the inner product $\langle x_i, x_j \rangle$ with a kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some feature map $\phi$. PCA in feature space requires only the centered Gram matrix $\tilde{K} = H K H$ with $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, where $K_{ij} = k(x_i, x_j)$. Eigendecompose $\tilde{K}$ to get coefficients; the $j$-th nonlinear principal component evaluated at a new point $x$ is
$$g_j(x) = \sum_{i=1}^n \alpha_i^{(j)} \, \tilde{k}(x_i, x),$$
with $\alpha^{(j)}$ the (rescaled) $j$-th eigenvector of $\tilde{K}$.
Why it matters. Kernel PCA captures nonlinear structure (curved manifolds, clusters in feature space) that classical PCA cannot resolve in the original coordinates. The cost is $O(n^2)$ kernel evaluations and an eigendecomposition of the $n \times n$ Gram matrix rather than the $d \times d$ covariance. Use kernel PCA when the nonlinear structure is dominant and $n$ is moderate; for very large $n$, Nyström approximations or random-feature methods scale it.
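A self-contained sketch of the Gram-matrix route with an RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$; the kernel choice, $\gamma$, and rescaling convention are illustrative assumptions:

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel: projections of the training points."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                        # RBF Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H                            # double centering
    lam, A = np.linalg.eigh(K_tilde)
    lam, A = lam[::-1][:k], A[:, ::-1][:, :k]      # top-k eigenpairs
    # Rescale so each component has unit norm in feature space.
    A = A / np.sqrt(np.maximum(lam, 1e-12))
    return K_tilde @ A                             # (n, k) nonlinear scores
```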
Factor Analysis vs PCA
Factor analysis (FA) shares the $x = Wz + \mu + \varepsilon$ structure but allows anisotropic noise:
$$\varepsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_d).$$
Each variable has its own residual variance $\psi_j$. The loading matrix $W$ captures common variance shared across variables; $\Psi$ captures variable-specific noise.
Why it matters. PCA absorbs all variance into the top components, so a single high-variance noisy variable will dominate the first PC even if it shares no signal with the others. FA separates the shared (common) variance from the variable-specific (unique) variance. In psychometrics, signal-detection settings, and any case where variables have heterogeneous noise scales, FA is the right model and PCA is at best a fast approximation. Conversely, when the variables are roughly homoscedastic and you only want low-dimensional reconstruction, PCA is simpler and faster.
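A sketch of this failure mode using scikit-learn's `PCA` and `FactorAnalysis` on assumed synthetic data: one variable with large private noise captures PC1, while FA absorbs that noise into its uniqueness term instead.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=(n, 1))                         # one shared factor
X = z @ np.ones((1, 5)) + rng.normal(size=(n, 5))   # 5 variables load on it
X[:, 0] += rng.normal(scale=5.0, size=n)            # variable 0 gets huge private noise

pc1 = PCA(n_components=1).fit(X).components_[0]
fa1 = FactorAnalysis(n_components=1).fit(X).components_[0]

print(np.round(pc1, 2))   # expected: dominated by the noisy variable 0
print(np.round(fa1, 2))   # expected: roughly equal loadings (the shared factor)
```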
Sparse PCA
Classical PCA produces dense loading vectors: each PC mixes contributions from every original variable, which destroys interpretability when $d$ is large. Sparse PCA (Zou, Hastie, Tibshirani 2006) adds an $\ell_1$ penalty to the loading vector:
$$\max_{\|v\|_2 = 1} \; v^\top \Sigma v - \rho \|v\|_1.$$
The $\ell_1$ penalty zeros out small loadings; the resulting PC depends on a small subset of variables, making it interpretable. The objective is non-convex; solvers use alternating gradient methods or semidefinite relaxations.
Why it matters. Use sparse PCA in high-dimensional settings (genomics, text, network features) where you want a small set of identifiable variables driving each component. The costs are path-dependent, non-unique components (each sparse PC depends on the regularization strength $\rho$) and a non-convex optimization. Cross-validation on $\rho$ is standard. The $\ell_1$ penalty mechanism is the same as in lasso regression; sparse PCA is the unsupervised counterpart.
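An illustrative run with scikit-learn's `SparsePCA` (which optimizes a related dictionary-learning formulation of the $\ell_1$-penalized objective, not the exact program above); the data and `alpha` are assumptions:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 30))
X[:, :3] += 3.0 * rng.normal(size=(200, 1))   # shared signal in 3 variables
X -= X.mean(axis=0)

spca = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(X)
# Expected: few nonzero loadings, concentrated on the first three columns.
print(np.nonzero(spca.components_[0])[0])
```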
Robust PCA
Classical PCA's least-squares objective is sensitive to outliers: a single corrupted observation can rotate the top component substantially. Robust PCA (Candès, Li, Ma, Wright 2011) decomposes the data matrix as a sum of a low-rank component $L$ (the "true" signal) and a sparse component $S$ (the corruptions):
$$\min_{L, S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad X = L + S,$$
where $\|L\|_*$ is the nuclear norm (sum of singular values, a convex surrogate for rank) and $\|S\|_1$ is entry-wise. The decomposition recovers $(L, S)$ exactly under low-rank-plus-sparse identifiability conditions on $X$.
Why it matters. Use robust PCA when outliers are gross (entry-level corruption, missing-or-corrupt entries, or rare adversarial perturbations) rather than Gaussian noise. The classic application is video background-foreground separation: the static background is the low-rank $L$; the moving foreground is the sparse $S$. The low-rank component is computed via SVD inside the iterative solver, so robust PCA inherits the SVD-based intuition while replacing the simple top-$k$ truncation with a corruption-aware decomposition. Modern variants extend to streaming and tensor decompositions.
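A minimal sketch of principal component pursuit via a standard augmented-Lagrangian iteration; the parameter defaults follow common choices from the literature, but the implementation is illustrative, not production code:

```python
import numpy as np

def robust_pca(X, lam=None, mu=None, n_iter=200):
    """Principal component pursuit via a basic ALM sketch."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))        # choice suggested by Candes et al.
    if mu is None:
        mu = m * n / (4.0 * np.abs(X).sum())
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                      # dual variable

    def shrink(A, tau):                       # entry-wise soft-thresholding
        return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

    for _ in range(n_iter):
        # L-update: singular-value thresholding of X - S + Y/mu.
        U, s, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * shrink(s, 1.0 / mu)) @ Vt
        # S-update: soft-threshold the residual.
        S = shrink(X - L + Y / mu, lam / mu)
        Y = Y + mu * (X - L - S)              # dual ascent on X = L + S
    return L, S
```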
Summary table
| Variant | Replaces | Use when |
|---|---|---|
| PPCA | Deterministic eigendecomposition with a Gaussian latent-variable model | You need a generative prior, EM for missing data, or a marginal likelihood |
| Kernel PCA | Linear inner product with a kernel $k(x_i, x_j)$ | Nonlinear structure dominates and $n$ is moderate |
| Factor Analysis | Isotropic noise $\sigma^2 I$ with anisotropic noise $\Psi$ | Heterogeneous variable-specific noise scales |
| Sparse PCA | Dense loadings with $\ell_1$-penalized loadings | High-$d$ settings where interpretability matters |
| Robust PCA | Least-squares objective with low-rank-plus-sparse decomposition | Gross outliers (not Gaussian noise) |
Each variant is a separate research thread; the references at the end of this page point to the foundational papers.
Canonical Examples
PCA on 2D data
Suppose points in $\mathbb{R}^2$ form an elliptical cloud with major axis along some direction $u$ and minor axis along the orthogonal direction $u^\perp$. The first PC is the major-axis direction (most variance), and the second PC is the minor axis. Projecting onto just the first PC collapses the data to a line along the major axis, retaining the most variance possible in one dimension.
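A sketch of this example with an assumed rotation angle and axis lengths: the top right singular vector of the centered cloud recovers the major-axis direction up to sign.

```python
import numpy as np

rng = np.random.default_rng(8)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Elliptical cloud: std 5 along the major axis, 1 along the minor, then rotated.
X = (rng.normal(size=(1000, 2)) * np.array([5.0, 1.0])) @ R.T

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(Vt[0], R[:, 0])   # PC1 matches the major-axis direction up to sign
```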
Common Confusions
PCA and SVD are NOT the same thing
This is the most common confusion. SVD is a matrix factorization: $A = U S V^\top$ exists for any matrix. PCA is a statistical procedure that involves (1) centering the data and (2) finding directions of maximum variance. PCA uses the SVD (of the centered data matrix) as its computational engine, but they are different concepts. SVD does not center the data. If you run SVD on uncentered data, you do not get PCA. The eigenvalues of $\frac{1}{n} X^\top X$ for uncentered $X$ are second moments about the origin, not variances about the mean.
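A small demonstration, assuming synthetic data with a large mean offset along a low-variance axis: SVD of the raw matrix tracks the mean, while SVD of the centered matrix tracks the variance.

```python
import numpy as np

rng = np.random.default_rng(9)
# Small variance along axis 0 but a large mean offset; large variance along axis 1.
X = rng.normal(size=(500, 2)) * np.array([0.1, 1.0]) + np.array([10.0, 0.0])

v_svd = np.linalg.svd(X)[2][0]                   # SVD of uncentered data
v_pca = np.linalg.svd(X - X.mean(axis=0))[2][0]  # SVD of centered data = PCA

print(np.round(v_svd, 3))   # ~(1, 0): dominated by the mean offset
print(np.round(v_pca, 3))   # ~(0, 1): the true direction of maximum variance
```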
PCA components are not features
The principal components are linear combinations of the original features. The first PC is $v_1 = (v_{11}, \dots, v_{1d})^\top$, which mixes all original features: the score is $v_1^\top x = \sum_j v_{1j} x_j$. Interpreting what a principal component "means" requires examining the loadings $v_{1j}$. High-dimensional PCA components are often uninterpretable.
PCA is not every dimensionality reduction method
PCA does not answer the same question as LDA, NMF, t-SNE, or UMAP. PCA is linear, unsupervised, and variance/reconstruction based. LDA is supervised and uses labels to separate classes. Non-negative matrix factorization constrains factors to be nonnegative, which can give parts-based decompositions. t-SNE and UMAP are nonlinear visualization methods that emphasize local neighborhoods and may distort global distances.
Summary
- PCA finds directions of maximum variance: $v_1 = \arg\max_{\|v\|_2 = 1} v^\top \Sigma v$
- The reconstruction derivation reduces $\min_{\|v\|_2 = 1} \|X - X v v^\top\|_F^2$ to the same Rayleigh quotient
- Principal components = eigenvectors of $\Sigma$
- Eigenvalues = variances along principal directions
- Compute via SVD of $X$ (not eigendecomposition of $X^\top X$) for stability
- Eckart-Young: truncated SVD is the best low-rank approximation
- Reconstruction error = sum of discarded squared singular values ($n$ times the discarded eigenvalues)
- PCA requires centering; SVD alone does not center
Exercises
Problem
Show that the first principal component direction $v_1$ maximizes the projected variance $\frac{1}{n} \sum_{i=1}^n (v^\top x_i)^2$ subject to $\|v\|_2 = 1$. Use the method of Lagrange multipliers.
Problem
Prove the $k$-component version: if $V_k \in \mathbb{R}^{d \times k}$ satisfies $V_k^\top V_k = I_k$, then minimizing $\|X - X V_k V_k^\top\|_F^2$ is solved by taking the columns of $V_k$ to be the top $k$ eigenvectors of $X^\top X$.
Problem
Explain why computing PCA via the eigendecomposition of $X^\top X$ is numerically inferior to computing PCA via the SVD of $X$. What goes wrong in floating-point arithmetic?
References
Canonical:
- Jolliffe, Principal Component Analysis (2002), Chapters 1-3
- Strang, Linear Algebra and Its Applications (4th ed.), Ch. 6 (Positive Definite Matrices, SVD)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Ch. 2 (Linear Algebra, PCA derivation)
- Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211-218.
- Davis, C. & Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1), 1-46.
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 23
- Vershynin, High-Dimensional Probability (2018), Chapter 4 (for matrix concentration and sample PCA)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Ch. 14 (Unsupervised Learning)
- Bishop, Pattern Recognition and Machine Learning (2006), Ch. 12 (Continuous Latent Variables)
High-Dimensional PCA:
- Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2), 295-327. arXiv:math/0309281.
- Baik, J., Ben Arous, G., & Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5), 1643-1697. arXiv:math/0403022.
Beyond classical PCA (extension references):
- Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611-622. The PPCA paper; gives PCA the generative-model interpretation and the EM algorithm for missing data.
- Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319. Kernel PCA introduction.
- Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286. Sparse PCA via $\ell_1$-penalized loadings.
- Candès, E. J., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 1-37. Low-rank-plus-sparse decomposition; nuclear-norm relaxation.
- Bartholomew, Knott, Moustaki, Latent Variable Models and Factor Analysis (3rd ed., 2011). The graduate textbook on FA and its relationship to PCA.
Next Topics
The natural next step from PCA:
- Random Matrix Theory Overview: what happens to PCA in high dimensions
Last reviewed: May 4, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Eigenvalues and Eigenvectors (layer 0A · tier 1)
- Positive Semidefinite Matrices (layer 0A · tier 1)
- Singular Value Decomposition (layer 0A · tier 1)
- Tensors and Tensor Operations (layer 0A · tier 1)
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
Derived topics
- Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness (layer 4 · tier 1)
- Dimensionality Reduction Theory (layer 2 · tier 2)
- t-SNE and UMAP (layer 2 · tier 2)
- Whitening and Decorrelation (layer 2 · tier 2)
- Random Matrix Theory Overview (layer 4 · tier 2)
- +1 more on the derived-topics page.