Modern Generalization
Gaussian Processes for Machine Learning
A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.
Why This Matters
A Gaussian process is not just a nonlinear regressor with error bars. It is one of the cleanest places in ML where Bayesian inference, kernel methods, and numerical linear algebra all meet in closed form. When the likelihood is Gaussian, you get an exact posterior over functions rather than a point fit plus an ad hoc uncertainty heuristic.
That matters whenever decisions depend on uncertainty rather than raw point accuracy alone: Bayesian optimization, active experimental design, scientific surrogate modeling, and safe control are all settings where the question is not only "what do we predict?" but also "how unsure should we be here?".
This page matters for theory too. The GP posterior mean is exactly the same point predictor as kernel ridge regression, so GPs give a Bayesian semantics to kernel methods. And in the infinite-width limit, they sit on the road toward the neural tangent kernel.
Mental Model
Imagine the space of all smooth functions from $\mathbb{R}^d$ to $\mathbb{R}$. A GP defines a probability distribution over this space. Before seeing any data, you have a prior: many functions are plausible. After observing data points $(x_1, y_1), \dots, (x_n, y_n)$, the posterior concentrates on functions that pass near the data. At points far from any observation, the posterior is wide (high uncertainty). Near observations, it is narrow (low uncertainty).
The kernel is the real prior object. It says which function behaviors should be considered similar before any data arrive. A short length scale says nearby inputs can still vary sharply. A long length scale says the function should move coherently over larger regions. A linear kernel says the whole prior lives in the space of affine trends. The posterior then updates that geometric prior using the observed data.
Reading the Kernel as a Prior
The kernel is not bookkeeping. It is a compact way to state what kinds of functions the model considers plausible before seeing data.
| Kernel change | Prior belief about functions | Posterior consequence |
|---|---|---|
| smaller length scale | function values decorrelate quickly in input space | interpolation becomes more local; uncertainty can rise sharply between points |
| larger signal variance | larger-amplitude deviations are plausible a priori | posterior bands stay wider away from observations |
| larger noise variance | observations are trusted less | posterior mean smooths through the data instead of interpolating them |
| linear kernel | only linear trends are allowed | GP regression reduces to Bayesian linear regression |
| Matérn with smaller $\nu$ | rougher sample paths are plausible | posterior functions can be less smooth than under RBF |
Formal Setup and Notation
Gaussian Process
A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by:
- A mean function $m(x) = \mathbb{E}[f(x)]$
- A covariance function (kernel) $k(x, x') = \mathrm{Cov}(f(x), f(x'))$

We write $f \sim \mathcal{GP}(m, k)$. For any finite set of inputs $x_1, \dots, x_n$:

$$(f(x_1), \dots, f(x_n))^\top \sim \mathcal{N}(\mathbf{m}, K)$$

where $\mathbf{m}_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.
Common choices for the kernel:
- Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Produces infinitely smooth functions. Length scale $\ell$ controls how quickly correlations decay.
- Matérn kernel: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$ with $r = \|x - x'\|$ and a smoothness parameter $\nu$. At $\nu = 1/2$ it gives the Ornstein-Uhlenbeck process; as $\nu \to \infty$ it recovers the squared exponential.
- Linear kernel: $k(x, x') = x^\top x'$. Gives Bayesian linear regression with functions through the origin; to recover full Bayesian linear regression with an intercept, add a constant kernel $\sigma_0^2$, i.e. use $k(x, x') = \sigma_0^2 + x^\top x'$.
- Periodic kernel: $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right)$ encodes repeating structure, useful when the signal is expected to recur with a fixed period $p$.
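These kernels are one-liners in code. Below is a minimal NumPy sketch for 1D inputs (the function names and the jitter value are our choices, not any library's API), including a draw from the prior to illustrate that any finite set of inputs gets a joint Gaussian:

```python
import numpy as np

def rbf(x1, x2, ell=1.0, sf2=1.0):
    """Squared exponential kernel on 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def matern12(x1, x2, ell=1.0, sf2=1.0):
    """Matern kernel with nu = 1/2 (Ornstein-Uhlenbeck)."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sf2 * np.exp(-d / ell)

def periodic(x1, x2, ell=1.0, sf2=1.0, p=1.0):
    """Periodic kernel with period p."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sf2 * np.exp(-2.0 * np.sin(np.pi * d / p)**2 / ell**2)

# Draw prior samples: evaluate the kernel on a grid, then sample
# from the resulting multivariate normal.
rng = np.random.default_rng(0)
xs = np.linspace(0, 5, 50)
K = rbf(xs, xs) + 1e-8 * np.eye(len(xs))   # tiny jitter for stability
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
```

Note that the periodic kernel evaluated at inputs exactly one period apart returns the full signal variance, as the prior demands.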
Core Definitions
Observation Model
We observe noisy function values:

$$y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

Given observations $\mathbf{y} = (y_1, \dots, y_n)^\top$ at inputs $x_1, \dots, x_n$, the joint distribution of the observed values and the latent function value $f_* = f(x_*)$ at a new test point $x_*$ is:

$$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k(x_*, x_*) \end{pmatrix}\right)$$

where $K_{ij} = k(x_i, x_j)$ and $[\mathbf{k}_*]_i = k(x_i, x_*)$.
Main Theorems
GP Posterior Predictive Distribution
Statement
Given training data $(X, \mathbf{y})$ and a test input $x_*$, the posterior predictive distribution is Gaussian:

$$f_* \mid X, \mathbf{y}, x_* \sim \mathcal{N}(\mu_*, \sigma_*^2)$$

with:

$$\mu_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$

The posterior mean $\mu_*$ is the best prediction. The posterior variance $\sigma_*^2$ quantifies uncertainty at $x_*$.

For the noisy observation $y_*$, the predictive variance is $\sigma_*^2 + \sigma_n^2$.
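These two formulas translate directly into code. A hedged NumPy sketch for 1D inputs (illustrative hyperparameter values; `gp_predict` is our name, not a library function):

```python
import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def gp_predict(X, y, Xs, ell=1.0, sf2=1.0, sn2=0.1):
    """Exact GP posterior mean and latent variance at test inputs Xs."""
    K = rbf(X, X, ell, sf2) + sn2 * np.eye(len(X))
    Ks = rbf(X, Xs, ell, sf2)                    # n x m cross-covariances
    Kss = rbf(Xs, Xs, ell, sf2)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha                            # k*^T (K + sn2 I)^-1 y
    v = np.linalg.solve(K, Ks)
    var = np.diag(Kss) - np.sum(Ks * v, axis=0)  # k** - k*^T (K+sn2 I)^-1 k*
    return mu, var

X = np.array([-2.0, 0.0, 1.0])
y = np.sin(X)
mu, var = gp_predict(X, y, np.array([0.0, 10.0]))
# At x* = 0 (a training input) the variance is small;
# at x* = 10 it reverts toward the prior variance sf2 = 1.
```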
Intuition
The posterior mean is a weighted combination of the observed values, where the weights come from the kernel similarities to the test point, adjusted by the training data correlations. The posterior variance starts at the prior variance and is reduced by the information provided by nearby training points. Far from any training point, $\mathbf{k}_* \approx \mathbf{0}$ and $\sigma_*^2 \to k(x_*, x_*)$ (back to the prior). Near training points, the variance shrinks.
Proof Sketch
This follows directly from the formula for conditioning in a multivariate Gaussian. If $\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} A & C \\ C^\top & B \end{pmatrix}\right)$ with that block structure, then $\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\!\left(C^\top A^{-1} \mathbf{a},\; B - C^\top A^{-1} C\right)$. Apply this with $\mathbf{a} = \mathbf{y}$ and $\mathbf{b} = f_*$.
Why It Matters
This is one of the few cases in machine learning where you get a closed-form posterior distribution over predictions. No MCMC, no variational inference, no approximations (for Gaussian likelihood). The uncertainty estimates are exact and calibrated under the model assumptions.
Failure Mode
The posterior is exact only for Gaussian likelihoods. For classification (Bernoulli likelihood) or robust regression (heavy-tailed noise), the posterior is no longer Gaussian and you need approximations (Laplace, EP, or variational methods). Also, the uncertainty is calibrated under the model, which may not match reality if the kernel is misspecified.
GP Posterior Mean Equals Kernel Ridge Regression
Statement
The GP posterior mean function is identical to the kernel ridge regression solution:

$$\mu(\cdot) = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

with regularization parameter $\lambda = \sigma_n^2 / n$ and RKHS norm $\|\cdot\|_{\mathcal{H}_k}$ induced by the kernel $k$.
Intuition
The GP and kernel ridge regression give the same point predictions. The GP adds uncertainty quantification on top. This means every time you do kernel ridge regression, you are computing the same predictor that the GP posterior mean would produce. If your optimization objective is written without the $1/n$ normalization in front of the squared loss, the corresponding ridge parameter is $\lambda = \sigma_n^2$ instead.
Proof Sketch
By the representer theorem, the kernel ridge regression solution has the form $f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$. Setting the gradient of the regularized objective to zero gives $(K + n\lambda I)\boldsymbol{\alpha} = \mathbf{y}$. With $\lambda = \sigma_n^2 / n$, this is $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}$, matching the GP posterior mean weights.
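A quick numerical check of the identity, using the $1/n$-normalized objective so that $\lambda = \sigma_n^2 / n$ (plain NumPy sketch; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = rng.uniform(-3, 3, n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

K = rbf(X, X)
sn2 = 0.05
Xs = np.linspace(-3, 3, 7)
Ks = rbf(X, Xs)

# GP posterior mean: k*^T (K + sn2 I)^-1 y
gp_mean = Ks.T @ np.linalg.solve(K + sn2 * np.eye(n), y)

# Kernel ridge regression: representer weights solve (K + n*lambda I) a = y.
# With lambda = sn2 / n, the system matrix is exactly K + sn2 I.
lam = sn2 / n
krr_mean = Ks.T @ np.linalg.solve(K + n * lam * np.eye(n), y)
# gp_mean and krr_mean agree to machine precision.
```

The point of the sketch is the parameter mapping: once $\lambda = \sigma_n^2 / n$, the two linear systems are literally the same.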
Why It Matters
This connection is one of the most important bridges in ML theory. It unifies the frequentist (regularization) and Bayesian (GP prior) perspectives. It also means that theoretical results for kernel methods (like generalization bounds) apply to the GP posterior mean. The Bayesian uncertainty from a GP, however, is not a property of kernel ridge regression: KRR returns only a point predictor with no posterior. The predictive variance comes from the GP prior plus the Gaussian likelihood, and recovering it from the regularized optimization view requires an extra Bayesian assumption beyond what KRR uses.
Failure Mode
The equivalence holds for the posterior mean only. Kernel ridge regression gives no uncertainty estimates. If you need error bars, you need the full GP posterior, not just the mean.
Log Marginal Likelihood Balances Fit and Complexity
Statement
Let $K_y = K + \sigma_n^2 I$. Then the log marginal likelihood of the observed targets is

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log \det K_y - \frac{n}{2} \log 2\pi$$

The first term is a data-fit term in the Mahalanobis geometry induced by the kernel. The second term is a complexity penalty coming from the volume of the function family supported by the prior.
Intuition
This is the GP version of Occam's razor. A kernel/hyperparameter choice is rewarded when it fits the observed targets, but it is penalized when it makes a large volume of functions plausible for no payoff in fit. The log determinant is not "number of parameters"; it is a function-space complexity term.
Proof Sketch
Under the Gaussian prior and Gaussian observation model, the observed targets have distribution $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$ after integrating out the latent function values. Taking the log density of that multivariate normal yields the formula.
Why It Matters
This gives a principled objective for choosing kernel hyperparameters. Unlike cross-validation, it falls directly out of the probabilistic model and can be differentiated exactly.
Failure Mode
Maximizing marginal likelihood only chooses the best model inside the chosen kernel family. If the kernel is misspecified, the evidence can still pick a bad length scale or noise level with great confidence.
Marginal Likelihood for Hyperparameter Selection
The kernel has hyperparameters $\theta$ (length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$). The GP framework provides a principled way to set them: maximize the marginal likelihood (also called the evidence):

$$\hat{\theta} = \arg\max_\theta \log p(\mathbf{y} \mid X, \theta)$$

The first term of the log marginal likelihood penalizes data misfit. The second term penalizes model complexity (it is the log-determinant of the covariance matrix, which grows with the number of effective parameters). This automatic Occam's razor is one of the most appealing features of GPs.
For implementation, the useful gradient formula is

$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \theta) = \frac{1}{2} \mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right)$$

The first term rewards hyperparameter changes that improve fit to the observed targets. The trace term penalizes changes that make the covariance geometry too complex.
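As a sketch, here is this gradient for the RBF length scale, checked against a central finite difference (plain NumPy; $\sigma_f^2 = 1$ is held fixed, and the function names are our own):

```python
import numpy as np

def rbf(x, ell):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2 / ell**2)

def log_marglik(x, y, ell, sn2=0.1):
    """Log marginal likelihood via a Cholesky factorization."""
    n = len(x)
    Ky = rbf(x, ell) + sn2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))          # -0.5 * log det Ky
            - 0.5 * n * np.log(2 * np.pi))

def dlml_dell(x, y, ell, sn2=0.1):
    """Exact gradient w.r.t. the length scale, from the trace formula."""
    d = x[:, None] - x[None, :]
    Ky = rbf(x, ell) + sn2 * np.eye(len(x))
    dK = rbf(x, ell) * d**2 / ell**3              # dK/d(ell) for the RBF kernel
    Kinv_y = np.linalg.solve(Ky, y)
    fit = 0.5 * Kinv_y @ dK @ Kinv_y              # data-fit term
    complexity = 0.5 * np.trace(np.linalg.solve(Ky, dK))
    return fit - complexity

rng = np.random.default_rng(2)
x = rng.uniform(0, 4, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(15)

g = dlml_dell(x, y, ell=1.0)
h = 1e-5
g_fd = (log_marglik(x, y, 1.0 + h) - log_marglik(x, y, 1.0 - h)) / (2 * h)
# g and g_fd agree to finite-difference accuracy.
```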
Numerical Reality: Solve, Don't Invert
In formulas, GP inference is written with $(K + \sigma_n^2 I)^{-1}$. In actual code, you should almost never form that inverse explicitly.

Let $K_y = K + \sigma_n^2 I + \epsilon I$, where $\epsilon$ is a tiny numerical jitter term used only to stabilize factorization. Then the standard exact pipeline is:
- Compute a Cholesky factorization $K_y = LL^\top$.
- Solve $L\mathbf{z} = \mathbf{y}$ and then $L^\top \boldsymbol{\alpha} = \mathbf{z}$, so that $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$.
- Use $\boldsymbol{\alpha}$ to evaluate the posterior mean $\mu_* = \mathbf{k}_*^\top \boldsymbol{\alpha}$.
- Reuse triangular solves to compute predictive variances and the log determinant via $\log \det K_y = 2 \sum_i \log L_{ii}$.
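The pipeline above, as a minimal NumPy sketch (our function names, not a library API):

```python
import numpy as np

def gp_fit_predict(X, y, Xs, kernel, sn2, jitter=1e-8):
    """Exact GP regression via Cholesky solves; never forms an inverse."""
    n = len(X)
    Ky = kernel(X, X) + (sn2 + jitter) * np.eye(n)
    L = np.linalg.cholesky(Ky)                 # Ky = L L^T
    z = np.linalg.solve(L, y)                  # solve L z = y
    alpha = np.linalg.solve(L.T, z)            # solve L^T alpha = z
    mu = kernel(Xs, X) @ alpha                 # posterior mean k*^T alpha
    V = np.linalg.solve(L, kernel(X, Xs))      # reused triangular solve
    var = np.diagonal(kernel(Xs, Xs)) - np.sum(V**2, axis=0)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log det Ky from diag(L)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
    return mu, var, lml

# Toy usage with an RBF kernel on 1D inputs.
rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2)
X = np.array([0.0, 1.0, 2.5])
y = np.sin(X)
mu, var, lml = gp_fit_predict(X, y, np.array([1.0, 8.0]), rbf, sn2=0.01)
```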
This matters because the hard part of a GP is really a numerical linear algebra problem on the kernel matrix, not a symbolic Bayesian derivation. When the system is ill-conditioned, the issue is often duplicated inputs, a pathological length scale, or an overconfident noise level rather than a failure of GP theory itself.
Computational Cost
The bottleneck is factorizing the matrix :
- Exact inference: $O(n^3)$ time and $O(n^2)$ memory. This limits exact GPs to roughly $n \sim 10^4$ points on modern hardware.
- Sparse approximations: inducing-point methods reduce cost to $O(nm^2)$ where $m \ll n$ is the number of inducing points.
- Structured kernels: stationary kernels on grids can exploit Toeplitz or Kronecker structure for much faster solves.
- Iterative methods: for large systems, conjugate gradient methods can replace dense direct solves when matrix-vector products are cheap.
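To illustrate the iterative route, here is a conjugate-gradient solve that touches the kernel matrix only through matrix-vector products, using SciPy's `cg`. The matrix is dense here purely for demonstration; large-scale solvers substitute structured or lazily evaluated products for `matvec`:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(3)
n = 500
x = np.sort(rng.uniform(0, 10, n))
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)   # RBF kernel matrix
sn2 = 0.1
y = np.sin(x) + np.sqrt(sn2) * rng.standard_normal(n)

def matvec(v):
    # (K + sn2 I) v -- the only access pattern CG needs.
    return K @ v + sn2 * v

A = LinearOperator((n, n), matvec=matvec)
alpha, info = cg(A, y, atol=1e-10, maxiter=2000)   # info == 0 on convergence
```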
Canonical Examples
GP regression on 1D data
Observe 10 noisy points from $y = \sin(x)$ on $[0, 2\pi]$. Using a squared exponential kernel, the GP posterior mean closely tracks the sine curve where data is dense and reverts to the prior mean (zero) where data is sparse. The 95% credible interval (the error bars) is narrow near observations and wide in data-sparse regions. This is the textbook illustration of GP uncertainty quantification.
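A compact NumPy version of this example (the hyperparameters $\ell = 1$, $\sigma_f^2 = 1$, $\sigma_n^2 = 0.01$ are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 2 * np.pi, 10))     # 10 noisy observations of sin
y = np.sin(X) + 0.1 * rng.standard_normal(10)

k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2)
Ky = k(X, X) + 0.01 * np.eye(10)
Xs = np.linspace(0, 2 * np.pi, 100)

alpha = np.linalg.solve(Ky, y)
mu = k(Xs, X) @ alpha                          # posterior mean on the grid
V = np.linalg.solve(Ky, k(X, Xs))
var = 1.0 - np.sum(k(X, Xs) * V, axis=0)       # latent posterior variance
band = 1.96 * np.sqrt(np.maximum(var, 0.0))    # 95% credible half-width
# band is narrow near the observations and wide in data-sparse regions.
```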
Common Confusions
GPs are nonparametric, but they have hyperparameters
A GP is nonparametric in the sense that the number of effective parameters grows with the data (the predictive function depends on all training points). But the kernel has a fixed, small number of hyperparameters. These control the shape of the prior over functions (smoothness, length scale, amplitude), not individual function values. Tuning them via marginal likelihood is model selection, not parameter estimation.
Latent uncertainty is not predictive noise
The posterior variance $\sigma_*^2$ is the uncertainty about the latent function value $f(x_*)$. If you want uncertainty for a new noisy observation $y_* = f(x_*) + \varepsilon$, you must add the observation noise: $\sigma_*^2 + \sigma_n^2$. Many plots silently switch between these two quantities; the distinction matters when you interpret error bars.
Jitter is not the observation noise
The observation noise is part of the statistical model. Jitter is a tiny numerical stabilizer added so the matrix factorization does not blow up on nearly singular kernel matrices. Setting a huge jitter is not the same as modeling noisy data; it changes the numerics without giving a coherent probabilistic interpretation.
A GP does not necessarily interpolate the training labels
Exact interpolation happens only in the zero-noise model $\sigma_n^2 = 0$. With $\sigma_n^2 > 0$, the posterior mean is a smoothed compromise between the prior geometry and the observations. This is often what you want: fitting noise exactly is not a sign of Bayesian virtue.
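A small demonstration of the same point, with toy values of our choosing: as $\sigma_n^2 \to 0$ the posterior mean at the training inputs approaches the labels, while a larger $\sigma_n^2$ smooths through them:

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2)

def mean_at_train(sn2):
    # GP posterior mean evaluated back at the training inputs.
    return k(X, X) @ np.linalg.solve(k(X, X) + sn2 * np.eye(3), y)

noisy = mean_at_train(0.5)      # smooths through the data
clean = mean_at_train(1e-10)    # effectively interpolates the labels
```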
Summary
- A GP is a distribution over functions: $f \sim \mathcal{GP}(m, k)$
- Posterior is closed-form Gaussian for Gaussian likelihood
- Posterior mean equals kernel ridge regression
- Posterior variance gives calibrated uncertainty (wide far from data, narrow near data)
- Marginal likelihood provides automatic hyperparameter selection with built-in Occam's razor
- In practice, exact inference is a Cholesky-solve problem, not an explicit matrix inverse
- Main limitation: $O(n^3)$ exact inference cost
Exercises
Problem
You have a GP with zero mean and squared exponential kernel with length scale $\ell$, signal variance $\sigma_f^2$, and noise variance $\sigma_n^2$. You observe a single point $(x_1, y_1)$. What is the posterior mean and variance at $x_* = x_1$? At a point many length scales away from $x_1$?
Problem
Prove that the GP posterior variance is always non-negative. Why is this not obvious from the formula?
Problem
Let $K_y(\theta)$ depend on a scalar hyperparameter $\theta$ (for example a length scale or noise variance), and define $L(\theta) = \log p(\mathbf{y} \mid X, \theta)$. Prove that

$$\frac{dL}{d\theta} = \frac{1}{2} \mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta} K_y^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta}\right)$$

Interpret the two terms.
References
Canonical:
- Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Chapters 1-5
- Williams & Rasmussen, "Gaussian Processes for Regression" (NeurIPS 1996)
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 15
- Bishop, Pattern Recognition and Machine Learning (2006), Section 6.4
- Neal, Bayesian Learning for Neural Networks (1996), Chapter 2
Scalability and structure:
- Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
- Hensman, Fusi, Lawrence, "Gaussian processes for big data" (UAI 2013)
- Wilson & Nickisch, "Kernel interpolation for scalable structured GPs" (ICML 2015)
Next Topics
The natural next step from Gaussian processes:
- Bayesian Optimization for Hyperparameters: using GP posteriors to trade off exploration and exploitation
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Joint, Marginal, and Conditional Distributions (layer 0A · tier 1)
- Conjugate Priors (layer 0B · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
- Ridge Regression (layer 1 · tier 1)
Derived topics
- Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width (layer 4 · tier 1)
- Bayesian Optimization for Hyperparameters (layer 3 · tier 2)
- Gaussian Process Regression (layer 3 · tier 2)
- Bayesian Neural Networks (layer 3 · tier 3)
- Gaussian Processes in Astronomy (layer 4 · tier 3)