
Gaussian Processes for Machine Learning

A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.

Advanced · Tier 3 · Stable · Supporting · ~75 min

Why This Matters

A Gaussian process is not just a nonlinear regressor with error bars. It is one of the cleanest places in ML where Bayesian inference, kernel methods, and numerical linear algebra all meet in closed form. When the likelihood is Gaussian, you get an exact posterior over functions rather than a point fit plus an ad hoc uncertainty heuristic.

That matters whenever decisions depend on uncertainty rather than raw point accuracy alone: Bayesian optimization, active experimental design, scientific surrogate modeling, and safe control are all settings where the question is not only "what do we predict?" but also "how unsure should we be here?".

The topic matters for theory too. The GP posterior mean is exactly the same point predictor as kernel ridge regression, so GPs give a Bayesian semantics to kernel methods. And in the infinite-width limit, they sit on the road toward the neural tangent kernel.

[Interactive visual: "Prior to Posterior Function Band". Observations collapse a broad kernel prior into a posterior mean with narrow variance near data and wide variance away from it. Left panel ("Prior functions"): before data, the kernel defines a family of plausible smooth functions. Right panel ("Posterior mean and uncertainty"): the band stays tight near observed points and widens where the model has not seen data. A slider controls the length scale (default $\ell = 1.0$).]

Bridge to learning theory

The posterior mean is the same point predictor as kernel ridge regression; the GP adds uncertainty on top.

Mental Model

Imagine the space of all smooth functions from $\mathbb{R}^d$ to $\mathbb{R}$. A GP defines a probability distribution over this space. Before seeing any data, you have a prior: many functions are plausible. After observing data points $(x_i, y_i)$, the posterior concentrates on functions that pass near the data. At points far from any observation, the posterior is wide (high uncertainty). Near observations, it is narrow (low uncertainty).

The kernel is the real prior object. It says which function behaviors should be considered similar before any data arrive. A short length scale says nearby inputs can still vary sharply. A long length scale says the function should move coherently over larger regions. A linear kernel says the whole prior lives in the space of affine trends. The posterior then updates that geometric prior using the observed data.
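To make the "distribution over functions" picture concrete, here is a minimal sketch, assuming numpy and a hypothetical helper name `rbf_kernel` (none of this comes from the page itself), that draws sample functions from a zero-mean GP prior with an RBF kernel. Shorter length scales produce wigglier draws.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    # Squared-exponential kernel: k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2))
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / lengthscale**2)

xs = np.linspace(0.0, 10.0, 200)                # grid of input locations
K = rbf_kernel(xs, xs, lengthscale=1.0)
# Tiny jitter keeps the Cholesky factorization stable on a near-singular matrix.
L = np.linalg.cholesky(K + 1e-6 * np.eye(xs.size))
rng = np.random.default_rng(0)
prior_draws = L @ rng.standard_normal((xs.size, 5))   # 5 functions from the prior
```

Each column of `prior_draws` is one plausible function under the prior; conditioning on data (below) is what turns this cloud of curves into a posterior band.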

Reading the Kernel as a Prior

The kernel is not bookkeeping. It is a compact way to state what kinds of functions the model considers plausible before seeing data.

Kernel change | Prior belief about functions | Posterior consequence
smaller length scale $\ell$ | function values decorrelate quickly in input space | interpolation becomes more local; uncertainty can rise sharply between points
larger signal variance $\sigma_f^2$ | larger-amplitude deviations are plausible a priori | posterior bands stay wider away from observations
larger noise variance $\sigma_n^2$ | observations are trusted less | posterior mean smooths through the data instead of interpolating them
linear kernel | only linear trends are allowed | GP regression reduces to Bayesian linear regression
Matérn with smaller $\nu$ | rougher sample paths are plausible | posterior functions can be less smooth than under RBF

Formal Setup and Notation

Definition

Gaussian Process

A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by:

  • A mean function $m(x) = \mathbb{E}[f(x)]$
  • A covariance function (kernel) $k(x, x') = \operatorname{Cov}(f(x), f(x'))$

We write $f \sim \mathcal{GP}(m, k)$. For any finite set of inputs $X = \{x_1, \ldots, x_n\}$:

$$\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top \sim \mathcal{N}(\mathbf{m}, K)$$

where $\mathbf{m}_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.

Common choices for the kernel (the first three are sketched in code after this list):

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Produces infinitely smooth functions. The length scale $\ell$ controls how quickly correlations decay.
  • Matérn kernel: a family $k(x, x')$ with a smoothness parameter $\nu$. At $\nu = 1/2$ it gives the Ornstein-Uhlenbeck process; as $\nu \to \infty$ it recovers the squared exponential.
  • Linear kernel: $k(x, x') = x^\top x'$. Gives Bayesian linear regression with functions through the origin; to recover full Bayesian linear regression with an intercept, add a constant kernel $\sigma_b^2$, i.e. use $k(x, x') = x^\top x' + \sigma_b^2$.
  • Periodic kernel: encodes repeating structure, useful when the signal is expected to recur with a fixed period.
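A minimal sketch of the first three kernels on one-dimensional inputs; the function names are my own, not from any particular library:

```python
import numpy as np

def rbf(x1, x2, ell=1.0, sf2=1.0):
    # Squared exponential: infinitely smooth sample paths.
    d = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def matern12(x1, x2, ell=1.0, sf2=1.0):
    # Matern with nu = 1/2, i.e. the Ornstein-Uhlenbeck kernel: rough paths.
    d = np.abs(x1[:, None] - x2[None, :])
    return sf2 * np.exp(-d / ell)

def linear(x1, x2, sb2=1.0):
    # x * x' plus a constant sb2: Bayesian linear regression with an intercept.
    return x1[:, None] * x2[None, :] + sb2

x = np.linspace(-3.0, 3.0, 7)
print(rbf(x, x).shape)   # (7, 7) Gram matrix; same shape for the other kernels
```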

Core Definitions

Definition

Observation Model

We observe noisy function values:

$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

Given $n$ observations $\mathbf{y} = [y_1, \ldots, y_n]^\top$ at inputs $X = [x_1, \ldots, x_n]$, the joint distribution of the observed values and the function value at a new test point $x_*$ is:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mathbf{m} \\ m_* \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix} \right)$$

where $\mathbf{k}_* = [k(x_1, x_*), \ldots, k(x_n, x_*)]^\top$ and $k_{**} = k(x_*, x_*)$.
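As a sketch with assumed toy numbers, the joint covariance above is just four kernel blocks stacked together:

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.array([0.0, 0.5, 1.5])    # training inputs
xs = np.array([1.0])             # single test input x_*
sn2 = 0.1                        # noise variance sigma_n^2

# [[K + sigma_n^2 I, k_*], [k_*^T, k_**]] assembled blockwise:
top = np.hstack([rbf(X, X) + sn2 * np.eye(X.size), rbf(X, xs)])
bottom = np.hstack([rbf(xs, X), rbf(xs, xs)])
joint_cov = np.vstack([top, bottom])   # (4, 4), symmetric positive definite
```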

Main Theorems

Theorem

GP Posterior Predictive Distribution

Statement

Given training data $(X, \mathbf{y})$ and a test input $x_*$, the posterior predictive distribution is Gaussian:

$$f_* \mid X, \mathbf{y}, x_* \sim \mathcal{N}(\bar{f}_*, \operatorname{Var}(f_*))$$

with:

$$\bar{f}_* = m(x_*) + \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$$

$$\operatorname{Var}(f_*) = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$

The posterior mean $\bar{f}_*$ is the best prediction under squared error. The posterior variance $\operatorname{Var}(f_*)$ quantifies the uncertainty at $x_*$.

For a noisy observation $y_* = f_* + \epsilon_*$, the predictive variance is $\operatorname{Var}(y_*) = \operatorname{Var}(f_*) + \sigma_n^2$.

Intuition

The posterior mean is a weighted combination of the observed values, where the weights come from the kernel similarities to the test point, adjusted by the training data correlations. The posterior variance starts at the prior variance $k_{**}$ and is reduced by the information provided by nearby training points. Far from any training point, $\operatorname{Var}(f_*) \approx k_{**}$ (back to the prior). Near training points, the variance shrinks.
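This intuition is easy to check numerically. A sketch with assumed toy data (zero prior mean, RBF kernel): the variance is small at a test point near the data and close to the prior variance far away.

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.array([0.0, 1.0, 2.0])            # training inputs
y = np.sin(X)                            # training targets
sn2 = 0.1
Ky = rbf(X, X) + sn2 * np.eye(X.size)

Xs = np.array([1.0, 10.0])               # one test point near data, one far away
Ks = rbf(X, Xs)                          # columns are k_* for each test point

mean = Ks.T @ np.linalg.solve(Ky, y)
var = rbf(Xs, Xs).diagonal() - np.einsum("ij,ij->j", Ks, np.linalg.solve(Ky, Ks))
print(mean)  # tracks sin near x=1; reverts to the prior mean 0 at x=10
print(var)   # small at x=1; approximately the prior variance 1.0 at x=10
```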

Proof Sketch

This follows directly from the formula for conditioning in a multivariate Gaussian. If $[a, b]^\top \sim \mathcal{N}(\mu, \Sigma)$ with block structure, then $a \mid b \sim \mathcal{N}(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})$. Apply this with $a = f_*$ and $b = \mathbf{y}$.

Why It Matters

This is one of the few cases in machine learning where you get a closed-form posterior distribution over predictions. No MCMC, no variational inference, no approximations (for Gaussian likelihood). The uncertainty estimates are exact and calibrated under the model assumptions.

Failure Mode

The posterior is exact only for Gaussian likelihoods. For classification (Bernoulli likelihood) or robust regression (heavy-tailed noise), the posterior is no longer Gaussian and you need approximations (Laplace, EP, or variational methods). Also, the uncertainty is calibrated under the model, which may not match reality if the kernel is misspecified.

Proposition

GP Posterior Mean Equals Kernel Ridge Regression

Statement

The GP posterior mean function $\bar{f}(x) = \mathbf{k}(x)^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ is identical to the kernel ridge regression solution:

$$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_k}^2$$

with regularization parameter $\lambda = \sigma_n^2/n$ and the RKHS norm induced by the kernel $k$.

Intuition

The GP and kernel ridge regression give the same point predictions. The GP adds uncertainty quantification on top. This means every time you do kernel ridge regression, you are computing the same predictor that the GP posterior mean would produce. If your optimization objective is written without the $1/n$ normalization in front of the squared loss, the corresponding ridge parameter is $\lambda = \sigma_n^2$ instead.

Proof Sketch

By the representer theorem, the kernel ridge regression solution has the form $\hat{f}(x) = \sum_i \alpha_i k(x, x_i)$. Setting the gradient of the regularized objective to zero gives $\alpha = (K + n\lambda I)^{-1} \mathbf{y}$. With $\lambda = \sigma_n^2/n$, this is $(K + \sigma_n^2 I)^{-1} \mathbf{y}$, matching the GP posterior mean weights.
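The identity is easy to confirm numerically; a sketch with synthetic data (all names assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=8)
y = np.sin(X) + 0.1 * rng.standard_normal(8)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # RBF with ell = sf2 = 1

sn2 = 0.25
n = X.size
alpha_gp = np.linalg.solve(K + sn2 * np.eye(n), y)       # GP posterior mean weights
lam = sn2 / n                                            # KRR regularizer
alpha_krr = np.linalg.solve(K + n * lam * np.eye(n), y)  # representer coefficients
assert np.allclose(alpha_gp, alpha_krr)                  # identical predictors
```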

Why It Matters

This connection is one of the most important bridges in ML theory. It unifies the frequentist (regularization) and Bayesian (GP prior) perspectives. It also means that theoretical results for kernel methods (like generalization bounds) apply to the GP posterior mean. The Bayesian uncertainty from a GP, however, is not a property of kernel ridge regression: KRR returns only a point predictor with no posterior. The predictive variance comes from the GP prior plus the Gaussian likelihood, and recovering it from the regularized optimization view requires an extra Bayesian assumption beyond what KRR uses.

Failure Mode

The equivalence holds for the posterior mean only. Kernel ridge regression gives no uncertainty estimates. If you need error bars, you need the full GP posterior, not just the mean.

Proposition

Log Marginal Likelihood Balances Fit and Complexity

Statement

Let $K_y = K + \sigma_n^2 I$. Then the log marginal likelihood of the observed targets is

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi).$$

The first term is a data-fit term in the Mahalanobis geometry induced by the kernel. The second term is a complexity penalty coming from the volume of the function family supported by the prior.

Intuition

This is the GP version of Occam's razor. A kernel/hyperparameter choice is rewarded when it fits the observed targets, but it is penalized when it makes a large volume of functions plausible for no payoff in fit. The log determinant is not "number of parameters"; it is a function-space complexity term.

Proof Sketch

Under the Gaussian prior and Gaussian observation model, the observed targets have distribution $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K_y)$ after integrating out the latent function values. Taking the log density of that multivariate normal yields the formula.

Why It Matters

This gives a principled objective for choosing kernel hyperparameters. Unlike cross-validation, it falls directly out of the probabilistic model and can be differentiated exactly.

Failure Mode

Maximizing marginal likelihood only chooses the best model inside the chosen kernel family. If the kernel is misspecified, the evidence can still pick a bad length scale or noise level with great confidence.

Marginal Likelihood for Hyperparameter Selection

The kernel has hyperparameters (length scale $\ell$, signal variance $\sigma_f^2$, noise variance $\sigma_n^2$). The GP framework provides a principled way to set them: maximize the marginal likelihood (also called the evidence):

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log(2\pi)$$

The first term penalizes data misfit. The second term penalizes model complexity (it is the log-determinant of the covariance matrix, which grows with the effective number of parameters). This automatic Occam's razor is one of the most appealing features of GPs.
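A minimal sketch of this objective (helper name assumed), computed stably through a Cholesky factor rather than an explicit inverse or determinant:

```python
import numpy as np

def log_marginal_likelihood(K, y, sn2):
    n = y.size
    L = np.linalg.cholesky(K + sn2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y via two solves
    fit = -0.5 * y @ alpha                                # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))              # -(1/2) log|K_y|
    return fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

# Usage: compare two candidate length scales on toy data.
X = np.linspace(0.0, 5.0, 10)
y = np.sin(X)
for ell in (0.3, 1.0):
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
    print(ell, log_marginal_likelihood(K, y, sn2=0.05))
```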

For implementation, the useful gradient formula is

$$\frac{\partial}{\partial \eta}\log p(\mathbf{y} \mid X) = \frac{1}{2}\alpha^\top \frac{\partial K_y}{\partial \eta}\alpha - \frac{1}{2}\operatorname{tr}\!\left(K_y^{-1}\frac{\partial K_y}{\partial \eta}\right), \qquad \alpha = K_y^{-1}\mathbf{y}.$$

The first term rewards hyperparameter changes that improve fit to the observed targets. The trace term penalizes changes that make the covariance geometry too complex.
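A sketch that checks this formula against finite differences for the RBF length scale; the derivative $\partial K/\partial \ell$ below is specific to that kernel, and all names are assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 4.0, size=6)
y = np.sin(X)
sn2, ell = 0.1, 1.3
D2 = (X[:, None] - X[None, :]) ** 2           # squared pairwise distances

def lml(ell):
    Ky = np.exp(-0.5 * D2 / ell**2) + sn2 * np.eye(X.size)
    sign, logdet = np.linalg.slogdet(Ky)
    return (-0.5 * y @ np.linalg.solve(Ky, y) - 0.5 * logdet
            - 0.5 * X.size * np.log(2.0 * np.pi))

K = np.exp(-0.5 * D2 / ell**2)
Ky = K + sn2 * np.eye(X.size)
dK = K * D2 / ell**3                          # dK/d ell for the RBF kernel
alpha = np.linalg.solve(Ky, y)
grad = 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(np.linalg.solve(Ky, dK))

eps = 1e-6
fd = (lml(ell + eps) - lml(ell - eps)) / (2 * eps)
assert np.isclose(grad, fd, rtol=1e-4)        # analytic gradient matches
```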

Numerical Reality: Solve, Don't Invert

In formulas, GP inference is written with $(K + \sigma_n^2 I)^{-1}$. In actual code, you should almost never form that inverse explicitly.

Let $K_y = K + \sigma_n^2 I + \varepsilon I$, where $\varepsilon I$ is a tiny numerical jitter term used only to stabilize the factorization. Then the standard exact pipeline (sketched in code after this list) is:

  1. Compute a Cholesky factorization $K_y = LL^\top$.
  2. Solve $Lz = \mathbf{y}$ and then $L^\top \alpha = z$.
  3. Use $\alpha$ to evaluate the posterior mean $\bar{f}(x_*) = \mathbf{k}_*^\top \alpha$.
  4. Reuse triangular solves to compute predictive variances and the log determinant via $\log |K_y| = 2\sum_i \log L_{ii}$.
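The pipeline, sketched with scipy's triangular solves (the function name and signature are my own):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict(K, y, Ks, kss_diag, sn2, jitter=1e-8):
    """Exact GP prediction via Cholesky. Ks has shape (n_train, n_test)."""
    n = y.size
    L = cholesky(K + (sn2 + jitter) * np.eye(n), lower=True)  # step 1: K_y = L L^T
    z = solve_triangular(L, y, lower=True)                    # step 2: L z = y
    alpha = solve_triangular(L.T, z, lower=False)             #         L^T alpha = z
    mean = Ks.T @ alpha                                       # step 3: k_*^T alpha
    V = solve_triangular(L, Ks, lower=True)                   # step 4: reuse solves
    var = kss_diag - np.sum(V**2, axis=0)                     # k_** - k_*^T K_y^{-1} k_*
    log_det = 2.0 * np.sum(np.log(np.diag(L)))                # log|K_y| from the factor
    return mean, var, log_det
```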

This matters because the hard part of a GP is really a numerical linear algebra problem on the kernel matrix, not a symbolic Bayesian derivation. When the system is ill-conditioned, the issue is often duplicated inputs, a pathological length scale, or an overconfident noise level rather than a failure of GP theory itself.

Computational Cost

The bottleneck is factorizing the $n \times n$ matrix $K_y$:

  • Exact inference: $O(n^3)$ time and $O(n^2)$ memory. This limits exact GPs to roughly $n \leq 10{,}000$ on modern hardware.
  • Sparse approximations: inducing-point methods reduce the cost to $O(nm^2)$, where $m \ll n$ is the number of inducing points.
  • Structured kernels: stationary kernels on grids can exploit Toeplitz or Kronecker structure for much faster solves.
  • Iterative methods: for large systems, conjugate gradient methods can replace dense direct solves when matrix-vector products are cheap (see the sketch after this list).
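A sketch of the iterative route. The kernel matrix is dense here only to illustrate the mechanics; the point of CG in practice is that structured or approximated kernels make the matrix-vector product cheap.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=1000)
y = np.sin(X)
sn2 = 0.1

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # dense only for illustration
A = LinearOperator((X.size, X.size),
                   matvec=lambda v: K @ v + sn2 * v,
                   dtype=np.float64)
alpha, info = cg(A, y)    # solves K_y alpha = y; info == 0 signals convergence
print(info)
```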

Canonical Examples

Example

GP regression on 1D data

Observe 10 noisy points from $f(x) = \sin(x)$ on $[0, 2\pi]$. Using a squared exponential kernel, the GP posterior mean closely tracks the sine curve where data is dense and reverts to the prior mean (zero) where data is sparse. The 95% credible interval (the error bars) is narrow near observations and wide in data-sparse regions. This is the textbook illustration of GP uncertainty quantification.
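A sketch reproducing this example numerically (constants assumed, plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 2.0 * np.pi, size=10)          # 10 noisy observations
y = np.sin(X) + 0.1 * rng.standard_normal(10)
sn2 = 0.01

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

Xs = np.linspace(0.0, 2.0 * np.pi, 200)             # dense test grid
Ky = rbf(X, X) + sn2 * np.eye(10)
Ks = rbf(X, Xs)

mean = Ks.T @ np.linalg.solve(Ky, y)
var = 1.0 - np.einsum("ij,ij->j", Ks, np.linalg.solve(Ky, Ks))
var = np.maximum(var, 0.0)                          # guard against roundoff
lower = mean - 1.96 * np.sqrt(var)                  # 95% credible band
upper = mean + 1.96 * np.sqrt(var)
```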

Common Confusions

Watch Out

GPs are nonparametric, but they have hyperparameters

A GP is nonparametric in the sense that the number of effective parameters grows with the data (the predictive function depends on all $n$ training points). But the kernel has a fixed, small number of hyperparameters. These control the shape of the prior over functions (smoothness, length scale, amplitude), not individual function values. Tuning them via marginal likelihood is model selection, not parameter estimation.

Watch Out

Latent uncertainty is not predictive noise

The posterior variance $\operatorname{Var}(f_*)$ is the uncertainty about the latent function value. If you want uncertainty for a new noisy observation $y_*$, you must add the observation noise: $\operatorname{Var}(y_*) = \operatorname{Var}(f_*) + \sigma_n^2$. Many plots silently switch between these two quantities; the distinction matters when you interpret error bars.

Watch Out

Jitter is not the observation noise

The observation noise $\sigma_n^2$ is part of the statistical model. Jitter $\varepsilon I$ is a tiny numerical stabilizer added so the matrix factorization does not blow up on nearly singular kernel matrices. Setting a huge jitter is not the same as modeling noisy data; it changes the numerics without giving a coherent probabilistic interpretation.

Watch Out

A GP does not necessarily interpolate the training labels

Exact interpolation happens only in the zero-noise model. With $\sigma_n^2 > 0$, the posterior mean is a smoothed compromise between the prior geometry and the observations. This is often what you want: fitting noise exactly is not a sign of Bayesian virtue.
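A two-case sketch of the contrast (toy values assumed): evaluate the posterior mean at the training inputs themselves and compare with the labels.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)

for sn2 in (1e-10, 0.5):
    mean_at_train = K @ np.linalg.solve(K + sn2 * np.eye(3), y)
    print(sn2, mean_at_train)  # ~= y as sn2 -> 0; shrunk toward the prior mean at 0.5
```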

Summary

  • A GP is a distribution over functions: $f \sim \mathcal{GP}(m, k)$
  • The posterior is a closed-form Gaussian for a Gaussian likelihood
  • The posterior mean $\mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{y}$ equals the kernel ridge regression solution
  • The posterior variance gives calibrated uncertainty (wide far from data, narrow near data)
  • The marginal likelihood provides automatic hyperparameter selection with a built-in Occam's razor
  • In practice, exact inference is a Cholesky-solve problem, not an explicit matrix inverse
  • Main limitation: $O(n^3)$ exact inference cost

Exercises

Exercise · Core

Problem

You have a GP with zero mean and a squared exponential kernel with $\ell = 1$, $\sigma_f^2 = 1$, and $\sigma_n^2 = 0.1$. You observe a single point $(x_1, y_1) = (0, 1)$. What are the posterior mean and variance at $x_* = 0$? At $x_* = 10$?

Exercise · Advanced

Problem

Prove that the GP posterior variance $\operatorname{Var}(f_*) = k_{**} - \mathbf{k}_*^\top(K + \sigma_n^2 I)^{-1}\mathbf{k}_*$ is always non-negative. Why is this not obvious from the formula?

Exercise · Research

Problem

Let $K_y(\eta)$ depend on a scalar hyperparameter $\eta$ (for example a length scale or noise variance), and define $\alpha = K_y^{-1}\mathbf{y}$. Prove that

$$\frac{\partial}{\partial \eta}\log p(\mathbf{y} \mid X) = \frac{1}{2}\alpha^\top \frac{\partial K_y}{\partial \eta}\alpha - \frac{1}{2}\operatorname{tr}\!\left(K_y^{-1}\frac{\partial K_y}{\partial \eta}\right).$$

Interpret the two terms.

References

Canonical:

  • Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Chapters 1-5
  • Williams & Rasmussen, "Gaussian Processes for Regression" (NeurIPS 1996)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 15
  • Bishop, Pattern Recognition and Machine Learning (2006), Section 6.4
  • Neal, Bayesian Learning for Neural Networks (1996), Chapter 2

Scalability and structure:

  • Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
  • Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)
  • Wilson & Nickisch, "Kernel interpolation for scalable structured GPs" (ICML 2015)
