
ML Methods

Gaussian Process Regression

Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.


Why This Matters

Gaussian process regression gives you two things that most regression methods do not: a posterior mean (point prediction) and a posterior variance (uncertainty estimate) at every test point, both in closed form. The uncertainty is calibrated: it is large where training data is sparse and small where data is dense. This is a Bayesian approach: place a prior over functions, condition on data, and read off the posterior.

This built-in uncertainty quantification makes GPs the standard surrogate model in Bayesian optimization, where you need to decide where to evaluate an expensive function next. The acquisition function depends on both the predicted value and the uncertainty, and GPs provide both.

Formal Setup

Assume we observe $n$ training points $(X, \mathbf{y})$, where $X = [x_1, \ldots, x_n]^\top$ and $\mathbf{y} = [y_1, \ldots, y_n]^\top$, with $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$.

Place a GP prior on $f$: $f \sim \mathcal{GP}(0, k(x, x'))$, where $k$ is the kernel (covariance function).

Definition

Kernel Matrix

The kernel matrix (Gram matrix) is $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. For a test point $x_*$, define the vector $\mathbf{k}_* = [k(x_*, x_1), \ldots, k(x_*, x_n)]^\top \in \mathbb{R}^n$ and the scalar $k_{**} = k(x_*, x_*)$.

Posterior and Marginal Likelihood

The posterior predictive distribution and the log marginal likelihood for GP regression are derived in the canonical Gaussian Processes for ML page; both follow from conditioning a multivariate Gaussian. The closed-form posterior mean is $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$, the posterior variance is $k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$, and the log marginal likelihood combines a data-fit term, a log-determinant complexity term, and a normalization constant.
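These closed forms are short to implement. A minimal numpy sketch, assuming a 1-D RBF kernel with illustrative hyperparameter values:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # k(x, x') = exp(-|x - x'|^2 / (2 ell^2)) for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, X_star, sigma_n=0.1, ell=1.0):
    K = rbf(X, X, ell)                              # n x n kernel matrix
    K_s = rbf(X, X_star, ell)                       # cross-covariances k_*
    K_ss = rbf(X_star, X_star, ell)                 # prior covariance at test points
    A = K + sigma_n**2 * np.eye(len(X))             # K + sigma_n^2 I
    alpha = np.linalg.solve(A, y)                   # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                            # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)    # posterior covariance
    return mean, cov

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, np.array([0.5, 3.0]))
```

The second test point sits outside the data, so its posterior variance is larger than at the interpolation point, matching the calibration behavior described above.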

This page is a special case of Gaussian processes for ML focused on regression with Gaussian likelihood. The classical results are stated there; this page covers the kriging / RBF connection, sparse approximations, and the numerical pipeline.

Kernel Choice

The kernel encodes prior assumptions about the function $f$:

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp(-\|x - x'\|^2 / (2\ell^2))$. Assumes $f$ is infinitely differentiable. The length scale $\ell$ controls smoothness.
  • Matérn-$\nu$: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)$. Controls differentiability via $\nu$: $\nu = 1/2$ gives Ornstein-Uhlenbeck (rough), $\nu = 3/2$ gives once-differentiable sample paths, and $\nu \to \infty$ recovers the RBF.
  • Periodic: for functions with known periodicity.
  • Sums and products of kernels are valid kernels, enabling compositional modeling.
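For 1-D inputs the kernels above, and the closure under sums and products, can be sketched as follows; the hyperparameter values are illustrative:

```python
import numpy as np

def rbf(r, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-r**2 / (2 * ell**2))

def matern32(r, ell=1.0, sf2=1.0):
    # Matérn with nu = 3/2: once-differentiable sample paths
    a = np.sqrt(3) * np.abs(r) / ell
    return sf2 * (1 + a) * np.exp(-a)

def periodic(r, ell=1.0, sf2=1.0, p=2 * np.pi):
    # exp-sine-squared form with period p
    return sf2 * np.exp(-2 * np.sin(np.pi * np.abs(r) / p) ** 2 / ell**2)

x = np.linspace(0, 4, 50)
R = x[:, None] - x[None, :]          # pairwise differences

# Sums and products of kernels are kernels: the resulting Gram matrices
# stay positive semidefinite (checked via eigenvalues, up to round-off).
K_sum = rbf(R) + periodic(R)
K_prod = rbf(R) * matern32(R)
assert np.linalg.eigvalsh(K_sum).min() > -1e-8
assert np.linalg.eigvalsh(K_prod).min() > -1e-8
```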

Marginal Likelihood for Hyperparameter Selection

For the explicit log marginal likelihood and its proof, see the Gaussian processes for ML page. In practice the marginal likelihood is non-convex in the hyperparameters and can have multiple local optima, especially with compositional kernels — different initializations can lead to different solutions.
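A hedged sketch of the log marginal likelihood for an RBF kernel, computed via a Cholesky factorization, with a crude grid search standing in for gradient-based optimization (in practice several restarts guard against the local optima mentioned above):

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sigma_n=0.1):
    # log p(y) = -1/2 y^T A^{-1} y - 1/2 log|A| - n/2 log(2 pi), A = K + sigma_n^2 I
    n = len(X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ell**2))
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # = -1/2 log|K + sigma_n^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 25)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(25)

grid = [0.05, 0.1, 0.3, 1.0, 3.0]              # illustrative length scales
best = max(grid, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

Very small length scales explain the data as independent noise (poor data fit), and very large ones cannot bend to follow $\sin(3x)$ at all; an intermediate $\ell$ wins.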

Sparse Gaussian Processes

Exact GP regression costs $O(n^3)$. Sparse GPs reduce this by selecting $m \ll n$ inducing points $Z = \{z_1, \ldots, z_m\}$ and approximating the full GP through these points.

The variational free energy (VFE, Titsias 2009) approach optimizes the inducing point locations and a variational distribution $q(\mathbf{u})$ over the function values at the inducing points. The resulting cost is $O(nm^2)$ for training and $O(m^2)$ for prediction.

The FITC (fully independent training conditional) approximation assumes conditional independence of training points given the inducing points. This gives a diagonal correction to the inducing point approximation.

For $m = 500$ inducing points, sparse GPs can handle $n = 100{,}000$ training points. The choice of inducing point locations matters: placing them in regions of high data density gives better approximations.
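The inducing-point idea can be sketched with the simplest variant, a subset-of-regressors-style predictive mean (VFE and FITC refine this same $O(nm^2)$ structure); hyperparameter values are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def sparse_gp_mean(X, y, Z, X_star, sigma_n=0.1, ell=1.0):
    # O(n m^2) predictive mean through m inducing points Z
    Kmm = rbf(Z, Z, ell) + 1e-8 * np.eye(len(Z))      # jitter for stability
    Kmn = rbf(Z, X, ell)
    Sigma = np.linalg.inv(Kmm + Kmn @ Kmn.T / sigma_n**2)
    return rbf(X_star, Z, ell) @ Sigma @ Kmn @ y / sigma_n**2

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, 500)                    # n = 500 training points
y = np.sin(X) + 0.1 * rng.standard_normal(500)
Z = np.linspace(0, 2 * np.pi, 15)                     # m = 15 inducing points
mean = sparse_gp_mean(X, y, Z, np.array([1.0, 4.0]))
```

Fifteen inducing points spread through the data suffice to recover the smooth target to within the noise level here, illustrating why inducing-point placement in dense regions matters.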

Connection to Bayesian Optimization

Bayesian optimization uses a GP as a surrogate for an expensive black-box function ff. The GP posterior mean and variance define an acquisition function (e.g., expected improvement, UCB) that balances exploitation (evaluating where the mean is low) and exploration (evaluating where the variance is high).

The GP's calibrated uncertainty is critical here. If the uncertainty is underestimated, the optimizer exploits too early; if overestimated, it explores too much. GP uncertainty, with kernel hyperparameters fit via marginal likelihood, is typically well-calibrated for smooth functions.
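Expected improvement for minimization can be computed directly from the posterior mean and standard deviation; the candidate values below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: E[max(y_best - f, 0)] with f ~ N(mu, sigma^2)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# three hypothetical candidates: (confident, mean = best so far),
# (uncertain, mean = best), (confident, mean slightly below best)
mu = np.array([0.0, 0.0, -0.1])
sigma = np.array([0.01, 1.0, 0.01])
ei = expected_improvement(mu, sigma, y_best=0.0)
```

Both higher variance (exploration) and lower predicted mean (exploitation) raise EI relative to the first candidate, which is exactly the trade-off described above.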

Common Confusions

Watch Out

GP uncertainty is conditional on the kernel being correct

The posterior variance is exact given the GP prior. If the true function is not well-modeled by the chosen kernel (e.g., using an RBF kernel for a discontinuous function), the uncertainty estimates are miscalibrated. The GP does not "know" when its kernel assumption is wrong.
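This failure is easy to reproduce: fit an RBF GP (length scale fixed at 1 here, an illustrative choice) to step-function data and check the 2-standard-deviation band near the jump:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

# Step-function data: the RBF prior assumes a smoothness the data lacks.
X = np.array([-2.0, -1.5, -1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 1.5, 2.0])
y = np.sign(X)
sigma_n = 0.05

A = rbf(X, X) + sigma_n**2 * np.eye(len(X))
x_star = np.array([0.05])                       # just past the discontinuity
k_s = rbf(X, x_star)
mean = (k_s.T @ np.linalg.solve(A, y))[0]
var = 1.0 - (k_s.T @ np.linalg.solve(A, k_s))[0, 0]
sd = np.sqrt(var)
# The true value sign(0.05) = 1 falls far outside mean +/- 2 sd: the GP is
# confidently wrong because the variance does not depend on the residuals.
```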

Watch Out

The posterior mean equals kernel ridge regression, but the variance does not

The GP posterior mean $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ is identical to the kernel ridge regression prediction with regularization $\lambda = \sigma_n^2 / n$ when the squared loss carries the standard $1/n$ normalization (and $\lambda = \sigma_n^2$ when it does not). Kernel ridge regression does not produce the posterior variance. The uncertainty quantification is unique to the Bayesian (GP) formulation; see the GP-for-ML page for the equivalence proof.
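The mean equivalence can be checked numerically; this sketch assumes an RBF kernel and compares the two predictors on random data:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(0, 5, 7)
sigma_n = 0.2
n = len(X)

K = rbf(X, X)
K_s = rbf(X_star, X)

# GP posterior mean: k_*^T (K + sigma_n^2 I)^{-1} y
gp_mean = K_s @ np.linalg.solve(K + sigma_n**2 * np.eye(n), y)

# KRR with 1/n-normalized squared loss: alpha = (K + n lambda I)^{-1} y
lam = sigma_n**2 / n
krr_pred = K_s @ np.linalg.solve(K + n * lam * np.eye(n), y)

assert np.allclose(gp_mean, krr_pred)   # identical point predictions
```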

Canonical Examples

Example

GP regression on a 1D function

Sample 20 points from $f(x) = \sin(3x)$ on $[0, 2\pi]$ with noise $\sigma_n = 0.2$. Fit a GP with RBF kernel, optimizing $\ell$ and $\sigma_f$ via marginal likelihood. The posterior mean tracks the true sine function. The 95% confidence band ($\pm 2$ standard deviations) is narrow near training points and widens in gaps between points. Outside the training range ($x < 0$ or $x > 2\pi$), the posterior reverts to the prior mean (zero) with prior variance, correctly reflecting that the GP has no information there.
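A sketch of this example with fixed hyperparameters ($\ell = 0.5$, $\sigma_f = 1$, illustrative rather than the marginal-likelihood optimum), checking the reversion-to-prior behavior far from the data:

```python
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 2 * np.pi, 20))
y = np.sin(3 * X) + 0.2 * rng.standard_normal(20)
sigma_n = 0.2

A = rbf(X, X) + sigma_n**2 * np.eye(20)
X_star = np.array([X[5], 10.0])      # one point inside the data, one far outside
K_s = rbf(X, X_star)
mean = K_s.T @ np.linalg.solve(A, y)
var = 1.0 - np.sum(K_s * np.linalg.solve(A, K_s), axis=0)   # diag of posterior cov
# Far from the data the mean reverts to 0 and the variance to the prior value 1;
# at the training point the variance is much smaller.
```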

Summary

  • GP regression gives a closed-form Gaussian posterior: $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$
  • Posterior variance is large where data is sparse and small near training points
  • Kernel choice encodes smoothness and structure assumptions about $f$
  • Marginal likelihood selects hyperparameters without cross-validation via an automatic Occam's razor
  • Exact GP inference costs $O(n^3)$; sparse GPs with $m$ inducing points cost $O(nm^2)$
  • GPs are the default surrogate model for Bayesian optimization

Exercises

ExerciseCore

Problem

For a GP with RBF kernel ($\ell = 1$, $\sigma_f^2 = 1$) and noise $\sigma_n^2 = 0.1$, you observe one training point: $x_1 = 0$, $y_1 = 1$. Compute the posterior mean and variance at $x_* = 0$ and at $x_* = 3$.

ExerciseAdvanced

Problem

Show that the GP marginal likelihood penalizes both underfitting and overfitting. Specifically, consider the RBF kernel with length scale $\ell$. Explain what happens to the data-fit term and the complexity term as $\ell \to 0$ and as $\ell \to \infty$.

Connection to Kriging and RBF Interpolation

GP regression is mathematically identical to kriging, the geostatistical interpolation method developed by Matheron (1963) building on Krige's mining work.

In kriging, the covariance structure is specified through the variogram (related to the covariance function $C$ by $\gamma(h) = C(0) - C(h)$), the posterior mean is the kriging predictor, and the posterior variance is the kriging variance. The equations are the same: $\hat{f}(x_*) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$. The geostatistical and machine learning communities developed the same theory independently.

Radial basis function (RBF) interpolation is the noise-free special case: set $\sigma_n^2 = 0$, and the GP posterior mean becomes $\hat{f}(x_*) = \mathbf{k}_*^\top K^{-1} \mathbf{y}$, which passes exactly through every training point. This is classical RBF interpolation with the kernel $k(x, x')$ serving as the radial basis function. Common choices include the Gaussian ($\exp(-r^2/(2\ell^2))$), multiquadric ($\sqrt{1 + r^2/\ell^2}$), and Matérn kernels.
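The exact-interpolation claim is easy to verify numerically; this sketch uses a Gaussian RBF with a tiny jitter purely for conditioning:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([1.0, -0.5, 0.3, 2.0])

K = rbf(X, X) + 1e-12 * np.eye(4)     # jitter: numerical fix, not noise
weights = np.linalg.solve(K, y)       # K^{-1} y: RBF interpolation weights

def interpolant(x_star):
    # linear combination of basis functions k(., x_i) centered at the data
    return rbf(np.atleast_1d(x_star), X) @ weights
```

Evaluating the interpolant at the training inputs returns the training values to round-off, as the noise-free posterior mean formula predicts.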

The GP perspective adds two things that classical RBF interpolation lacks: principled uncertainty quantification (the posterior variance) and hyperparameter selection via marginal likelihood.

Proposition

Noise-Free GP Equals RBF Interpolation

Statement

With $\sigma_n^2 = 0$ the GP posterior mean reduces to $\hat{f}(x_*) = \mathbf{k}_*^\top K^{-1} \mathbf{y}$, and this predictor satisfies $\hat{f}(x_i) = y_i$ for every training input $x_i$. The predictor coincides with the classical RBF interpolant with radial basis function $k(\cdot, \cdot)$.

Intuition

Without observation noise the model has no reason to smooth past the data. The kernel inverse acts as the change of basis from data values to interpolation weights, and the linear combination of basis functions $k(\cdot, x_i)$ centered at training inputs reproduces the observed values exactly.

Proof Sketch

Apply the GP posterior mean formula at $x_* = x_j$. The vector $\mathbf{k}_*$ becomes the $j$-th row of $K$, so $\mathbf{k}_*^\top K^{-1} \mathbf{y} = \mathbf{e}_j^\top \mathbf{y} = y_j$. By construction the predictor is a finite linear combination of the $k(\cdot, x_i)$, which is the form of the RBF interpolant; uniqueness of the interpolant in $\operatorname{span}\{k(\cdot, x_i)\}$ for positive definite $K$ identifies the two predictors.

Why It Matters

The kriging / RBF correspondence makes the geostatistics and approximation-theory literatures directly importable into GPs. Convergence rates from RBF approximation theory transfer over.

Failure Mode

With $\sigma_n^2 = 0$ the matrix $K$ becomes ill-conditioned when training inputs cluster, and the explicit $K^{-1}$ amplifies numerical noise. In practice a small jitter $\varepsilon I$ is added for stability; this is a numerical fix, not a model of observation noise.
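The conditioning problem and the jitter fix can be seen directly on clustered inputs; the cluster spacing and jitter size below are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

# two tight clusters of training inputs: nearly duplicated rows in K
X = np.array([0.0, 1e-4, 1.0, 1.0 + 1e-4])
K = rbf(X, X)

cond_raw = np.linalg.cond(K)                        # nearly singular
cond_jitter = np.linalg.cond(K + 1e-6 * np.eye(4))  # jitter restores usability
# The jitter shrinks the condition number by orders of magnitude while
# perturbing the model only at the 1e-6 level.
```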

References


Kriging and RBF:

  • Matheron, "Principles of Geostatistics" (Economic Geology, 1963). Founding paper of kriging theory.
  • Cressie, Statistics for Spatial Data (Rev. ed., 1993), Chapters 3-4
  • Buhmann, Radial Basis Functions: Theory and Implementations (2003), Chapters 1-3
  • Wendland, Scattered Data Approximation (2005). Native spaces and convergence theory for RBF interpolation.

Current:

  • Snelson & Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs" (NeurIPS 2006). The FITC approximation named in the sparse-GP section.
  • Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
  • Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)
  • Wilson & Adams, "Gaussian Process Kernels for Pattern Discovery and Extrapolation" (ICML 2013, arXiv:1302.4245). Spectral mixture kernels: structured kernel learning from data.
  • Wilson, Hu, Salakhutdinov, Xing, "Deep Kernel Learning" (AISTATS 2016, arXiv:1511.02222). Composing neural-network feature extractors with GP output layers.


Last reviewed: May 8, 2026
