
ML Methods

Gaussian Process Regression

Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.


Why This Matters

Gaussian process regression gives you two things that most regression methods do not: a posterior mean (point prediction) and a posterior variance (uncertainty estimate) at every test point, both in closed form. The uncertainty is calibrated: it is large where training data is sparse and small where data is dense. This is a Bayesian approach: place a prior over functions, condition on data, and read off the posterior.

This built-in uncertainty quantification makes GPs the standard surrogate model in Bayesian optimization, where you need to decide where to evaluate an expensive function next. The acquisition function depends on both the predicted value and the uncertainty, and GPs provide both.

Formal Setup

Assume we observe $n$ training points $(X, \mathbf{y})$, where $X = [x_1, \ldots, x_n]^\top$ and $\mathbf{y} = [y_1, \ldots, y_n]^\top$, with $y_i = f(x_i) + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$.

Place a GP prior on $f$: $f \sim \mathcal{GP}(0, k(x, x'))$, where $k$ is the kernel (covariance function).

Definition

Kernel Matrix

The kernel matrix (Gram matrix) is $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. For a test point $x_*$, define the vector $\mathbf{k}_* = [k(x_*, x_1), \ldots, k(x_*, x_n)]^\top \in \mathbb{R}^n$ and the scalar $k_{**} = k(x_*, x_*)$.

Posterior and Marginal Likelihood

The posterior predictive distribution and the log marginal likelihood for GP regression are derived in the canonical Gaussian Processes for ML page; both follow from conditioning a multivariate Gaussian. The closed-form posterior mean is $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$, the posterior variance is $k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$, and the log marginal likelihood combines a data-fit term, a log-determinant complexity term, and a normalization constant.
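These closed forms are short to implement. A minimal numpy sketch, assuming a 1-D RBF kernel with illustrative hyperparameter values:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # k(x, x') = exp(-|x - x'|^2 / (2 ell^2)) for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, X_star, sigma_n=0.1, ell=1.0):
    K = rbf(X, X, ell)                              # n x n kernel matrix
    K_s = rbf(X, X_star, ell)                       # cross-covariances k_*
    K_ss = rbf(X_star, X_star, ell)                 # prior covariance at test points
    A = K + sigma_n**2 * np.eye(len(X))             # K + sigma_n^2 I
    alpha = np.linalg.solve(A, y)                   # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                            # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)    # posterior covariance
    return mean, cov

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, np.array([0.5, 3.0]))
```

The second test point sits outside the data, so its posterior variance is larger than at the interpolation point, matching the calibration behavior described above.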

This page is a special case of Gaussian processes for ML focused on regression with Gaussian likelihood. The classical results are stated there; this page covers the kriging / RBF connection, sparse approximations, and the numerical pipeline.

Kernel Choice

The kernel encodes prior assumptions about the function $f$:

  • Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp(-\|x - x'\|^2 / (2\ell^2))$. Assumes $f$ is infinitely differentiable. The length scale $\ell$ controls smoothness.
  • Matérn-$\nu$: $k(x, x') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,\|x-x'\|}{\ell}\right)$. Controls differentiability via $\nu$: $\nu = 1/2$ gives Ornstein-Uhlenbeck (rough), $\nu = 3/2$ gives once-differentiable sample paths, and $\nu \to \infty$ recovers the RBF.
  • Periodic: for functions with known periodicity.
  • Sums and products of kernels are valid kernels, enabling compositional modeling.
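For 1-D inputs the kernels above, and the closure under sums and products, can be sketched as follows; the hyperparameter values are illustrative:

```python
import numpy as np

def rbf(r, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-r**2 / (2 * ell**2))

def matern32(r, ell=1.0, sf2=1.0):
    # Matérn with nu = 3/2: once-differentiable sample paths
    a = np.sqrt(3) * np.abs(r) / ell
    return sf2 * (1 + a) * np.exp(-a)

def periodic(r, ell=1.0, sf2=1.0, p=2 * np.pi):
    # exp-sine-squared form with period p
    return sf2 * np.exp(-2 * np.sin(np.pi * np.abs(r) / p) ** 2 / ell**2)

x = np.linspace(0, 4, 50)
R = x[:, None] - x[None, :]          # pairwise differences

# Sums and products of kernels are kernels: the resulting Gram matrices
# stay positive semidefinite (checked via eigenvalues, up to round-off).
K_sum = rbf(R) + periodic(R)
K_prod = rbf(R) * matern32(R)
assert np.linalg.eigvalsh(K_sum).min() > -1e-8
assert np.linalg.eigvalsh(K_prod).min() > -1e-8
```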

Marginal Likelihood for Hyperparameter Selection

For the explicit log marginal likelihood and its proof, see the Gaussian processes for ML page. In practice the marginal likelihood is non-convex in the hyperparameters and can have multiple local optima, especially with compositional kernels — different initializations can lead to different solutions.
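A hedged sketch of the log marginal likelihood for an RBF kernel, computed via a Cholesky factorization, with a crude grid search standing in for gradient-based optimization (in practice several restarts guard against the local optima mentioned above):

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sigma_n=0.1):
    # log p(y) = -1/2 y^T A^{-1} y - 1/2 log|A| - n/2 log(2 pi), A = K + sigma_n^2 I
    n = len(X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ell**2))
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # = -1/2 log|K + sigma_n^2 I|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 25)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(25)

grid = [0.05, 0.1, 0.3, 1.0, 3.0]              # illustrative length scales
best = max(grid, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

Very small length scales explain the data as independent noise (poor data fit), and very large ones cannot bend to follow $\sin(3x)$ at all; an intermediate $\ell$ wins.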

Sparse Gaussian Processes

Exact GP regression costs $O(n^3)$. Sparse GPs reduce this by selecting $m \ll n$ inducing points $Z = \{z_1, \ldots, z_m\}$ and approximating the full GP through these points.

The variational free energy (VFE, Titsias 2009) approach optimizes the inducing point locations and a variational distribution $q(\mathbf{u})$ over the function values at the inducing points. The resulting cost is $O(nm^2)$ for training and $O(m^2)$ for prediction.

The FITC (fully independent training conditional) approximation assumes conditional independence of training points given the inducing points. This gives a diagonal correction to the inducing point approximation.

For $m = 500$ inducing points, sparse GPs can handle $n = 100{,}000$ training points. The choice of inducing point locations matters: placing them in regions of high data density gives better approximations.
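The inducing-point idea can be sketched with the simplest variant, a subset-of-regressors-style predictive mean (VFE and FITC refine this same $O(nm^2)$ structure); hyperparameter values are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def sparse_gp_mean(X, y, Z, X_star, sigma_n=0.1, ell=1.0):
    # O(n m^2) predictive mean through m inducing points Z
    Kmm = rbf(Z, Z, ell) + 1e-8 * np.eye(len(Z))      # jitter for stability
    Kmn = rbf(Z, X, ell)
    Sigma = np.linalg.inv(Kmm + Kmn @ Kmn.T / sigma_n**2)
    return rbf(X_star, Z, ell) @ Sigma @ Kmn @ y / sigma_n**2

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, 500)                    # n = 500 training points
y = np.sin(X) + 0.1 * rng.standard_normal(500)
Z = np.linspace(0, 2 * np.pi, 15)                     # m = 15 inducing points
mean = sparse_gp_mean(X, y, Z, np.array([1.0, 4.0]))
```

Fifteen inducing points spread through the data suffice to recover the smooth target to within the noise level here, illustrating why inducing-point placement in dense regions matters.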

Connection to Bayesian Optimization

Bayesian optimization uses a GP as a surrogate for an expensive black-box function ff. The GP posterior mean and variance define an acquisition function (e.g., expected improvement, UCB) that balances exploitation (evaluating where the mean is low) and exploration (evaluating where the variance is high).

The GP's calibrated uncertainty is critical here. If the uncertainty is underestimated, the optimizer exploits too early; if overestimated, it explores too much. GP uncertainty, with kernel hyperparameters fit via marginal likelihood, is typically well-calibrated for smooth functions.
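Expected improvement for minimization can be computed directly from the posterior mean and standard deviation; the candidate values below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: E[max(y_best - f, 0)] with f ~ N(mu, sigma^2)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# three hypothetical candidates: (confident, mean = best so far),
# (uncertain, mean = best), (confident, mean slightly below best)
mu = np.array([0.0, 0.0, -0.1])
sigma = np.array([0.01, 1.0, 0.01])
ei = expected_improvement(mu, sigma, y_best=0.0)
```

Both higher variance (exploration) and lower predicted mean (exploitation) raise EI relative to the first candidate, which is exactly the trade-off described above.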

Common Confusions

Watch Out

GP uncertainty is conditional on the kernel being correct

The posterior variance is exact given the GP prior. If the true function is not well-modeled by the chosen kernel (e.g., using an RBF kernel for a discontinuous function), the uncertainty estimates are miscalibrated. The GP does not "know" when its kernel assumption is wrong.
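This failure is easy to reproduce: fit an RBF GP (length scale fixed at 1 here, an illustrative choice) to step-function data and check the 2-standard-deviation band near the jump:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

# Step-function data: the RBF prior assumes a smoothness the data lacks.
X = np.array([-2.0, -1.5, -1.0, -0.5, -0.1, 0.1, 0.5, 1.0, 1.5, 2.0])
y = np.sign(X)
sigma_n = 0.05

A = rbf(X, X) + sigma_n**2 * np.eye(len(X))
x_star = np.array([0.05])                       # just past the discontinuity
k_s = rbf(X, x_star)
mean = (k_s.T @ np.linalg.solve(A, y))[0]
var = 1.0 - (k_s.T @ np.linalg.solve(A, k_s))[0, 0]
sd = np.sqrt(var)
# The true value sign(0.05) = 1 falls far outside mean +/- 2 sd: the GP is
# confidently wrong because the variance does not depend on the residuals.
```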

Watch Out

The posterior mean equals kernel ridge regression, but the variance does not

The GP posterior mean $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$ is identical to the kernel ridge regression prediction with regularization $\lambda = \sigma_n^2 / n$ when the squared loss carries the standard $1/n$ normalization (and $\lambda = \sigma_n^2$ when it does not). Kernel ridge regression does not produce the posterior variance. The uncertainty quantification is unique to the Bayesian (GP) formulation; see the GP-for-ML page for the equivalence proof.
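The mean equivalence can be checked numerically; this sketch assumes an RBF kernel and compares the two predictors on random data:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(0, 5, 7)
sigma_n = 0.2
n = len(X)

K = rbf(X, X)
K_s = rbf(X_star, X)

# GP posterior mean: k_*^T (K + sigma_n^2 I)^{-1} y
gp_mean = K_s @ np.linalg.solve(K + sigma_n**2 * np.eye(n), y)

# KRR with 1/n-normalized squared loss: alpha = (K + n lambda I)^{-1} y
lam = sigma_n**2 / n
krr_pred = K_s @ np.linalg.solve(K + n * lam * np.eye(n), y)

assert np.allclose(gp_mean, krr_pred)   # identical point predictions
```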

Canonical Examples

Example

GP regression on a 1D function

Sample 20 points from $f(x) = \sin(3x)$ on $[0, 2\pi]$ with noise $\sigma_n = 0.2$. Fit a GP with RBF kernel, optimizing $\ell$ and $\sigma_f$ via marginal likelihood. The posterior mean tracks the true sine function. The 95% confidence band ($\pm 2$ standard deviations) is narrow near training points and widens in gaps between points. Outside the training range ($x < 0$ or $x > 2\pi$), the posterior reverts to the prior mean (zero) with prior variance, correctly reflecting that the GP has no information there.
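A sketch of this example with fixed hyperparameters ($\ell = 0.5$, $\sigma_f = 1$, illustrative rather than the marginal-likelihood optimum), checking the reversion-to-prior behavior far from the data:

```python
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 2 * np.pi, 20))
y = np.sin(3 * X) + 0.2 * rng.standard_normal(20)
sigma_n = 0.2

A = rbf(X, X) + sigma_n**2 * np.eye(20)
X_star = np.array([X[5], 10.0])      # one point inside the data, one far outside
K_s = rbf(X, X_star)
mean = K_s.T @ np.linalg.solve(A, y)
var = 1.0 - np.sum(K_s * np.linalg.solve(A, K_s), axis=0)   # diag of posterior cov
# Far from the data the mean reverts to 0 and the variance to the prior value 1;
# at the training point the variance is much smaller.
```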

Summary

  • GP regression gives a closed-form Gaussian posterior: $\bar{f}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$
  • Posterior variance is large where data is sparse and small near training points
  • Kernel choice encodes smoothness and structure assumptions about $f$
  • Marginal likelihood selects hyperparameters without cross-validation via an automatic Occam's razor
  • Exact GP inference costs $O(n^3)$; sparse GPs with $m$ inducing points cost $O(nm^2)$
  • GPs are the default surrogate model for Bayesian optimization

Exercises

ExerciseCore

Problem

For a GP with RBF kernel ($\ell = 1$, $\sigma_f^2 = 1$) and noise $\sigma_n^2 = 0.1$, you observe one training point: $x_1 = 0$, $y_1 = 1$. Compute the posterior mean and variance at $x_* = 0$ and at $x_* = 3$.

ExerciseAdvanced

Problem

Show that the GP marginal likelihood penalizes both underfitting and overfitting. Specifically, consider the RBF kernel with length scale $\ell$. Explain what happens to the data-fit term and the complexity term as $\ell \to 0$ and as $\ell \to \infty$.

Connection to Kriging and RBF Interpolation

GP regression is mathematically identical to kriging, the geostatistical interpolation method developed by Matheron (1963) building on Krige's mining work.

In kriging, the covariance structure is specified through the variogram (related to the covariance function $C$ by $\gamma(h) = C(0) - C(h)$), the posterior mean is the kriging predictor, and the posterior variance is the kriging variance. The equations are the same: $\hat{f}(x_*) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$. The geostatistical and machine learning communities developed the same theory independently.

Radial basis function (RBF) interpolation is the noise-free special case: set $\sigma_n^2 = 0$, and the GP posterior mean becomes $\hat{f}(x_*) = \mathbf{k}_*^\top K^{-1} \mathbf{y}$, which passes exactly through every training point. This is classical RBF interpolation with the kernel $k(x, x')$ serving as the radial basis function. Common choices include the Gaussian ($\exp(-r^2/(2\ell^2))$), multiquadric ($\sqrt{1 + r^2/\ell^2}$), and Matérn kernels.
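The exact-interpolation claim is easy to verify numerically; this sketch uses a Gaussian RBF with a tiny jitter purely for conditioning:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([1.0, -0.5, 0.3, 2.0])

K = rbf(X, X) + 1e-12 * np.eye(4)     # jitter: numerical fix, not noise
weights = np.linalg.solve(K, y)       # K^{-1} y: RBF interpolation weights

def interpolant(x_star):
    # linear combination of basis functions k(., x_i) centered at the data
    return rbf(np.atleast_1d(x_star), X) @ weights
```

Evaluating the interpolant at the training inputs returns the training values to round-off, as the noise-free posterior mean formula predicts.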

The GP perspective adds two things that classical RBF interpolation lacks: principled uncertainty quantification (the posterior variance) and hyperparameter selection via marginal likelihood.

Proposition

Noise-Free GP Equals RBF Interpolation

Statement

With $\sigma_n^2 = 0$ the GP posterior mean reduces to $\hat{f}(x_*) = \mathbf{k}_*^\top K^{-1} \mathbf{y}$, and this predictor satisfies $\hat{f}(x_i) = y_i$ for every training input $x_i$. The predictor coincides with the classical RBF interpolant with radial basis function $k(\cdot, \cdot)$.

Intuition

Without observation noise the model has no reason to smooth past the data. The kernel inverse acts as the change of basis from data values to interpolation weights, and the linear combination of basis functions $k(\cdot, x_i)$ centered at training inputs reproduces the observed values exactly.

Proof Sketch

Apply the GP posterior mean formula at $x_* = x_j$. The vector $\mathbf{k}_*$ becomes the $j$-th row of $K$, so $\mathbf{k}_*^\top K^{-1} \mathbf{y} = \mathbf{e}_j^\top \mathbf{y} = y_j$. By construction the predictor is a finite linear combination of the $k(\cdot, x_i)$, which is the form of the RBF interpolant; uniqueness of the interpolant in $\operatorname{span}\{k(\cdot, x_i)\}$ for positive definite $K$ identifies the two predictors.

Why It Matters

The kriging / RBF correspondence makes the geostatistics and approximation-theory literatures directly importable into GPs. Convergence rates from RBF approximation theory transfer over.

Failure Mode

With $\sigma_n^2 = 0$ the matrix $K$ becomes ill-conditioned when training inputs cluster, and the explicit $K^{-1}$ amplifies numerical noise. In practice a small jitter $\varepsilon I$ is added for stability; this is a numerical fix, not a model of observation noise.
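The conditioning problem and the jitter fix can be seen directly on clustered inputs; the cluster spacing and jitter size below are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

# two tight clusters of training inputs: nearly duplicated rows in K
X = np.array([0.0, 1e-4, 1.0, 1.0 + 1e-4])
K = rbf(X, X)

cond_raw = np.linalg.cond(K)                        # nearly singular
cond_jitter = np.linalg.cond(K + 1e-6 * np.eye(4))  # jitter restores usability
# The jitter shrinks the condition number by orders of magnitude while
# perturbing the model only at the 1e-6 level.
```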

References


Kriging and RBF:

  • Matheron, "Principles of Geostatistics" (Economic Geology, 1963). Founding paper of kriging theory.
  • Cressie, Statistics for Spatial Data (Rev. ed., 1993), Chapters 3-4
  • Buhmann, Radial Basis Functions: Theory and Implementations (2003), Chapters 1-3
  • Wendland, Scattered Data Approximation (2005). Native spaces and convergence theory for RBF interpolation.

Current:

  • Snelson & Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs" (NeurIPS 2006). The FITC approximation named in the sparse-GP section.
  • Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
  • Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)
  • Wilson & Adams, "Gaussian Process Kernels for Pattern Discovery and Extrapolation" (ICML 2013, arXiv:1302.4245). Spectral mixture kernels: structured kernel learning from data.
  • Wilson, Hu, Salakhutdinov, Xing, "Deep Kernel Learning" (AISTATS 2016, arXiv:1511.02222). Composing neural-network feature extractors with GP output layers.


Last reviewed: May 8, 2026
