
Foundations

Gram Matrices and Kernel Matrices

The Gram matrix encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Connects linear algebra to ML.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Figure: Gram matrix of pentagon vertices — rank-2 data ⇒ spectrum has only 2 nonzero eigenvalues.

1. Data in $\mathbb{R}^2$: the five vertices of a regular pentagon inscribed in the unit circle.
2. Gram matrix $K$ ($5 \times 5$):

$$K = \begin{pmatrix} 1.00 & 0.31 & -0.81 & -0.81 & 0.31 \\ 0.31 & 1.00 & 0.31 & -0.81 & -0.81 \\ -0.81 & 0.31 & 1.00 & 0.31 & -0.81 \\ -0.81 & -0.81 & 0.31 & 1.00 & 0.31 \\ 0.31 & -0.81 & -0.81 & 0.31 & 1.00 \end{pmatrix}$$

3. Eigenvalues of $K$: two nonzero eigenvalues; the remaining three are $0$.
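The figure's claim can be checked numerically. A minimal NumPy sketch, assuming the pentagon's vertices lie on the unit circle (consistent with the unit diagonal of $K$):

```python
import numpy as np

# Five vertices of a regular pentagon on the unit circle (rank-2 data)
angles = 2 * np.pi * np.arange(5) / 5
X = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (5, 2)

G = X @ X.T  # 5x5 Gram matrix; G[i, j] = cos of the angle between x_i and x_j
eigvals = np.sort(np.linalg.eigvalsh(G))[::-1]

# rank(X) = 2, so exactly two eigenvalues are nonzero (both 5/2 here);
# the remaining three vanish up to floating-point noise
assert np.allclose(eigvals, [2.5, 2.5, 0.0, 0.0, 0.0], atol=1e-9)
assert np.isclose(G[0, 1], np.cos(2 * np.pi / 5))  # the 0.31 entries
```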

The Gram matrix appears everywhere in ML, often without being named. When you do kernel PCA, you operate on a kernel matrix (which is a Gram matrix in feature space). When you compute $X^T X$ for PCA or linear regression, that is a Gram matrix of the columns of $X$. In transformer attention, $QK^T$ is a closely related cross inner-product matrix, generalizing the Gram structure to two different projections of the inputs (see the Attention section below for the precise relationship).

Understanding the Gram matrix and its properties (especially positive semidefiniteness) is the key link between linear algebra and kernel methods, and between kernel methods and attention.

Definitions

Definition

Gram Matrix

Given vectors $x_1, \ldots, x_n$ in an inner product space, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries:

$$G_{ij} = \langle x_i, x_j \rangle$$

If $X \in \mathbb{R}^{n \times d}$ is the matrix with $x_i$ as its $i$-th row, then $G = XX^T$.

Definition

Kernel Matrix

Given a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and data points $x_1, \ldots, x_n \in \mathcal{X}$, the kernel matrix (or Gram matrix of the kernel) has entries:

$$K_{ij} = k(x_i, x_j)$$

If $k$ is a valid (positive definite) kernel, then by the Moore-Aronszajn theorem there exists a (unique) reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}_k$ such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}_k}$, so the kernel matrix is the Gram matrix of the mapped data. Mercer's theorem is a stronger statement: when $\mathcal{X}$ is compact and $k$ is continuous, it gives an explicit spectral expansion $k(x, y) = \sum_i \lambda_i \psi_i(x) \psi_i(y)$ in terms of eigenfunctions of the integral operator. For general PD kernels (e.g. on infinite or non-compact domains, or without continuity), Moore-Aronszajn is the right tool; Mercer is a special case.

The standard inner product $\langle x, y \rangle = x^T y$ gives the simplest case. The Gram matrix with the standard inner product is just $G = XX^T$. A kernel matrix with kernel $k$ generalizes this to $K_{ij} = k(x_i, x_j)$, which computes inner products in a (possibly infinite-dimensional) feature space.
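To make the feature-map view concrete, here is a short NumPy sketch. The six-dimensional map `phi` below is one standard explicit feature map for the polynomial kernel $k(x, y) = (x^T y + 1)^2$ on $\mathbb{R}^2$, written out for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))  # 6 data points in R^2

# Kernel matrix for k(x, y) = (x . y + 1)^2, computed without any feature map
K = (X @ X.T + 1) ** 2

def phi(x):
    """Explicit feature map with phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

Phi = np.vstack([phi(x) for x in X])

# The kernel matrix is exactly the Gram matrix of the mapped data ...
assert np.allclose(K, Phi @ Phi.T)
# ... and, being a Gram matrix, it is PSD
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```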

Positive Semidefiniteness

Theorem

Gram Matrices are Positive Semidefinite

Statement

For any vectors $x_1, \ldots, x_n$ in a real inner product space, the Gram matrix $G$ with $G_{ij} = \langle x_i, x_j \rangle$ is positive semidefinite. That is, for all $c \in \mathbb{R}^n$:

$$c^T G c = \left\| \sum_{i=1}^{n} c_i x_i \right\|^2 \geq 0$$

Intuition

The quadratic form $c^T G c$ computes the squared norm of a linear combination of the data vectors. Squared norms are always non-negative. This is why Gram matrices are always PSD: they represent inner product structure, and inner products induce non-negative norms.

Proof Sketch

$$c^T G c = \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = \left\langle \sum_i c_i x_i, \sum_j c_j x_j \right\rangle = \left\| \sum_i c_i x_i \right\|^2 \geq 0$$

The second equality uses bilinearity of the inner product. The inequality follows from the definition of a norm.

Why It Matters

PSD is the central property of Gram matrices. It guarantees that eigenvalues are non-negative, that the matrix defines a valid (semi-)metric on the data, and that kernel methods are well-posed. Any matrix that arises as pairwise inner products must be PSD.

Failure Mode

The Gram matrix is PSD but may not be positive definite. It is singular (has zero eigenvalues) when the vectors $x_1, \ldots, x_n$ are linearly dependent. For $n > d$ (more points than dimensions), the Gram matrix $XX^T$ has rank at most $d$, so at least $n - d$ eigenvalues are zero.
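Both halves of this picture — the quadratic form as a squared norm, and the rank deficiency when $n > d$ — are easy to verify numerically. A sketch with NumPy and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3                      # more points than dimensions
X = rng.standard_normal((n, d))
G = X @ X.T

# c^T G c equals || sum_i c_i x_i ||^2, hence it is non-negative for every c
c = rng.standard_normal(n)
quad = c @ G @ c
assert np.isclose(quad, np.linalg.norm(c @ X) ** 2)
assert quad >= 0

# With n > d the Gram matrix is singular: at least n - d zero eigenvalues
eigvals = np.linalg.eigvalsh(G)
assert np.sum(np.isclose(eigvals, 0.0, atol=1e-9)) >= n - d
```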

Eigenvalues and Data Geometry

Proposition

Gram Matrix Eigenvalues Reflect Data Geometry

Statement

The non-zero eigenvalues of $G = XX^T$ are the same as the non-zero eigenvalues of $X^T X$. If the SVD of $X$ is $X = U\Sigma V^T$, then:

$$G = XX^T = U\Sigma^2 U^T$$

The eigenvalues of $G$ are $\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0$ where $r = \text{rank}(X)$ and $\sigma_i$ are the singular values of $X$.

Intuition

The eigenvalues of the Gram matrix tell you how the data is spread in different directions. Large eigenvalues correspond to directions of high variance in the data. This is exactly the information PCA extracts. Computing PCA from $G = XX^T$ (dual PCA) is equivalent to computing it from $X^T X$ (primal PCA), but the matrix sizes differ: $G$ is $n \times n$ while $X^T X$ is $d \times d$.

Proof Sketch

If $X = U\Sigma V^T$, then $XX^T = U\Sigma V^T V\Sigma^T U^T = U\Sigma\Sigma^T U^T = U\Sigma^2 U^T$. Similarly, $X^T X = V\Sigma^2 V^T$. Both have non-zero eigenvalues $\sigma_i^2$.

Why It Matters

This duality between $XX^T$ and $X^T X$ is the foundation of kernel PCA. When $d \gg n$ (more features than samples), working with the $n \times n$ Gram matrix is cheaper. When you use a kernel, you only have access to $K$ (the Gram matrix in feature space), and the eigendecomposition of $K$ gives you PCA in feature space.
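A quick numerical sanity check of this spectral correspondence (NumPy, random data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))  # n = 5 points in d = 3 dimensions

G = X @ X.T                                   # n x n Gram matrix
sigma = np.linalg.svd(X, compute_uv=False)    # singular values, descending
eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]  # eigenvalues, descending

# Top d eigenvalues of G are the squared singular values; the rest vanish
assert np.allclose(eig_G[:3], sigma ** 2)
assert np.allclose(eig_G[3:], 0.0)

# X^T X shares the same nonzero spectrum
eig_C = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
assert np.allclose(eig_C, sigma ** 2)
```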

Failure Mode

Computing the full eigendecomposition of $G$ costs $O(n^3)$. For large $n$, this is prohibitive, and you need approximate methods (randomized SVD, Nyström approximation). The Gram matrix also requires $O(n^2)$ storage, which limits applicability to datasets with many observations.

Connections to ML

Kernel Methods

In kernel methods, you work with $K_{ij} = k(x_i, x_j)$ and never compute the feature map $\phi$ explicitly. The kernel matrix $K$ is all you need for:

  • Kernel SVM: the dual formulation involves $K$ directly
  • Kernel PCA: eigendecompose $K$ to get principal components in feature space
  • Gaussian processes: the kernel matrix (plus noise) is the covariance matrix of the function values

Attention

In a transformer, the attention matrix is:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

Here $Q = XW_Q$ and $K = XW_K$ with generally $W_Q \neq W_K$, so $(QK^T)_{ij} = q_i^T k_j$ is a cross inner-product matrix: it records inner products between two different linear projections of the input tokens. This is not a Gram matrix in the strict sense. It is generally neither symmetric nor positive semidefinite (Exercise 2 works this out). Before the softmax, $QK^T$ can be read as a bilinear similarity score under the learned metric $W_Q W_K^T$. Only in the special case $W_Q = W_K$ (tied weights) does $QK^T$ reduce to a true Gram matrix. The Gram-matrix intuition (pairwise similarities) transfers; the PSD property does not.
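A small NumPy sketch of the contrast (the dimensions and random weights are arbitrary illustrative choices, not a real model):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 8))      # 4 tokens, model dimension 8
W_Q = rng.standard_normal((8, 8))
W_K = rng.standard_normal((8, 8))

# Untied projections: cross inner-product matrix, generally not symmetric
S = (X @ W_Q) @ (X @ W_K).T
assert not np.allclose(S, S.T)

# Tied projections (W_Q = W_K): a true Gram matrix, symmetric and PSD
S_tied = (X @ W_Q) @ (X @ W_Q).T
assert np.allclose(S_tied, S_tied.T)
assert np.all(np.linalg.eigvalsh(S_tied) >= -1e-9)
```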

PCA and SVD

PCA on $n$ data points in $\mathbb{R}^d$ can be computed from either $X^T X$ ($d \times d$, primal) or $XX^T$ ($n \times n$, dual). The dual formulation uses the Gram matrix and is the starting point for kernel PCA.
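The dual route can be sketched in a few lines: eigendecompose the Gram matrix, then map each dual eigenvector $u_k$ back to a primal principal direction via $v_k = X^T u_k / \sqrt{\lambda_k}$, a standard identity that follows from the SVD. A NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
Xc = X - X.mean(axis=0)                 # center before PCA

# Dual PCA: eigendecompose the n x n Gram matrix
lam, U = np.linalg.eigh(Xc @ Xc.T)
lam, U = lam[::-1], U[:, ::-1]          # sort descending

# Recover the top primal direction: v = X^T u / sqrt(lambda)
v = Xc.T @ U[:, 0] / np.sqrt(lam[0])

# It matches the top eigenvector of the d x d matrix X^T X (up to sign)
_, V = np.linalg.eigh(Xc.T @ Xc)
v_primal = V[:, -1]                     # eigh sorts ascending
assert np.allclose(abs(v @ v_primal), 1.0)
```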

Common Confusions

Watch Out

Gram matrix vs covariance matrix

The Gram matrix is $XX^T$ (size $n \times n$, pairwise similarities between data points). The covariance matrix is $X^T X / n$ (size $d \times d$, pairwise relationships between features), assuming centered data. They encode complementary views of the same dataset: point similarities vs feature correlations.
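A short sketch making the two views concrete (NumPy, centered random data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 7, 3
X = rng.standard_normal((n, d))
Xc = X - X.mean(axis=0)   # the covariance view assumes centered data

G = Xc @ Xc.T             # n x n: pairwise similarities between data points
C = Xc.T @ Xc / n         # d x d: covariance between features

assert G.shape == (n, n) and C.shape == (d, d)

# Complementary views of the same data: identical nonzero spectrum up to 1/n
eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]
eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(eig_G[:d] / n, eig_C)
```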

Watch Out

PSD does not mean all entries are non-negative

A Gram matrix can have negative entries. For example, if $x_1 = (1, 0)$ and $x_2 = (-1, 0)$, then $G_{12} = -1$. PSD refers to the quadratic form $c^T G c \geq 0$ for all $c$, not to individual entries.

Summary

  • Gram matrix: $G_{ij} = \langle x_i, x_j \rangle$. For data matrix $X$: $G = XX^T$
  • Always PSD. Eigenvalues are non-negative.
  • Kernel matrix: same structure with $k(x_i, x_j)$ replacing $\langle x_i, x_j \rangle$
  • Non-zero eigenvalues of $G$ equal squared singular values of $X$
  • Appears in kernel methods, PCA (dual formulation), and SVD. Attention's $QK^T$ is a related cross inner-product matrix (symmetric/PSD only with tied $W_Q = W_K$)

Exercises

ExerciseCore

Problem

Given $x_1 = (1, 2)$, $x_2 = (3, 0)$, $x_3 = (0, 1)$, compute the Gram matrix $G = XX^T$ where the rows of $X$ are $x_1, x_2, x_3$. Verify that $G$ is PSD by checking that all eigenvalues are non-negative.

ExerciseAdvanced

Problem

Show that the attention score matrix $QK^T$ in a transformer is not a Gram matrix in general. Under what tying conditions on the projections $W_Q, W_K$ (or on the inputs) does $QK^T$ reduce to a Gram matrix, and what does that imply about its eigenvalues? Construct an explicit example where $QK^T$ has a negative eigenvalue.

References

Canonical:

  • Horn & Johnson, Matrix Analysis (2013), Chapter 7 (Positive Definite Matrices)
  • Axler, Linear Algebra Done Right (2024), Chapter 6
  • Schölkopf & Smola, Learning with Kernels (2002), Chapter 2
  • Shawe-Taylor & Cristianini, Kernel Methods for Pattern Analysis (2004), Chapters 3-4
  • Aronszajn, "Theory of Reproducing Kernels", Transactions of the American Mathematical Society (1950)

Current:

  • Vaswani et al., "Attention Is All You Need" (2017), for the attention connection
  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 16
  • Wainwright, High-Dimensional Statistics (2019), Chapter 12
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 16

Last reviewed: April 26, 2026
