
Foundations

Gram Matrices and Kernel Matrices

The Gram matrix encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Connects linear algebra to ML.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Figure: Gram matrix of pentagon vertices — rank-2 data ⇒ spectrum has only 2 nonzero eigenvalues.

1. Data in $\mathbb{R}^2$: the five vertices of a regular pentagon inscribed in the unit circle.
2. Gram matrix $K$ ($5 \times 5$):

$$K = \begin{pmatrix} 1.00 & 0.31 & -0.81 & -0.81 & 0.31 \\ 0.31 & 1.00 & 0.31 & -0.81 & -0.81 \\ -0.81 & 0.31 & 1.00 & 0.31 & -0.81 \\ -0.81 & -0.81 & 0.31 & 1.00 & 0.31 \\ 0.31 & -0.81 & -0.81 & 0.31 & 1.00 \end{pmatrix}$$

3. Eigenvalues of $K$: two nonzero eigenvalues; the remaining three are $0$.
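The figure's claim can be checked numerically. A minimal NumPy sketch, assuming the pentagon's vertices lie on the unit circle (consistent with the unit diagonal of $K$):

```python
import numpy as np

# Five vertices of a regular pentagon on the unit circle (rank-2 data)
angles = 2 * np.pi * np.arange(5) / 5
X = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (5, 2)

G = X @ X.T  # 5x5 Gram matrix; G[i, j] = cos of the angle between x_i and x_j
eigvals = np.sort(np.linalg.eigvalsh(G))[::-1]

# rank(X) = 2, so exactly two eigenvalues are nonzero (both 5/2 here);
# the remaining three vanish up to floating-point noise
assert np.allclose(eigvals, [2.5, 2.5, 0.0, 0.0, 0.0], atol=1e-9)
assert np.isclose(G[0, 1], np.cos(2 * np.pi / 5))  # the 0.31 entries
```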

The Gram matrix appears everywhere in ML, often without being named. When you do kernel PCA, you operate on a kernel matrix (which is a Gram matrix in feature space). When you compute $X^T X$ for PCA or linear regression, that is a Gram matrix of the columns of $X$. In transformer attention, $QK^T$ is a closely related cross inner-product matrix, generalizing the Gram structure to two different projections of the inputs (see the Attention section below for the precise relationship).

Understanding the Gram matrix and its properties (especially positive semidefiniteness) is the key link between linear algebra and kernel methods, and between kernel methods and attention.

Definitions

Definition

Gram Matrix

Given vectors $x_1, \ldots, x_n$ in an inner product space, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries:

$$G_{ij} = \langle x_i, x_j \rangle$$

If $X \in \mathbb{R}^{n \times d}$ is the matrix with $x_i$ as its $i$-th row, then $G = XX^T$.

Definition

Kernel Matrix

Given a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and data points $x_1, \ldots, x_n \in \mathcal{X}$, the kernel matrix (or Gram matrix of the kernel) has entries:

$$K_{ij} = k(x_i, x_j)$$

If $k$ is a valid (positive definite) kernel, then by the Moore-Aronszajn theorem there exists a (unique) reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}_k$ such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}_k}$, so the kernel matrix is the Gram matrix of the mapped data. Mercer's theorem is a stronger statement: when $\mathcal{X}$ is compact and $k$ is continuous, it gives an explicit spectral expansion $k(x, y) = \sum_i \lambda_i \psi_i(x) \psi_i(y)$ in terms of eigenfunctions of the integral operator. For general PD kernels (e.g. on infinite or non-compact domains, or without continuity), Moore-Aronszajn is the right tool; Mercer is a special case.

The standard inner product $\langle x, y \rangle = x^T y$ gives the simplest case. The Gram matrix with the standard inner product is just $G = XX^T$. A kernel matrix with kernel $k$ generalizes this to $K_{ij} = k(x_i, x_j)$, which computes inner products in a (possibly infinite-dimensional) feature space.
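To make the feature-map view concrete, here is a short NumPy sketch. The six-dimensional map `phi` below is one standard explicit feature map for the polynomial kernel $k(x, y) = (x^T y + 1)^2$ on $\mathbb{R}^2$, written out for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))  # 6 data points in R^2

# Kernel matrix for k(x, y) = (x . y + 1)^2, computed without any feature map
K = (X @ X.T + 1) ** 2

def phi(x):
    """Explicit feature map with phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

Phi = np.vstack([phi(x) for x in X])

# The kernel matrix is exactly the Gram matrix of the mapped data ...
assert np.allclose(K, Phi @ Phi.T)
# ... and, being a Gram matrix, it is PSD
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```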

Positive Semidefiniteness

Theorem

Gram Matrices are Positive Semidefinite

Statement

For any vectors $x_1, \ldots, x_n$ in a real inner product space, the Gram matrix $G$ with $G_{ij} = \langle x_i, x_j \rangle$ is positive semidefinite. That is, for all $c \in \mathbb{R}^n$:

$$c^T G c = \left\| \sum_{i=1}^{n} c_i x_i \right\|^2 \geq 0$$

Intuition

The quadratic form $c^T G c$ computes the squared norm of a linear combination of the data vectors. Squared norms are always non-negative. This is why Gram matrices are always PSD: they represent inner product structure, and inner products induce non-negative norms.

Proof Sketch

$$c^T G c = \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = \left\langle \sum_i c_i x_i, \sum_j c_j x_j \right\rangle = \left\| \sum_i c_i x_i \right\|^2 \geq 0$$

The second equality uses bilinearity of the inner product. The inequality follows from the definition of a norm.

Why It Matters

PSD is the central property of Gram matrices. It guarantees that eigenvalues are non-negative, that the matrix defines a valid (semi-)metric on the data, and that kernel methods are well-posed. Any matrix that arises as pairwise inner products must be PSD.

Failure Mode

The Gram matrix is PSD but may not be positive definite. It is singular (has zero eigenvalues) when the vectors $x_1, \ldots, x_n$ are linearly dependent. For $n > d$ (more points than dimensions), the Gram matrix $XX^T$ has rank at most $d$, so at least $n - d$ eigenvalues are zero.
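Both halves of this picture — the quadratic form as a squared norm, and the rank deficiency when $n > d$ — are easy to verify numerically. A sketch with NumPy and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3                      # more points than dimensions
X = rng.standard_normal((n, d))
G = X @ X.T

# c^T G c equals || sum_i c_i x_i ||^2, hence it is non-negative for every c
c = rng.standard_normal(n)
quad = c @ G @ c
assert np.isclose(quad, np.linalg.norm(c @ X) ** 2)
assert quad >= 0

# With n > d the Gram matrix is singular: at least n - d zero eigenvalues
eigvals = np.linalg.eigvalsh(G)
assert np.sum(np.isclose(eigvals, 0.0, atol=1e-9)) >= n - d
```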

Eigenvalues and Data Geometry

Proposition

Gram Matrix Eigenvalues Reflect Data Geometry

Statement

The non-zero eigenvalues of $G = XX^T$ are the same as the non-zero eigenvalues of $X^T X$. If the SVD of $X$ is $X = U\Sigma V^T$, then:

$$G = XX^T = U\Sigma^2 U^T$$

The eigenvalues of $G$ are $\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0$ where $r = \text{rank}(X)$ and $\sigma_i$ are the singular values of $X$.

Intuition

The eigenvalues of the Gram matrix tell you how the data is spread in different directions. Large eigenvalues correspond to directions of high variance in the data. This is exactly the information PCA extracts. Computing PCA from $G = XX^T$ (dual PCA) is equivalent to computing it from $X^T X$ (primal PCA), but the matrix sizes differ: $G$ is $n \times n$ while $X^T X$ is $d \times d$.

Proof Sketch

If $X = U\Sigma V^T$, then $XX^T = U\Sigma V^T V\Sigma^T U^T = U\Sigma\Sigma^T U^T = U\Sigma^2 U^T$. Similarly, $X^T X = V\Sigma^2 V^T$. Both have non-zero eigenvalues $\sigma_i^2$.

Why It Matters

This duality between $XX^T$ and $X^T X$ is the foundation of kernel PCA. When $d \gg n$ (more features than samples), working with the $n \times n$ Gram matrix is cheaper. When you use a kernel, you only have access to $K$ (the Gram matrix in feature space), and the eigendecomposition of $K$ gives you PCA in feature space.
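A quick numerical sanity check of this spectral correspondence (NumPy, random data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))  # n = 5 points in d = 3 dimensions

G = X @ X.T                                   # n x n Gram matrix
sigma = np.linalg.svd(X, compute_uv=False)    # singular values, descending
eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]  # eigenvalues, descending

# Top d eigenvalues of G are the squared singular values; the rest vanish
assert np.allclose(eig_G[:3], sigma ** 2)
assert np.allclose(eig_G[3:], 0.0)

# X^T X shares the same nonzero spectrum
eig_C = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
assert np.allclose(eig_C, sigma ** 2)
```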

Failure Mode

Computing the full eigendecomposition of $G$ costs $O(n^3)$. For large $n$, this is prohibitive, and you need approximate methods (randomized SVD, Nyström approximation). The Gram matrix also requires $O(n^2)$ storage, which limits applicability to datasets with many observations.

Connections to ML

Kernel Methods

In kernel methods, you work with $K_{ij} = k(x_i, x_j)$ and never compute the feature map $\phi$ explicitly. The kernel matrix $K$ is all you need for:

  • Kernel SVM: the dual formulation involves $K$ directly
  • Kernel PCA: eigendecompose $K$ to get principal components in feature space
  • Gaussian processes: the kernel matrix (plus noise) is the covariance matrix of the function values

Attention

In a transformer, the attention matrix is:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

Here $Q = XW_Q$ and $K = XW_K$ with generally $W_Q \neq W_K$, so $(QK^T)_{ij} = q_i^T k_j$ is a cross inner-product matrix: it records inner products between two different linear projections of the input tokens. This is not a Gram matrix in the strict sense. It is generally neither symmetric nor positive semidefinite (Exercise 2 works this out). Before the softmax, $QK^T$ can be read as a bilinear similarity score under the learned metric $W_Q W_K^T$. Only in the special case $W_Q = W_K$ (tied weights) does $QK^T$ reduce to a true Gram matrix. The Gram-matrix intuition (pairwise similarities) transfers; the PSD property does not.
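A small NumPy sketch of the contrast (the dimensions and random weights are arbitrary illustrative choices, not a real model):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 8))      # 4 tokens, model dimension 8
W_Q = rng.standard_normal((8, 8))
W_K = rng.standard_normal((8, 8))

# Untied projections: cross inner-product matrix, generally not symmetric
S = (X @ W_Q) @ (X @ W_K).T
assert not np.allclose(S, S.T)

# Tied projections (W_Q = W_K): a true Gram matrix, symmetric and PSD
S_tied = (X @ W_Q) @ (X @ W_Q).T
assert np.allclose(S_tied, S_tied.T)
assert np.all(np.linalg.eigvalsh(S_tied) >= -1e-9)
```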

PCA and SVD

PCA on $n$ data points in $\mathbb{R}^d$ can be computed from either $X^T X$ ($d \times d$, primal) or $XX^T$ ($n \times n$, dual). The dual formulation uses the Gram matrix and is the starting point for kernel PCA.
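The dual route can be sketched in a few lines: eigendecompose the Gram matrix, then map each dual eigenvector $u_k$ back to a primal principal direction via $v_k = X^T u_k / \sqrt{\lambda_k}$, a standard identity that follows from the SVD. A NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
Xc = X - X.mean(axis=0)                 # center before PCA

# Dual PCA: eigendecompose the n x n Gram matrix
lam, U = np.linalg.eigh(Xc @ Xc.T)
lam, U = lam[::-1], U[:, ::-1]          # sort descending

# Recover the top primal direction: v = X^T u / sqrt(lambda)
v = Xc.T @ U[:, 0] / np.sqrt(lam[0])

# It matches the top eigenvector of the d x d matrix X^T X (up to sign)
_, V = np.linalg.eigh(Xc.T @ Xc)
v_primal = V[:, -1]                     # eigh sorts ascending
assert np.allclose(abs(v @ v_primal), 1.0)
```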

Common Confusions

Watch Out

Gram matrix vs covariance matrix

The Gram matrix is $XX^T$ (size $n \times n$, pairwise similarities between data points). The covariance matrix is $X^T X / n$ (size $d \times d$, pairwise relationships between features), assuming centered data. They encode complementary views of the same dataset: point similarities vs feature correlations.
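A short sketch making the two views concrete (NumPy, centered random data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 7, 3
X = rng.standard_normal((n, d))
Xc = X - X.mean(axis=0)   # the covariance view assumes centered data

G = Xc @ Xc.T             # n x n: pairwise similarities between data points
C = Xc.T @ Xc / n         # d x d: covariance between features

assert G.shape == (n, n) and C.shape == (d, d)

# Complementary views of the same data: identical nonzero spectrum up to 1/n
eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]
eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(eig_G[:d] / n, eig_C)
```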

Watch Out

PSD does not mean all entries are non-negative

A Gram matrix can have negative entries. For example, if $x_1 = (1, 0)$ and $x_2 = (-1, 0)$, then $G_{12} = -1$. PSD refers to the quadratic form $c^T G c \geq 0$ for all $c$, not to individual entries.

Summary

  • Gram matrix: $G_{ij} = \langle x_i, x_j \rangle$. For data matrix $X$: $G = XX^T$
  • Always PSD. Eigenvalues are non-negative.
  • Kernel matrix: same structure with $k(x_i, x_j)$ replacing $\langle x_i, x_j \rangle$
  • Non-zero eigenvalues of $G$ equal squared singular values of $X$
  • Appears in kernel methods, PCA (dual formulation), and SVD. Attention's $QK^T$ is a related cross inner-product matrix (symmetric/PSD only with tied $W_Q = W_K$)

Exercises

ExerciseCore

Problem

Given $x_1 = (1, 2)$, $x_2 = (3, 0)$, $x_3 = (0, 1)$, compute the Gram matrix $G = XX^T$ where the rows of $X$ are $x_1, x_2, x_3$. Verify that $G$ is PSD by checking that all eigenvalues are non-negative.

ExerciseAdvanced

Problem

Show that the attention score matrix $QK^T$ in a transformer is not a Gram matrix in general. Under what tying conditions on the projections $W_Q, W_K$ (or on the inputs) does $QK^T$ reduce to a Gram matrix, and what does that imply about its eigenvalues? Construct an explicit example where $QK^T$ has a negative eigenvalue.

References

Canonical:

  • Horn & Johnson, Matrix Analysis (2013), Chapter 7 (Positive Definite Matrices)
  • Axler, Linear Algebra Done Right (2024), Chapter 6
  • Schölkopf & Smola, Learning with Kernels (2002), Chapter 2
  • Shawe-Taylor & Cristianini, Kernel Methods for Pattern Analysis (2004), Chapters 3-4
  • Aronszajn, "Theory of Reproducing Kernels", Transactions of the American Mathematical Society (1950)

Current:

  • Vaswani et al., "Attention Is All You Need" (2017), for the attention connection
  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 16
  • Wainwright, High-Dimensional Statistics (2019), Chapter 12
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 16

Last reviewed: April 26, 2026
