
Foundations

Positive Semidefinite Matrices

PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.

Core · Tier 1 · Stable · Supporting · ~25 min

Why This Matters

Positive semidefiniteness is the matrix analogue of non-negativity for scalars. It appears in three central places in ML:

Covariance matrices are always PSD. If \Sigma is the covariance matrix of a random vector Z, then x^T \Sigma x = \text{Var}(x^T Z) \geq 0 for any vector x. The notation \succeq ("PSD") and ^T (transpose) recur on every line.

Hessians of convex functions are PSD. If f is twice differentiable on an open convex set C, then f is convex on C if and only if \nabla^2 f(x) \succeq 0 for all x \in C. The convex-domain hypothesis is essential: convexity is a statement about line segments staying inside the domain, so the Hessian condition alone on a non-convex set does not imply convexity. This connects PSD matrices to the theory of convex optimization.

Kernel matrices (Gram matrices) are PSD by construction. The entire theory of kernel methods rests on this fact.
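
A quick numeric sanity check of the first and third claims, as a minimal NumPy sketch (the dimensions, sample counts, and the RBF kernel are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample covariance of 200 draws of a 5-dimensional random vector.
# Any covariance matrix is PSD by construction.
X = rng.normal(size=(200, 5))
Sigma_hat = np.cov(X, rowvar=False)
print(np.linalg.eigvalsh(Sigma_hat).min())  # >= 0 (up to floating-point noise)

# Gram matrix of an RBF kernel on 50 random points: also PSD by construction.
pts = rng.normal(size=(50, 3))
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print(np.linalg.eigvalsh(K).min())          # >= 0 (up to floating-point noise)
```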

The key habit here is to stop treating "PSD" as a mysterious adjective and start translating it immediately into one of three concrete statements: every quadratic form x^\top A x is nonnegative, every eigenvalue is nonnegative, or the matrix can be factored as a Gram object like B^\top B. Strong intuition comes from moving fluidly between those three views rather than memorizing a single definition in isolation.

That fluency matters because different ML arguments naturally live in different languages. Optimization proofs usually speak in Hessians and curvature, statistics speaks in covariance and Fisher information, and kernel methods speak in Gram matrices. PSD is the bridge connecting all three.

Five Equivalent Characterizations and When to Use Each

The PSD Equivalence Theorem (below) gives five different ways to define the same property. Each is useful in a different context. The table below is a practical decision aid:

| Characterization | What you check | When to reach for it | Cost |
| --- | --- | --- | --- |
| Quadratic form x^\top A x \ge 0 | Algebraic identity for all x | Theoretical proofs; checking that a constructed matrix is PSD by definition | Symbolic |
| Eigenvalues \lambda_i \ge 0 | Spectrum is non-negative | Numerical certificates; small matrices; when you already have the eigendecomposition | O(n^3) |
| Gram factorization A = B^\top B | A factors through some B | Sampling from Gaussians; constructing kernels; proving PSD by exhibiting B | O(n^2 r), where r is the rank of B |
| Cholesky A = LL^\top | L exists with positive diagonal (PD case) | Solving linear systems; log-determinant computation; sampling | O(n^3/3) |
| Principal minors all \ge 0 | Sign check on 2^n - 1 submatrix determinants | Symbolic / parametric proofs over a single small example; rarely the right tool numerically | Exponential in n |

Pick by context. Optimization proofs usually live in eigenvalues and quadratic forms. Bayesian sampling lives in Cholesky and Gram factorizations. Symbolic algebra sometimes wants principal minors. Numerical software almost always uses Cholesky internally because it is the cheapest direct test for positive definiteness on a real machine. Fluency in ML means moving between the first three views without friction.
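
To make the decision concrete, here is a minimal sketch of the three numerical routes (the helper names are mine, and the test matrices are arbitrary):

```python
import numpy as np

def is_psd_eigen(A, tol=1e-10):
    """Spectral test: every eigenvalue of the symmetric matrix A is >= -tol."""
    return np.linalg.eigvalsh(A).min() >= -tol

def is_pd_cholesky(A):
    """Cholesky test (PD only): np.linalg.cholesky succeeds exactly when A
    is numerically positive definite; this is the cheap direct test."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 6))
A = B @ B.T                        # Gram construction: PSD (here PD almost surely)

print(is_psd_eigen(A), is_pd_cholesky(A))   # True True
print(is_pd_cholesky(A - 10 * np.eye(4)))   # False: the shift makes it indefinite here

# Quadratic-form spot check on random probes (evidence, not a proof):
xs = rng.normal(size=(1000, 4))
print((np.einsum('ij,jk,ik->i', xs, A, xs) >= 0).all())  # True
```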

Quick Version

  • Quadratic-form test. A \succeq 0 means x^\top A x \ge 0 for every vector x.
  • Spectral test. For symmetric matrices, PSD is equivalent to all eigenvalues being non-negative.
  • Factorization view. PSD matrices can be written as A = B^\top B; PD matrices also admit a clean Cholesky factorization A = LL^\top.
  • Why ML cares. Covariance matrices, Hessians of convex objectives, Gram matrices, and Fisher information all live in the PSD world.

Core Definitions

Definition

Positive Semidefinite Matrix

A symmetric matrix A \in \mathbb{R}^{n \times n} is positive semidefinite (PSD) if and only if:

x^T A x \geq 0 \quad \text{for all } x \in \mathbb{R}^n

The matrix is positive definite (PD), written A \succ 0, if and only if the inequality is strict for all x \neq 0.

Definition

Loewner Ordering

For symmetric matrices A and B, the Loewner ordering is:

A \succeq B \iff A - B \succeq 0 \iff x^T(A - B)x \geq 0 \text{ for all } x

This is a partial order on symmetric matrices. It is not a total order: most pairs of matrices are incomparable.
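
A tiny numeric illustration, with matrices chosen purely for this example: if A - B has eigenvalues of both signs, neither A \succeq B nor B \succeq A holds.

```python
import numpy as np

A = np.diag([2.0, 1.0])
B = np.diag([1.0, 2.0])
# Mixed-sign eigenvalues of A - B: the pair is incomparable in the Loewner order.
print(np.linalg.eigvalsh(A - B))  # [-1.  1.]
```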

Main Theorems

Theorem

PSD Equivalence Theorem

Statement

The following are equivalent for a symmetric matrix A \in \mathbb{R}^{n \times n}:

  1. A \succeq 0 (i.e., x^T A x \geq 0 for all x)
  2. All eigenvalues of A are non-negative
  3. A = B^T B for some matrix B
  4. All principal minors of A are non-negative
  5. A = LL^T for some lower triangular L with non-negative diagonal (the Cholesky decomposition; when A \succ 0, L has strictly positive diagonal and is unique)

Intuition

PSD means the quadratic form x^T A x is a bowl that never dips below zero. Eigenvalues are the curvatures along the principal axes. Non-negative curvature in every direction means non-negative eigenvalues.

Proof Sketch

(1 \Rightarrow 2): If Av = \lambda v with \|v\| = 1, then \lambda = v^T A v \geq 0. (2 \Rightarrow 3): By the spectral theorem, A = Q \Lambda Q^T where \Lambda has non-negative entries. Set B = \Lambda^{1/2} Q^T. (3 \Rightarrow 1): x^T A x = x^T B^T B x = \|Bx\|^2 \geq 0.
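
The (2 \Rightarrow 3) step is constructive, which makes it easy to verify numerically; a minimal sketch (the test matrix and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(5, 5))
A = C @ C.T                                  # an arbitrary PSD test matrix

lam, Q = np.linalg.eigh(A)                   # spectral theorem: A = Q diag(lam) Q^T
B = np.diag(np.sqrt(np.clip(lam, 0, None))) @ Q.T   # B = Lambda^{1/2} Q^T

print(np.allclose(B.T @ B, A))               # True: A = B^T B, as constructed
```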

Why It Matters

Different characterizations are useful in different contexts. Checking eigenvalues is O(n^3). Checking the quadratic form definition is useful for proofs. The factorization A = B^T B is used in sampling from Gaussians: if Z \sim N(0, I) and \Sigma = LL^T, then LZ \sim N(0, \Sigma).
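
A minimal sketch of that sampling recipe (the covariance matrix is an arbitrary PD example):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # an arbitrary PD covariance
L = np.linalg.cholesky(Sigma)         # Sigma = L L^T

Z = rng.normal(size=(100_000, 2))     # rows are draws of Z ~ N(0, I)
X = Z @ L.T                           # rows are draws of L Z ~ N(0, Sigma)

print(np.cov(X, rowvar=False))        # approximately Sigma
```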

Failure Mode

Symmetry is required. A non-symmetric matrix can satisfy x^T A x \geq 0 for all x without having real eigenvalues. The standard convention in ML is that PSD refers to symmetric (or Hermitian) matrices only.

The Sylvester-criterion subtlety: for positive definiteness (A \succ 0) it is enough that all leading principal minors are positive, a clean, easy-to-check criterion. For positive semidefiniteness (A \succeq 0), leading principal minors being non-negative is not enough; you need every principal minor (not just the leading ones) to be non-negative. A standard counterexample is A = \mathrm{diag}(0, -1): the leading principal minor of size 1 is 0 (non-negative) and the determinant is 0 (also non-negative), yet the eigenvalue -1 shows A \not\succeq 0. Adding the bottom-right principal minor, -1, to the check would have caught it.
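
A brute-force sweep over all principal minors makes the counterexample concrete; the helper below is illustrative only and, as the table above warns, exponential in n:

```python
import numpy as np
from itertools import combinations

def principal_minors(A):
    """All 2^n - 1 principal minors: det of A restricted to each non-empty
    index subset S (same subset for rows and columns)."""
    n = A.shape[0]
    return {S: np.linalg.det(A[np.ix_(S, S)])
            for r in range(1, n + 1) for S in combinations(range(n), r)}

A = np.diag([0.0, -1.0])
print(principal_minors(A))
# Leading minors (0,) -> 0 and (0, 1) -> 0 are non-negative, but the
# non-leading minor (1,) -> -1 exposes that A is not PSD.
```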

Cholesky Decomposition

Every positive definite matrix A has a unique factorization A = LL^T where L is lower triangular with positive diagonal entries. This is the Cholesky decomposition.

Computational cost: O(n^3/3), which is half the cost of a general LU decomposition. It is the preferred method for solving linear systems Ax = b when A is known to be PD: factor once, then solve Ly = b by forward substitution and L^T x = y by back-substitution.
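
In SciPy this factor-once, solve-twice pattern is exposed as cho_factor / cho_solve; a minimal sketch with an arbitrary PD matrix and right-hand side:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(3)
C = rng.normal(size=(6, 6))
A = C @ C.T + 1e-3 * np.eye(6)    # PD by construction (jitter guards against rank loss)
b = rng.normal(size=6)

c, low = cho_factor(A)            # one O(n^3/3) factorization
x = cho_solve((c, low), b)        # two O(n^2) triangular solves
print(np.allclose(A @ x, b))      # True
```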

For PSD (but not PD) matrices, a Cholesky factorization still exists, but L may have zeros on the diagonal and need not be unique.

Schur Complement

Definition

Schur Complement

Given a symmetric block matrix:

M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}

with A and C symmetric and A invertible, the Schur complement of A in M is:

S = C - B^T A^{-1} B

The key fact: if A \succ 0, then M \succeq 0 if and only if S \succeq 0. Block-convention warning: under the alternative layout M = \begin{pmatrix} A & B^\top \\ B & C \end{pmatrix} (with B in the lower-left block), the Schur complement of A becomes S = C - B A^{-1} B^\top. The formula for S depends on which convention is in force, so check where the transpose sits before applying the PSD criterion or a Loewner-order argument.

The Schur complement appears in conditional distributions: if (X, Y) \sim N(0, M), then the conditional covariance of Y given X is the Schur complement of the X-block.
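
A numeric sketch of both facts (block sizes and the construction of M are arbitrary). It leans on the standard block-inverse identity that the (Y, Y) block of M^{-1} equals S^{-1}, which is the precision-matrix form of the conditional-covariance statement:

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.normal(size=(5, 5))
M = G @ G.T + 0.1 * np.eye(5)     # PD block matrix, split as 3 (X) + 2 (Y)
A, B, C = M[:3, :3], M[:3, 3:], M[3:, 3:]

S = C - B.T @ np.linalg.inv(A) @ B          # Schur complement of the X-block

# (Y, Y) block of the precision matrix equals S^{-1}: S is the
# conditional covariance of Y given X for (X, Y) ~ N(0, M).
print(np.allclose(np.linalg.inv(M)[3:, 3:], np.linalg.inv(S)))   # True

# PSD criterion: with A > 0, M is PSD iff S is PSD (here both are PD).
print(np.linalg.eigvalsh(S).min() > 0, np.linalg.eigvalsh(M).min() > 0)
```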

Properties of PSD Matrices

  • Sum: if A \succeq 0 and B \succeq 0, then A + B \succeq 0
  • Scaling: if A \succeq 0 and \alpha \geq 0, then \alpha A \succeq 0
  • Congruence: if A \succeq 0, then BAB^T \succeq 0 for any matrix B
  • Trace: if A \succeq 0, then \text{tr}(A) \geq 0 (sum of eigenvalues)
  • Hadamard product: if A \succeq 0 and B \succeq 0, then A \circ B \succeq 0 (Schur product theorem)

The set of PSD matrices forms a convex cone: it is closed under addition and non-negative scalar multiplication.
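
Each closure property is easy to spot-check numerically; a sketch over random PSD matrices (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

def random_psd(n):
    C = rng.normal(size=(n, n))
    return C @ C.T                      # Gram construction: PSD

A, B = random_psd(4), random_psd(4)
R = rng.normal(size=(3, 4))             # arbitrary matrix for the congruence test
min_eig = lambda X: np.linalg.eigvalsh(X).min()

print(min_eig(A + B) >= -1e-10)         # sum stays PSD
print(min_eig(2.5 * A) >= -1e-10)       # non-negative scaling stays PSD
print(min_eig(R @ A @ R.T) >= -1e-10)   # congruence R A R^T stays PSD
print(min_eig(A * B) >= -1e-10)         # Hadamard product stays PSD (Schur product theorem)
```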

Common Confusions

Watch Out

PSD is about the symmetric part

For a non-symmetric matrix M, the quadratic form x^T M x = x^T \frac{M + M^T}{2} x. So the sign of the quadratic form depends only on the symmetric part (M + M^T)/2. When someone says "the Hessian is PSD," this is only meaningful because Hessians of twice-differentiable functions are symmetric.
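
A two-line numeric check of that identity, with an arbitrary non-symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))                 # deliberately non-symmetric
x = rng.normal(size=4)
sym = (M + M.T) / 2
print(np.isclose(x @ M @ x, x @ sym @ x))   # True: only the symmetric part matters
```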

Watch Out

Positive entries do not imply PSD

The matrix \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} has all positive entries but eigenvalues 3 and -1, so it is not PSD. Conversely, the identity matrix is PSD despite having zeros off the diagonal.
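
The eigenvalue computation behind the first example:

```python
import numpy as np

# All entries positive, yet one eigenvalue is negative: not PSD.
print(np.linalg.eigvalsh(np.array([[1.0, 2.0],
                                   [2.0, 1.0]])))   # [-1.  3.]
```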

Exercises

ExerciseCore

Problem

Prove that every covariance matrix is PSD. That is, if \Sigma = \mathbb{E}[(X - \mu)(X - \mu)^T], show \Sigma \succeq 0.

ExerciseAdvanced

Problem

Let A \succ 0 be n \times n positive definite. Prove that A^{-1} \succ 0.

References

Canonical:

  • Horn & Johnson, Matrix Analysis (2013), Chapter 7
  • Strang, Linear Algebra and Its Applications (2006), Section 6.5
  • Bhatia, Positive Definite Matrices (2007), Chapters 1-2 (characterizations and Loewner ordering)

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 2.3 (PSD cone)
  • Golub & Van Loan, Matrix Computations (2013), Chapter 4 (Cholesky decomposition)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Section 4.3 (positive definite matrices and Cholesky)

Last reviewed: April 26, 2026
