
Numerical Optimization

Conjugate Gradient Methods

Conjugate gradient solves large symmetric positive definite linear systems using only matrix-vector products, building A-conjugate search directions within growing Krylov subspaces; its convergence rate depends on the condition number and can be improved by preconditioning.

Advanced · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Many ML and numerical optimization problems require solving

Ax = b

where A is too large to factor. Direct Cholesky factorization costs cubic time and requires explicit matrix storage. Conjugate gradient (CG) uses only matrix-vector products Av, so it works when A is sparse, structured, implicit, or available only through Hessian-vector products.
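To make that interface concrete, here is a minimal sketch, assuming NumPy and SciPy are available, of solving a system whose operator is defined only by its action on a vector; the diagonal operator, the problem size, and all variable names are illustrative choices, not part of this page.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 10_000
d = np.linspace(1.0, 100.0, n)   # spectrum of an implicit SPD operator (illustrative)

def matvec(v):
    # The operator is never stored as a dense matrix; only its action is defined.
    return d * v

A_op = LinearOperator((n, n), matvec=matvec)
b = np.ones(n)

x, info = cg(A_op, b)            # info == 0 signals convergence
print(info, np.linalg.norm(d * x - b))
```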

This is why CG appears in least squares, Gaussian processes, kernel methods, natural gradients, Hessian-free optimization, and TRPO-style policy optimization.

Quadratic View

For symmetric positive definite A, solving Ax = b is equivalent to minimizing

q(x) = (1/2) x^T A x - b^T x.

The gradient is

\nabla q(x) = Ax - b.

The residual r_k = b - Ax_k is therefore the negative gradient. CG is an optimization method for this special quadratic, but it is also a linear solver.
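A small NumPy check of this equivalence; the matrix and vectors below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # small SPD matrix
b = rng.standard_normal(5)

def q(x):
    return 0.5 * x @ A @ x - b @ x

def grad_q(x):
    return A @ x - b

x_star = np.linalg.solve(A, b)          # exact solution of Ax = b
print(np.linalg.norm(grad_q(x_star)))   # ~0: the gradient vanishes at the solution

x = rng.standard_normal(5)
r = b - A @ x                           # residual at an arbitrary point
print(np.allclose(r, -grad_q(x)))       # True: residual equals the negative gradient
```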

A-Conjugacy

Definition

A-Conjugate Directions

Nonzero vectors d_0, \ldots, d_k are A-conjugate when

d_i^T A d_j = 0 \quad \text{for} \quad i \ne j.

This is ordinary orthogonality under the inner product \langle u, v \rangle_A = u^T A v.

Steepest descent uses directions that are orthogonal in the Euclidean sense only at consecutive steps. CG uses directions that are mutually orthogonal in the geometry induced by A. That is the key reason it avoids the classic zigzag on elongated ellipses.
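One way to see the definition in action is to orthogonalize a set of vectors by Gram-Schmidt in the A-inner product. This sketch, assuming NumPy and an arbitrary random SPD matrix, verifies that the resulting directions are A-conjugate.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)             # SPD test matrix (illustrative)

def a_inner(u, v):
    return u @ A @ v                    # <u, v>_A = u^T A v

# Gram-Schmidt in the A-inner product: the resulting directions are A-conjugate.
vectors = [rng.standard_normal(4) for _ in range(4)]
dirs = []
for v in vectors:
    d = v.copy()
    for p in dirs:
        d -= a_inner(v, p) / a_inner(p, p) * p
    dirs.append(d)

# Off-diagonal entries of the A-Gram matrix are ~0 (up to rounding).
G = np.array([[a_inner(di, dj) for dj in dirs] for di in dirs])
print(np.max(np.abs(G - np.diag(np.diag(G)))))
```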

Krylov Subspaces

Definition

Krylov Subspace

Starting from the residual r_0 = b - Ax_0, the order-k Krylov subspace is \mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, A r_0, \ldots, A^{k-1} r_0\}.

CG chooses x_k from x_0 + \mathcal{K}_k(A, r_0). Each iteration expands the search space by one matrix-vector product.
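A quick NumPy illustration, with an arbitrary small SPD matrix, of how the Krylov subspace dimension grows with each additional matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)             # SPD test matrix (illustrative)
b = rng.standard_normal(6)
r0 = b                                   # with x0 = 0 the initial residual is b

# Columns r0, A r0, A^2 r0, ... span the growing Krylov subspaces.
cols = [r0]
for _ in range(5):
    cols.append(A @ cols[-1])
K = np.column_stack(cols)

for k in range(1, 7):
    print(k, np.linalg.matrix_rank(K[:, :k]))   # dimension of K_k(A, r0)
```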

Theorem

CG Variational Characterization

Statement

The k-th CG iterate is the unique minimizer of q(x) over the affine Krylov space:

x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} q(x).

Equivalently, x_k minimizes the A-norm error over that same space.

Intuition

Each iteration adds one new direction. CG picks the best point available in all directions discovered so far, not merely the best point along the newest direction.

Proof Sketch

First-order optimality over the affine subspace requires the residual to be orthogonal to the Krylov subspace. The CG recurrences maintain this orthogonality while generating an A-conjugate basis for the same Krylov space.

The Algorithm

For SPD A and initial guess x_0 (steps 2-6 repeat until the residual is small enough):

  1. Set r_0 = b - Ax_0 and d_0 = r_0.
  2. Compute \alpha_k = (r_k^T r_k)/(d_k^T A d_k).
  3. Set x_{k+1} = x_k + \alpha_k d_k.
  4. Set r_{k+1} = r_k - \alpha_k A d_k.
  5. Compute \beta_{k+1} = (r_{k+1}^T r_{k+1})/(r_k^T r_k).
  6. Set d_{k+1} = r_{k+1} + \beta_{k+1} d_k.

Each iteration uses one matrix-vector product, a few dot products, and a handful of axpy-style vector updates. That is the computational win.
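Here is a minimal NumPy sketch of these six steps; the function name, tolerance handling, and stopping rule are illustrative choices rather than a canonical implementation.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, maxiter=None):
    """Plain CG for SPD A (dense here; only the product A @ v is actually needed)."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    maxiter = n if maxiter is None else maxiter

    r = b - A @ x                       # step 1: initial residual
    d = r.copy()                        #          and first search direction
    rs_old = r @ r

    for _ in range(maxiter):
        Ad = A @ d                      # the single matrix-vector product per iteration
        alpha = rs_old / (d @ Ad)       # step 2: exact minimizer along d
        x += alpha * d                  # step 3: update iterate
        r -= alpha * Ad                 # step 4: update residual
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break                       # stop on relative residual
        beta = rs_new / rs_old          # step 5
        d = r + beta * d                # step 6: new A-conjugate direction
        rs_old = rs_new
    return x

# Quick check against a direct solve on an arbitrary SPD test matrix.
rng = np.random.default_rng(3)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
print(np.linalg.norm(conjugate_gradient(A, b) - np.linalg.solve(A, b)))
```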

Main Guarantees

Theorem

Finite Termination In Exact Arithmetic

Statement

CG reaches the exact solution in at most n iterations for an n-by-n SPD matrix. More sharply, it terminates in at most d iterations, where d is the degree of the minimal polynomial of A with respect to the initial residual.

Intuition

There can be at most n mutually A-conjugate nonzero directions in R^n. Once the search space contains the exact error direction, CG has no error left.

Proof Sketch

The CG directions are A-conjugate and span the Krylov spaces. The variational characterization says x_k is optimal over the current Krylov space. When the Krylov space captures the whole invariant subspace generated by the initial residual, the error is zero.

Failure Mode

Floating-point arithmetic destroys exact conjugacy. Practical CG stops by tolerance, not by waiting for exactly n iterations.

Theorem

Condition-Number Convergence Bound

Statement

Let \kappa = \lambda_{\max}(A)/\lambda_{\min}(A) and let x^* be the exact solution of Ax = b. Then

\|x_k - x^*\|_A \leq 2\left((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\right)^k \|x_0 - x^*\|_A.

Intuition

In this worst-case bound, CG's rate depends on \sqrt{\kappa} rather than \kappa; the comparable steepest descent bound has rate (\kappa-1)/(\kappa+1). For \kappa = 10^4, the CG factor is about 0.98 per iteration versus about 0.9998 for steepest descent, a major improvement on ill-conditioned quadratics.

Proof Sketch

The CG error can be written as p_k(A) e_0, where p_k is a degree-k polynomial with p_k(0) = 1. The method chooses the best such polynomial for the spectrum of A. Chebyshev polynomials give the stated worst-case bound on the spectral interval.

Why It Matters

The bound explains preconditioning: change the system so the eigenvalues cluster and the effective condition number shrinks.

Failure Mode

This bound can be pessimistic. Eigenvalue clustering often matters more than the raw condition number. If A has only a few distinct eigenvalue clusters, CG can converge much faster than the bound predicts.
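A small experiment illustrates the point: the two spectra below share the same condition number, but CG needs only a handful of iterations when the eigenvalues fall into a few clusters. The spectra, sizes, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

n = 200
spread = np.linspace(1.0, 1000.0, n)                   # 200 spread-out eigenvalues
clusters = np.concatenate([np.full(n // 3 + n % 3, 1.0),
                           np.full(n // 3, 500.0),
                           np.full(n // 3, 1000.0)])   # only 3 distinct eigenvalues

def cg_iters(eigs, tol=1e-10):
    """Iterations a basic CG needs to reach ||r|| <= tol*||b|| on A = diag(eigs)."""
    A = np.diag(eigs)
    b = np.ones(len(eigs))
    x = np.zeros(len(eigs))
    r = b - A @ x
    d = r.copy()
    rs = r @ r
    for k in range(1, 10 * len(eigs)):
        Ad = A @ d
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return k
        d = r + (rs_new / rs) * d
        rs = rs_new
    return None

print("spread spectrum:   ", cg_iters(spread), "iterations")
print("clustered spectrum:", cg_iters(clusters), "iterations")
```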

Preconditioning

Preconditioning replaces the original system by one with a better spectrum. A left preconditioner uses

M^{-1} A x = M^{-1} b,

where M is cheap to solve with and approximates A. For SPD systems, the preconditioned method is usually written in a symmetric form so the CG assumptions remain valid.

Good preconditioners reduce iterations. Bad preconditioners add solve cost without improving the spectrum enough.

Preconditioner          | Useful when                      | Tradeoff
diagonal / Jacobi       | diagonal scaling is poor         | cheap, weak
incomplete Cholesky     | sparse SPD systems               | stronger, setup cost
low-rank plus diagonal  | kernel or Hessian approximations | uses structure
multigrid               | PDE-like systems                 | very strong when geometry fits
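As a concrete instance of the first row above, here is a minimal sketch of CG with a Jacobi (diagonal) preconditioner; the function name and the badly scaled test matrix are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-8, maxiter=None):
    """Preconditioned CG with M = diag(A); applying M^{-1} is elementwise division."""
    n = b.shape[0]
    maxiter = n if maxiter is None else maxiter
    Minv = 1.0 / np.diag(A)

    x = np.zeros(n)
    r = b.copy()
    z = Minv * r                        # preconditioned residual
    d = z.copy()
    rz = r @ z

    for _ in range(maxiter):
        Ad = A @ d
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = Minv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d       # beta uses r^T z instead of r^T r
        rz = rz_new
    return x

# Badly scaled SPD test matrix: Jacobi preconditioning undoes the diagonal scaling.
rng = np.random.default_rng(4)
scales = 10.0 ** rng.uniform(-3, 3, 100)
M0 = rng.standard_normal((100, 100))
A = np.diag(scales) @ (M0 @ M0.T / 100 + np.eye(100)) @ np.diag(scales)
b = rng.standard_normal(100)
print(np.linalg.norm(A @ pcg_jacobi(A, b) - b))
```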

Nonlinear Conjugate Gradient

For a general smooth objective f(x), nonlinear CG replaces residuals with negative gradients g_k = \nabla f(x_k) and uses a line search:

d_k = -g_k + \beta_k d_{k-1}.

Two common formulas:

  • Fletcher-Reeves: \beta_k = \|g_k\|^2 / \|g_{k-1}\|^2.
  • Polak-Ribiere: \beta_k = g_k^T (g_k - g_{k-1}) / \|g_{k-1}\|^2.

Nonlinear CG no longer has exact finite termination. Strong Wolfe line search and periodic restarts matter because conjugacy is only approximate away from quadratics.
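Rather than hand-rolling the recurrence, one can see nonlinear CG in action through SciPy's optimizer: its method="CG" is a Polak-Ribiere-style nonlinear conjugate gradient with a line search. The Rosenbrock test problem below is just a standard illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Nonlinear CG never forms a Hessian or solves a linear system; it only needs
# the objective, its gradient, and a line search along each search direction.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der, method="CG")
print(res.x, res.nit)   # should land near the minimizer (1, 1)
```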

ML Connections

  • Least squares: CG solves normal equations or better-conditioned equivalent systems when direct factorization is too expensive.
  • Gaussian processes: kernel systems can be solved by CG using fast kernel-vector products.
  • Natural gradient and TRPO: CG approximately solves Fisher-vector systems without forming the Fisher matrix.
  • Hessian-free optimization: CG solves Newton-like systems using Hessian-vector products (see the sketch after this list).
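To illustrate the last two bullets, here is a hedged sketch, assuming NumPy and SciPy, of solving a Newton system H p = -g purely through Hessian-vector products. The ridge least-squares objective, sizes, and names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
M = rng.standard_normal((500, 100))
y = rng.standard_normal(500)
lam = 1e-2

# Ridge-regularized least squares: f(w) = 0.5*||Mw - y||^2 + 0.5*lam*||w||^2.
def grad(w):
    return M.T @ (M @ w - y) + lam * w

# Hessian-vector product without ever forming the Hessian M^T M + lam*I.
def hvp(v):
    return M.T @ (M @ v) + lam * v

w0 = np.zeros(100)
H_op = LinearOperator((100, 100), matvec=hvp)
newton_step, info = cg(H_op, -grad(w0))       # solve H p = -g with CG
print(info, np.linalg.norm(hvp(newton_step) + grad(w0)))  # Newton-system residual
```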

Common Confusions

Watch Out

CG is not momentum

Momentum averages past gradients. CG constructs directions that are conjugate under A. For quadratics, that geometry gives finite termination in exact arithmetic; momentum does not.

Watch Out

Standard CG requires SPD matrices

If A is not symmetric positive definite, standard CG is not the right method. Use MINRES for symmetric indefinite systems, or GMRES/BiCGSTAB for non-symmetric systems.

Watch Out

The residual you compute may not equal the true residual forever

Floating-point recurrence updates can drift from b - Ax_k. Robust implementations occasionally recompute the true residual.

Watch Out

Normal equations can square the condition number

Applying CG to A^T A x = A^T b squares the condition number of A. When possible, use methods designed for least squares, such as LSQR, or a better-conditioned formulation.
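A two-line NumPy check of this effect on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 50))
print(np.linalg.cond(A))          # condition number of A
print(np.linalg.cond(A.T @ A))    # roughly the square of the number above
```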

Q&A For Mastery

Why is CG fast when it uses no Hessian inverse? It builds a sequence of Krylov subspaces from repeated matrix-vector products. Those subspaces capture spectral information about A without factorizing A.

What is the first debugging plot? Plot relative residual norm versus iteration on a log scale. A flat curve usually means poor conditioning, bad preconditioning, or loss of conjugacy.
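A minimal sketch of such a plot using SciPy's cg callback and Matplotlib; the random ill-conditioned test matrix is an arbitrary illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.linalg import cg

rng = np.random.default_rng(7)
M = rng.standard_normal((300, 300))
A = M @ M.T + 0.5 * np.eye(300)     # SPD but fairly ill-conditioned
b = rng.standard_normal(300)

# The callback receives the current iterate; record the relative residual each step.
res = []
cg(A, b, maxiter=300,
   callback=lambda xk: res.append(np.linalg.norm(b - A @ xk) / np.linalg.norm(b)))

plt.semilogy(res)
plt.xlabel("iteration")
plt.ylabel("relative residual")
plt.show()
```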

When should I stop? Use a residual tolerance tied to the downstream task. Solving the linear system to machine precision is often wasted if the outer optimization or data noise dominates.

What To Remember

  • CG solves SPD systems using matrix-vector products.
  • It minimizes the quadratic over a growing Krylov subspace.
  • A-conjugate directions prevent undoing earlier progress.
  • Exact arithmetic gives finite termination; floating point gives tolerance-based convergence.
  • Preconditioning is often the difference between useful CG and stalled CG.

Exercises

ExerciseCore

Problem

Why must d_k^T A d_k be positive in the CG step-size formula?

ExerciseAdvanced

Problem

If A has exactly r distinct eigenvalues, explain why CG can terminate in at most r iterations in exact arithmetic.

ExerciseResearch

Problem

Why can preconditioning improve CG even if it makes each iteration more expensive?

References

Canonical:

  • Hestenes and Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems" (1952).
  • Trefethen and Bau, Numerical Linear Algebra, Lectures 38-39.
  • Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 5.
  • Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., Chapters 6 and 9.

Further:

  • Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (1994).
  • Greenbaum, Iterative Methods for Solving Linear Systems, Chapters 2-3.
  • Martens, "Deep Learning via Hessian-Free Optimization" (2010).
