
Numerical Optimization

Conjugate Gradient Methods

Conjugate gradient solves large symmetric positive definite linear systems using only matrix-vector products, building A-conjugate search directions within growing Krylov subspaces; its convergence rate depends on the condition number and can be improved by preconditioning.

Advanced · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Many ML and numerical optimization problems require solving

Ax = b

where A is too large to factor. Direct Cholesky factorization costs cubic time and requires explicit matrix storage. Conjugate gradient (CG) uses only matrix-vector products Av, so it works when A is sparse, structured, implicit, or available only through Hessian-vector products.
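To make that interface concrete, here is a minimal sketch, assuming NumPy and SciPy are available, of solving a system whose operator is defined only by its action on a vector; the diagonal operator, the problem size, and all variable names are illustrative choices, not part of this page.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 10_000
d = np.linspace(1.0, 100.0, n)   # spectrum of an implicit SPD operator (illustrative)

def matvec(v):
    # The operator is never stored as a dense matrix; only its action is defined.
    return d * v

A_op = LinearOperator((n, n), matvec=matvec)
b = np.ones(n)

x, info = cg(A_op, b)            # info == 0 signals convergence
print(info, np.linalg.norm(d * x - b))
```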

This is why CG appears in least squares, Gaussian processes, kernel methods, natural gradients, Hessian-free optimization, and TRPO-style policy optimization.

Quadratic View

For symmetric positive definite A, solving Ax = b is equivalent to minimizing

q(x) = (1/2) x^T A x - b^T x.

The gradient is

\nabla q(x) = Ax - b.

The residual r_k = b - Ax_k is therefore the negative gradient. CG is an optimization method for this special quadratic, but it is also a linear solver.
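A small NumPy check of this equivalence; the matrix and vectors below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # small SPD matrix
b = rng.standard_normal(5)

def q(x):
    return 0.5 * x @ A @ x - b @ x

def grad_q(x):
    return A @ x - b

x_star = np.linalg.solve(A, b)          # exact solution of Ax = b
print(np.linalg.norm(grad_q(x_star)))   # ~0: the gradient vanishes at the solution

x = rng.standard_normal(5)
r = b - A @ x                           # residual at an arbitrary point
print(np.allclose(r, -grad_q(x)))       # True: residual equals the negative gradient
```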

A-Conjugacy

Definition

A-Conjugate Directions

Nonzero vectors d_0, \ldots, d_k are A-conjugate when

d_i^T A d_j = 0 \quad \text{for} \quad i \ne j.

This is ordinary orthogonality under the inner product \langle u, v \rangle_A = u^T A v.

Steepest descent uses directions that are orthogonal in the Euclidean sense only at consecutive steps. CG uses directions that are mutually orthogonal in the geometry induced by A. That is the key reason it avoids the classic zigzag on elongated ellipses.
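One way to see the definition in action is to orthogonalize a set of vectors by Gram-Schmidt in the A-inner product. This sketch, assuming NumPy and an arbitrary random SPD matrix, verifies that the resulting directions are A-conjugate.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)             # SPD test matrix (illustrative)

def a_inner(u, v):
    return u @ A @ v                    # <u, v>_A = u^T A v

# Gram-Schmidt in the A-inner product: the resulting directions are A-conjugate.
vectors = [rng.standard_normal(4) for _ in range(4)]
dirs = []
for v in vectors:
    d = v.copy()
    for p in dirs:
        d -= a_inner(v, p) / a_inner(p, p) * p
    dirs.append(d)

# Off-diagonal entries of the A-Gram matrix are ~0 (up to rounding).
G = np.array([[a_inner(di, dj) for dj in dirs] for di in dirs])
print(np.max(np.abs(G - np.diag(np.diag(G)))))
```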

Krylov Subspaces

Definition

Krylov Subspace

Starting from the residual r_0 = b - Ax_0, the order-k Krylov subspace is \mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, A r_0, \ldots, A^{k-1} r_0\}.

CG chooses x_k from x_0 + \mathcal{K}_k(A, r_0). Each iteration expands the search space by one matrix-vector product.
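A quick NumPy illustration, with an arbitrary small SPD matrix, of how the Krylov subspace dimension grows with each additional matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)             # SPD test matrix (illustrative)
b = rng.standard_normal(6)
r0 = b                                   # with x0 = 0 the initial residual is b

# Columns r0, A r0, A^2 r0, ... span the growing Krylov subspaces.
cols = [r0]
for _ in range(5):
    cols.append(A @ cols[-1])
K = np.column_stack(cols)

for k in range(1, 7):
    print(k, np.linalg.matrix_rank(K[:, :k]))   # dimension of K_k(A, r0)
```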

Theorem

CG Variational Characterization

Statement

The k-th CG iterate is the unique minimizer of q(x) over the affine Krylov space:

x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} q(x).

Equivalently, x_k minimizes the A-norm error over that same space.

Intuition

Each iteration adds one new direction. CG picks the best point available in all directions discovered so far, not merely the best point along the newest direction.

Proof Sketch

First-order optimality over the affine subspace requires the residual to be orthogonal to the Krylov subspace. The CG recurrences maintain this orthogonality while generating an A-conjugate basis for the same Krylov space.

The Algorithm

For SPD A and initial guess x_0 (steps 2-6 repeat until the residual is small enough):

  1. Set r_0 = b - Ax_0 and d_0 = r_0.
  2. Compute \alpha_k = (r_k^T r_k)/(d_k^T A d_k).
  3. Set x_{k+1} = x_k + \alpha_k d_k.
  4. Set r_{k+1} = r_k - \alpha_k A d_k.
  5. Compute \beta_{k+1} = (r_{k+1}^T r_{k+1})/(r_k^T r_k).
  6. Set d_{k+1} = r_{k+1} + \beta_{k+1} d_k.

Each iteration uses one matrix-vector product, a few dot products, and a handful of axpy-style vector updates. That is the computational win.
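Here is a minimal NumPy sketch of these six steps; the function name, tolerance handling, and stopping rule are illustrative choices rather than a canonical implementation.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, maxiter=None):
    """Plain CG for SPD A (dense here; only the product A @ v is actually needed)."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    maxiter = n if maxiter is None else maxiter

    r = b - A @ x                       # step 1: initial residual
    d = r.copy()                        #          and first search direction
    rs_old = r @ r

    for _ in range(maxiter):
        Ad = A @ d                      # the single matrix-vector product per iteration
        alpha = rs_old / (d @ Ad)       # step 2: exact minimizer along d
        x += alpha * d                  # step 3: update iterate
        r -= alpha * Ad                 # step 4: update residual
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break                       # stop on relative residual
        beta = rs_new / rs_old          # step 5
        d = r + beta * d                # step 6: new A-conjugate direction
        rs_old = rs_new
    return x

# Quick check against a direct solve on an arbitrary SPD test matrix.
rng = np.random.default_rng(3)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
print(np.linalg.norm(conjugate_gradient(A, b) - np.linalg.solve(A, b)))
```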

Main Guarantees

Theorem

Finite Termination In Exact Arithmetic

Statement

CG reaches the exact solution in at most n iterations for an n-by-n SPD matrix. More sharply, it terminates in at most d iterations, where d is the degree of the minimal polynomial of A with respect to the initial residual.

Intuition

There can be at most n mutually A-conjugate nonzero directions in R^n. Once the search space contains the exact error direction, CG has no error left.

Proof Sketch

The CG directions are A-conjugate and span the Krylov spaces. The variational characterization says x_k is optimal over the current Krylov space. When the Krylov space captures the whole invariant subspace generated by the initial residual, the error is zero.

Failure Mode

Floating-point arithmetic destroys exact conjugacy. Practical CG stops by tolerance, not by waiting for exactly n iterations.

Theorem

Condition-Number Convergence Bound

Statement

Let \kappa = \lambda_{\max}(A)/\lambda_{\min}(A) and let x^* be the exact solution of Ax = b. Then

\|x_k - x^*\|_A \leq 2\left((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\right)^k \|x_0 - x^*\|_A.

Intuition

In this worst-case bound, CG's rate depends on \sqrt{\kappa} rather than \kappa; the comparable steepest descent bound has rate (\kappa-1)/(\kappa+1). For \kappa = 10^4, the CG factor is about 0.98 per iteration versus about 0.9998 for steepest descent, a major improvement on ill-conditioned quadratics.

Proof Sketch

The CG error can be written as p_k(A) e_0, where p_k is a degree-k polynomial with p_k(0) = 1. The method chooses the best such polynomial for the spectrum of A. Chebyshev polynomials give the stated worst-case bound on the spectral interval.

Why It Matters

The bound explains preconditioning: change the system so the eigenvalues cluster and the effective condition number shrinks.

Failure Mode

This bound can be pessimistic. Eigenvalue clustering often matters more than the raw condition number. If A has only a few distinct eigenvalue clusters, CG can converge much faster than the bound predicts.
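A small experiment illustrates the point: the two spectra below share the same condition number, but CG needs only a handful of iterations when the eigenvalues fall into a few clusters. The spectra, sizes, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

n = 200
spread = np.linspace(1.0, 1000.0, n)                   # 200 spread-out eigenvalues
clusters = np.concatenate([np.full(n // 3 + n % 3, 1.0),
                           np.full(n // 3, 500.0),
                           np.full(n // 3, 1000.0)])   # only 3 distinct eigenvalues

def cg_iters(eigs, tol=1e-10):
    """Iterations a basic CG needs to reach ||r|| <= tol*||b|| on A = diag(eigs)."""
    A = np.diag(eigs)
    b = np.ones(len(eigs))
    x = np.zeros(len(eigs))
    r = b - A @ x
    d = r.copy()
    rs = r @ r
    for k in range(1, 10 * len(eigs)):
        Ad = A @ d
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return k
        d = r + (rs_new / rs) * d
        rs = rs_new
    return None

print("spread spectrum:   ", cg_iters(spread), "iterations")
print("clustered spectrum:", cg_iters(clusters), "iterations")
```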

Preconditioning

Preconditioning replaces the original system by one with a better spectrum. A left preconditioner uses

M^{-1} A x = M^{-1} b,

where M is cheap to solve with and approximates A. For SPD systems, the preconditioned method is usually written in a symmetric form so the CG assumptions remain valid.

Good preconditioners reduce iterations. Bad preconditioners add solve cost without improving the spectrum enough.

Preconditioner          | Useful when                      | Tradeoff
diagonal / Jacobi       | diagonal scaling is poor         | cheap, weak
incomplete Cholesky     | sparse SPD systems               | stronger, setup cost
low-rank plus diagonal  | kernel or Hessian approximations | uses structure
multigrid               | PDE-like systems                 | very strong when geometry fits
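As a concrete instance of the first row above, here is a minimal sketch of CG with a Jacobi (diagonal) preconditioner; the function name and the badly scaled test matrix are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-8, maxiter=None):
    """Preconditioned CG with M = diag(A); applying M^{-1} is elementwise division."""
    n = b.shape[0]
    maxiter = n if maxiter is None else maxiter
    Minv = 1.0 / np.diag(A)

    x = np.zeros(n)
    r = b.copy()
    z = Minv * r                        # preconditioned residual
    d = z.copy()
    rz = r @ z

    for _ in range(maxiter):
        Ad = A @ d
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = Minv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d       # beta uses r^T z instead of r^T r
        rz = rz_new
    return x

# Badly scaled SPD test matrix: Jacobi preconditioning undoes the diagonal scaling.
rng = np.random.default_rng(4)
scales = 10.0 ** rng.uniform(-3, 3, 100)
M0 = rng.standard_normal((100, 100))
A = np.diag(scales) @ (M0 @ M0.T / 100 + np.eye(100)) @ np.diag(scales)
b = rng.standard_normal(100)
print(np.linalg.norm(A @ pcg_jacobi(A, b) - b))
```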

Nonlinear Conjugate Gradient

For a general smooth objective f(x), nonlinear CG replaces residuals with negative gradients g_k = \nabla f(x_k) and uses a line search:

d_k = -g_k + \beta_k d_{k-1}.

Two common formulas:

  • Fletcher-Reeves: \beta_k = \|g_k\|^2 / \|g_{k-1}\|^2.
  • Polak-Ribiere: \beta_k = g_k^T (g_k - g_{k-1}) / \|g_{k-1}\|^2.

Nonlinear CG no longer has exact finite termination. Strong Wolfe line search and periodic restarts matter because conjugacy is only approximate away from quadratics.
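Rather than hand-rolling the recurrence, one can see nonlinear CG in action through SciPy's optimizer: its method="CG" is a Polak-Ribiere-style nonlinear conjugate gradient with a line search. The Rosenbrock test problem below is just a standard illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Nonlinear CG never forms a Hessian or solves a linear system; it only needs
# the objective, its gradient, and a line search along each search direction.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der, method="CG")
print(res.x, res.nit)   # should land near the minimizer (1, 1)
```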

ML Connections

  • Least squares: CG solves normal equations or better-conditioned equivalent systems when direct factorization is too expensive.
  • Gaussian processes: kernel systems can be solved by CG using fast kernel-vector products.
  • Natural gradient and TRPO: CG approximately solves Fisher-vector systems without forming the Fisher matrix.
  • Hessian-free optimization: CG solves Newton-like systems using Hessian-vector products (see the sketch after this list).
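To illustrate the last two bullets, here is a hedged sketch, assuming NumPy and SciPy, of solving a Newton system H p = -g purely through Hessian-vector products. The ridge least-squares objective, sizes, and names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
M = rng.standard_normal((500, 100))
y = rng.standard_normal(500)
lam = 1e-2

# Ridge-regularized least squares: f(w) = 0.5*||Mw - y||^2 + 0.5*lam*||w||^2.
def grad(w):
    return M.T @ (M @ w - y) + lam * w

# Hessian-vector product without ever forming the Hessian M^T M + lam*I.
def hvp(v):
    return M.T @ (M @ v) + lam * v

w0 = np.zeros(100)
H_op = LinearOperator((100, 100), matvec=hvp)
newton_step, info = cg(H_op, -grad(w0))       # solve H p = -g with CG
print(info, np.linalg.norm(hvp(newton_step) + grad(w0)))  # Newton-system residual
```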

Common Confusions

Watch Out

CG is not momentum

Momentum averages past gradients. CG constructs directions that are conjugate under A. For quadratics, that geometry gives finite termination in exact arithmetic; momentum does not.

Watch Out

Standard CG requires SPD matrices

If A is not symmetric positive definite, standard CG is not the right method. Use MINRES for symmetric indefinite systems, or GMRES/BiCGSTAB for non-symmetric systems.

Watch Out

The residual you compute may not equal the true residual forever

Floating-point recurrence updates can drift from b - Ax_k. Robust implementations occasionally recompute the true residual.

Watch Out

Normal equations can square the condition number

Applying CG to A^T A x = A^T b squares the condition number of A. When possible, use methods designed for least squares, such as LSQR, or a better-conditioned formulation.
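A two-line NumPy check of this effect on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 50))
print(np.linalg.cond(A))          # condition number of A
print(np.linalg.cond(A.T @ A))    # roughly the square of the number above
```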

Q&A For Mastery

Why is CG fast when it uses no Hessian inverse? It builds a sequence of Krylov subspaces from repeated matrix-vector products. Those subspaces capture spectral information about A without factorizing A.

What is the first debugging plot? Plot relative residual norm versus iteration on a log scale. A flat curve usually means poor conditioning, bad preconditioning, or loss of conjugacy.
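A minimal sketch of such a plot using SciPy's cg callback and Matplotlib; the random ill-conditioned test matrix is an arbitrary illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.linalg import cg

rng = np.random.default_rng(7)
M = rng.standard_normal((300, 300))
A = M @ M.T + 0.5 * np.eye(300)     # SPD but fairly ill-conditioned
b = rng.standard_normal(300)

# The callback receives the current iterate; record the relative residual each step.
res = []
cg(A, b, maxiter=300,
   callback=lambda xk: res.append(np.linalg.norm(b - A @ xk) / np.linalg.norm(b)))

plt.semilogy(res)
plt.xlabel("iteration")
plt.ylabel("relative residual")
plt.show()
```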

When should I stop? Use a residual tolerance tied to the downstream task. Solving the linear system to machine precision is often wasted if the outer optimization or data noise dominates.

What To Remember

  • CG solves SPD systems using matrix-vector products.
  • It minimizes the quadratic over a growing Krylov subspace.
  • A-conjugate directions prevent undoing earlier progress.
  • Exact arithmetic gives finite termination; floating point gives tolerance-based convergence.
  • Preconditioning is often the difference between useful CG and stalled CG.

Exercises

ExerciseCore

Problem

Why must d_k^T A d_k be positive in the CG step-size formula?

ExerciseAdvanced

Problem

If A has exactly r distinct eigenvalues, explain why CG can terminate in at most r iterations in exact arithmetic.

ExerciseResearch

Problem

Why can preconditioning improve CG even if it makes each iteration more expensive?

References

Canonical:

  • Hestenes and Stiefel, "Methods of Conjugate Gradients for Solving Linear Systems" (1952).
  • Trefethen and Bau, Numerical Linear Algebra, Lectures 38-39.
  • Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 5.
  • Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., Chapters 6 and 9.

Further:

  • Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (1994).
  • Greenbaum, Iterative Methods for Solving Linear Systems, Chapters 2-3.
  • Martens, "Deep Learning via Hessian-Free Optimization" (2010).
