
Numerical Optimization

Nonlinear Gauss-Seidel

Block coordinate optimization with latest-value updates: alternating minimization, EM-style sweeps, and why block coupling decides whether the iterates cycle, stall, or converge.

Advanced · Tier 3 · Stable · Supporting · ~45 min

Why This Matters

Many ML objectives are not one smooth blob. They have blocks: latent assignments and parameters, encoder and decoder variables, matrix factors $W$ and $H$, primal and dual variables, or local worker models and a global consensus variable.

Nonlinear Gauss-Seidel is the clean mathematical pattern behind many alternating algorithms:

  • update block 1 while the other blocks are fixed
  • immediately use that new value when updating block 2
  • continue through the blocks
  • repeat until the sweep stops changing the objective

This page is not about a trendy optimizer. It is about recognizing when an algorithm is a block fixed-point method and asking the right questions: Are blocks weakly coupled? Are subproblems solved accurately enough? Is the objective decreasing? Can the method cycle?

Mental Model

Coordinate descent changes one coordinate or one block at a time. Jacobi-style methods update blocks in parallel from stale values. Gauss-Seidel-style methods update blocks sequentially and reuse the newest values immediately.

That one detail matters. If the first block update dramatically changes the best response of the second block, the next block may undo much of the previous work. If blocks are weakly coupled, a sweep behaves like a stable contraction. The practical question is not "is this block algorithm elegant?" The question is "does each block update make the other blocks easier or harder?"

Formal Setup

Definition

Block Decomposition

Write $x \in \mathbb{R}^n$ as $x = (x_1, \ldots, x_m)$ with $x_i \in \mathbb{R}^{n_i}$ and $\sum_i n_i = n$. For a function $f(x_1, \ldots, x_m)$, block $i$ means one group of variables is allowed to move while the other groups are held fixed.

Definition

Nonlinear Gauss-Seidel Sweep

Given $x^k = (x_1^k, \ldots, x_m^k)$, one exact nonlinear Gauss-Seidel sweep computes

$$x_i^{k+1} \in \arg\min_{u_i} f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, u_i, x_{i+1}^{k}, \ldots, x_m^{k})$$

for $i = 1, \ldots, m$. Earlier blocks use the current-sweep values $x_1^{k+1}, \ldots, x_{i-1}^{k+1}$; later blocks still use the old values $x_{i+1}^{k}, \ldots, x_m^{k}$.

In practice, the block subproblem may be solved approximately by Newton, quasi-Newton, proximal, closed-form least squares, dynamic programming, or a few inner gradient steps. That makes the method an inexact nonlinear Gauss-Seidel method.
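A minimal sketch of one such sweep, assuming the objective is a plain Python function of the full vector and each block subproblem is handed to a generic inner solver; the block layout, solver choice, and toy objective here are illustrative, not a fixed API.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_seidel_sweep(f, blocks, x):
    """One (inexact) nonlinear Gauss-Seidel sweep.

    f      : objective taking the full vector x
    blocks : list of index arrays, one per block
    x      : current iterate; blocks are overwritten in order, so later
             blocks immediately see the newest values
    """
    x = x.copy()
    for idx in blocks:
        # Subproblem: minimize f over this block with all other blocks frozen.
        def sub(u, idx=idx):
            y = x.copy()
            y[idx] = u
            return f(y)
        res = minimize(sub, x[idx], method="L-BFGS-B")  # inexact inner solve
        x[idx] = res.x  # the next block reuses this new value right away
    return x

# Toy example: two weakly coupled quadratic blocks.
f = lambda v: v[0] ** 2 + v[1] ** 2 + 0.5 * v[0] * v[1]
blocks = [np.array([0]), np.array([1])]
x = np.array([2.0, 2.0])
for _ in range(10):
    x = gauss_seidel_sweep(f, blocks, x)
print(x)  # approaches the unique minimizer (0, 0)
```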

Main Guarantees

Proposition

Exact Block Minimization Gives Monotone Descent

Statement

For an exact minimization sweep,

$$f(x_1^{k+1}, \ldots, x_i^{k+1}, x_{i+1}^{k}, \ldots, x_m^{k}) \leq f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^{k}, \ldots, x_m^{k})$$

for every block $i$. Therefore $f(x^{k+1}) \leq f(x^k)$ after the full sweep. If the level set is compact and the block-update limit points satisfy the needed first-order regularity conditions, every accumulation point is stationary. Global optimality needs additional convexity.

Intuition

Each block is allowed to pick the best value given the current values of the other blocks. That cannot increase the objective. But monotone decrease alone does not tell you which stationary point you reach, or whether the blocks stop fighting each other.

Proof Sketch

Apply the definition of the block argmin one block at a time and telescope the inequalities through the sweep. Stationarity of limit points comes from the first-order optimality conditions for each block subproblem plus continuity/closedness assumptions that let those conditions pass to a limit.

Why It Matters

This is the safety check for alternating minimization. If a proposed block update does not decrease the objective or a surrogate objective, you should not assume Gauss-Seidel theory is protecting it.
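One way to wire in that safety check, assuming deterministic block updates that are supposed to be descent steps; the callable interface, tolerance, and error handling below are placeholders, not a prescribed design.

```python
def checked_sweep(f, block_updates, x, tol=1e-10):
    """Run block updates in order and fail loudly if the objective ever increases.

    block_updates: list of callables, each returning a new full iterate with
    only its own block changed (a hypothetical interface for illustration).
    """
    f_prev = f(x)
    for i, update in enumerate(block_updates):
        x = update(x)
        f_new = f(x)
        if f_new > f_prev + tol:
            raise RuntimeError(
                f"block {i} increased the objective: {f_prev:.6g} -> {f_new:.6g}"
            )
        f_prev = f_new
    return x
```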

Failure Mode

In non-convex problems the method can converge to bad stationary points. If the subproblem minimizer is not unique, different tie-breaking rules can produce different trajectories. If block updates are only approximate, the approximation error must be controlled.

Theorem

Contraction Gives Linear Convergence

Statement

Let $G$ be the full sweep map, so $x^{k+1} = G(x^k)$. If there exists $\rho < 1$ such that

$$\|G(x) - G(y)\| \leq \rho \|x - y\|$$

on a closed region containing the iterates, then $G$ has a unique fixed point $x^*$ in that region and

$$\|x^k - x^*\| \leq \rho^k \|x^0 - x^*\|.$$

Intuition

The sweep is a function from "old blocks" to "new blocks." If every sweep compresses distances, the iterates cannot wander, cycle, or split into multiple possible limits inside that region.

Proof Sketch

This is the Banach fixed-point theorem applied to the complete metric space containing the iterates. The fixed point of the sweep satisfies the block optimality conditions.

Why It Matters

Contraction is usually too strong to prove globally, but it gives the right engineering diagnostic: strong cross-block curvature makes alternating methods fragile; weak cross-block curvature makes them stable.
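You rarely have $\rho$ in closed form, but you can estimate an empirical contraction rate from successive sweeps. A sketch, assuming the sweep map $G$ is available as a Python callable; the sweep count and tolerance are arbitrary choices.

```python
import numpy as np

def empirical_rate(G, x0, n_sweeps=30):
    """Ratios ||x_{k+1} - x_k|| / ||x_k - x_{k-1}|| as a rough local estimate of rho."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_sweeps):
        xs.append(G(xs[-1]))
    steps = [np.linalg.norm(b - a) for a, b in zip(xs, xs[1:])]
    # Ratios consistently below 1 suggest local contraction; ratios near or
    # above 1 suggest zigzagging or stalling.
    return [s2 / s1 for s1, s2 in zip(steps, steps[1:]) if s1 > 1e-14]
```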

How To Diagnose Coupling

Near a smooth local minimizer, partition the Hessian into blocks. A rough coupling lens is:

$$\|H_{ii}^{-1} H_{ij}\|.$$

Small off-diagonal influence relative to within-block curvature suggests block $j$ does not wildly move the best response of block $i$. Large values suggest a sweep may zigzag or need damping.
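A sketch of that diagnostic, assuming you can form (or approximate) the Hessian at the current point; the block index sets and the quadratic example are illustrative.

```python
import numpy as np

def coupling_scores(H, blocks):
    """Spectral-norm estimate of ||H_ii^{-1} H_ij|| for every ordered block pair."""
    scores = {}
    for i, bi in enumerate(blocks):
        Hii = H[np.ix_(bi, bi)]
        for j, bj in enumerate(blocks):
            if i == j:
                continue
            Hij = H[np.ix_(bi, bj)]
            scores[(i, j)] = np.linalg.norm(np.linalg.solve(Hii, Hij), 2)
    return scores

# Example: f(x, y) = x^2 + y^2 + 0.5*x*y has this Hessian.
H = np.array([[2.0, 0.5],
              [0.5, 2.0]])
print(coupling_scores(H, [np.array([0]), np.array([1])]))  # 0.25 both ways: weak coupling
```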

| Signal | What it suggests | Practical response |
| --- | --- | --- |
| objective decreases smoothly | blocks are cooperating | keep exact or high-quality block solves |
| objective decreases but iterates oscillate | strong cross-block coupling | change block order, add damping, or merge blocks |
| subproblems are expensive | exact Gauss-Seidel is too costly | use inexact solves with descent checks |
| many local optima | non-convex alternating landscape | use restarts and monitor validation behavior |

Connections In ML

Alternating Matrix Factorization

For nonnegative matrix factorization or low-rank factorization, update $W$ with $H$ fixed, then update $H$ with $W$ fixed. Each subproblem may be easier than the joint problem, but the joint problem is still non-convex.
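A minimal alternating least-squares sketch for the unconstrained case $X \approx WH$ (adding nonnegativity would change each block solve, for example to projected or multiplicative updates); the rank, ridge term, and random initialization are illustrative.

```python
import numpy as np

def als(X, rank, n_sweeps=100, ridge=1e-8):
    """Alternate exact ridge least-squares updates of W and H for X ~ W @ H."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, rank))
    H = rng.standard_normal((rank, n))
    I = ridge * np.eye(rank)
    for _ in range(n_sweeps):
        # Update W with H fixed: closed-form least-squares block solve.
        W = np.linalg.solve(H @ H.T + I, H @ X.T).T
        # Update H with the *new* W fixed: the Gauss-Seidel step.
        H = np.linalg.solve(W.T @ W + I, W.T @ X)
    return W, H
```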

EM Is Related, But Not Identical

Expectation-maximization alternates between a distribution over latent variables and model parameters by maximizing an evidence lower bound:

$$\mathcal{L}(q, \theta) = E_q[\log p(x, z \mid \theta)] + H(q).$$

The E-step optimizes $q$ with $\theta$ fixed. The M-step optimizes $\theta$ with $q$ fixed. That is a block-coordinate ascent pattern on a surrogate objective. EM guarantees monotone likelihood improvement under the standard exact-step setup, but EM is not generally a contraction and is not guaranteed to find the global maximum.
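A minimal 1-D, two-component Gaussian mixture EM sketch in that block-update reading: the E-step block is the responsibilities $q$, the M-step block is the parameters $\theta$ (means, variances, weights). The data, starting values, and sweep count are illustrative.

```python
import numpy as np

def em_gmm_1d(x, mu, var, pi, n_sweeps=50):
    """EM for a 1-D Gaussian mixture, written as two block updates per sweep."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_sweeps):
        # E-step (block 1): responsibilities q given the current parameters.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        q = pi * dens
        q /= q.sum(axis=1, keepdims=True)
        # M-step (block 2): parameters given the *new* responsibilities.
        Nk = q.sum(axis=0)
        mu = (q * x[:, None]).sum(axis=0) / Nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x, mu=np.array([-1.0, 1.0]), var=np.array([1.0, 1.0]),
                pi=np.array([0.5, 0.5])))
```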

ADMM Adds A Dual Memory

ADMM also alternates over blocks, but it is not merely block coordinate descent on the original objective. The augmented Lagrangian and dual update carry constraint violation forward. That residual memory is the key extra mechanism.
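To see the extra mechanism, compare with a scaled-form ADMM sketch for a lasso-style problem, minimize $\tfrac{1}{2}\|Ax - b\|^2 + \lambda\|z\|_1$ subject to $x = z$: the dual variable $u$ accumulates the constraint residual between block updates. The specific problem and parameter values are illustrative.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Scaled-form ADMM: x-block, z-block, then a dual update that remembers x - z."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))
    for _ in range(n_iter):
        # x-block: quadratic solve with z and u fixed.
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-block: soft-thresholding, using the *new* x.
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # Dual update: the "memory" of the remaining constraint violation x - z.
        u = u + x - z
    return x, z
```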

Common Confusions

Watch Out

Monotone decrease is not convergence to the best solution

The objective can decrease every sweep and still converge to a poor local stationary point. For non-convex matrix factorization, topic models, and mixture models, initialization and block design matter.

Watch Out

Gauss-Seidel is not Jacobi

Gauss-Seidel uses the newest block values inside the same sweep. Jacobi updates all blocks from old values. Jacobi parallelizes better; Gauss-Seidel often stabilizes better because it uses fresher information.
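The difference is literally where the new values are read from. A schematic contrast, assuming per-block best-response functions (the names `best_1` and `best_2` are hypothetical):

```python
def jacobi_sweep(x1, x2, best_1, best_2):
    # Both blocks read the OLD values; the two updates could run in parallel.
    return best_1(x2), best_2(x1)

def gauss_seidel_sweep(x1, x2, best_1, best_2):
    # Block 2 immediately sees the NEW value of block 1.
    x1_new = best_1(x2)
    return x1_new, best_2(x1_new)
```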

Watch Out

Exact block minimization is not always better in wall-clock time

An exact block solve may be expensive. If approximate solves give enough descent, inexact Gauss-Seidel can beat exact Gauss-Seidel in real compute time.
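A sketch of one inexact block step, assuming a differentiable objective: take a few gradient steps on one block and keep them only if the objective actually decreased. The gradient interface, step size, and inner iteration count are placeholders.

```python
def inexact_block_step(f, grad_block, x, idx, lr=0.1, n_inner=5):
    """A few gradient steps on block idx, with a cheap descent check.

    grad_block: hypothetical callable returning the gradient of f with
    respect to the idx block, evaluated at the full iterate.
    """
    x_new = x.copy()
    for _ in range(n_inner):
        x_new[idx] -= lr * grad_block(x_new)
    return x_new if f(x_new) <= f(x) else x  # reject the step if it gave no descent
```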

Watch Out

EM is not a generic convergence certificate for every alternating method

EM has a special variational identity that ensures likelihood monotonicity. A random alternating heuristic does not inherit that identity just because it has two blocks.

Q&A For Mastery

If every block update minimizes its subproblem, why can the method still fail? Because minimizing one block can make another block worse. The objective may still descend, but the iterates can zigzag, stall at a saddle-like stationary point, or converge to a local solution determined by initialization.

What should I check first in code? Log the objective after every block update, not only after a full epoch. If a block update increases the objective without an intentional stochastic or approximate reason, the implementation is not matching the theory.

When should I merge blocks? When two blocks are so strongly coupled that each one repeatedly undoes the other. Merging increases subproblem cost but can remove the unstable feedback loop.

What To Remember

  • Nonlinear Gauss-Seidel is sequential block optimization with newest-value updates.
  • Exact block minimization gives monotone objective decrease, not automatic global optimality.
  • Contraction gives linear convergence, but it is a strong local condition.
  • Cross-block coupling is the central diagnostic.
  • EM and ADMM look like alternating methods, but each has extra structure that must be respected.

Exercises

Exercise · Core

Problem

For $f(x, y) = x^2 + y^2 + \tfrac{1}{2}xy$, compute one Gauss-Seidel sweep from $(x^0, y^0) = (2, 2)$.

Exercise · Advanced

Problem

Explain why the same example would give a different first point under Jacobi updating.

Exercise · Research

Problem

In a Gaussian mixture EM run, what are the two blocks? Why does monotone likelihood improvement not imply the EM map is a contraction?

References

Canonical:

  • Bertsekas, Nonlinear Programming, 3rd ed., Sections 2.7 and 2.8.
  • Ortega and Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Chapter 10.
  • Dempster, Laird, and Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977).

Further:

  • Wright, "Coordinate Descent Algorithms" (2015).
  • Xu and Yin, "A Block Coordinate Descent Method for Regularized Multi-Convex Optimization" (2013).
  • Wu, "On the Convergence Properties of the EM Algorithm" (1983).
