
Mathematical Infrastructure

The Hessian Matrix

The matrix of second partial derivatives: encodes curvature, classifies nondegenerate critical points (and is inconclusive at degenerate ones), and is the central object in second-order optimization.


Why This Matters

The gradient tells you which way the function changes fastest. The Hessian tells you how that direction changes nearby: bowl, dome, ridge, valley, or saddle. Every second-order optimization method depends on the Hessian or an approximation to it. In deep learning, the eigenvalues of the Hessian help diagnose curvature, saddle regions, sharpness, and why a step size that works in one direction can explode in another.

The Hessian is the point where calculus, matrix operations, and optimization become the same object.

Theorem visual: The Hessian turns curvature into linear algebra

Use it as a local model: classify critical points, read directional curvature, and understand why second-order methods rescale gradients. The Hessian enters the second-order Taylor model through the quadratic term of the local quadratic model. Diagonalizing the curvature lets you read each direction separately: eigenvectors give principal directions, eigenvalues give the directional curvature along them, and the Rayleigh quotient turns a step direction into a scalar curvature measurement.

Mental Model

For a function of one variable, f''(x) tells you the curvature: positive means concave up (bowl), negative means concave down (dome). The Hessian is the multivariable generalization. But in multiple dimensions, the curvature can be different in different directions. For a C^2 scalar function the Hessian is symmetric (Schwarz's theorem, see below); under that symmetry the Rayleigh quotient turns the Hessian into a quadratic form, so its eigenvalues are the extremal directional curvatures and its eigenvectors are the principal-curvature directions. This eigenvalue / principal-curvature reading depends on the symmetric quadratic-form setting; do not transfer it verbatim to nonsymmetric Jacobians of vector fields, generalized Clarke / nonsmooth Hessians, or asymmetric finite-difference approximations.

Core Definitions

Definition

Hessian Matrix

For a twice-differentiable function f: \mathbb{R}^n \to \mathbb{R}, the Hessian matrix at a point x is the n \times n matrix of second partial derivatives:

[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Explicitly:

H_f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}
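To make the definition concrete, here is a minimal numerical sketch (NumPy assumed) that approximates each second partial derivative with a central difference. The function is the two-variable example used later on this page; the evaluation point is arbitrary and chosen only for illustration.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Approximate the Hessian of a scalar function f at x with central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = eps * np.eye(n)[i]
            e_j = eps * np.eye(n)[j]
            # central second difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# f(x, y) = x^2 y + y^3 (also used in the examples below), at an arbitrary point
f = lambda v: v[0]**2 * v[1] + v[1]**3
print(numerical_hessian(f, np.array([2.0, -1.0])))   # close to [[-2, 4], [4, -6]]
```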

Definition

Symmetry of the Hessian (Schwarz's Theorem)

If the second partial derivatives of f are continuous (i.e., f \in C^2), then the mixed partials are equal:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

This means the Hessian is a symmetric matrix: H_f = H_f^T. Consequently, the Hessian has real eigenvalues and orthogonal eigenvectors. All the machinery of symmetric matrix theory applies.

Definition

Directional Curvature

For a direction h \neq 0, the scalar

\frac{h^\top H_f(x) h}{\|h\|_2^2}

is the curvature of the second-order Taylor model in that direction. For a symmetric Hessian, the smallest and largest possible directional curvatures are the smallest and largest eigenvalues of H_f(x). This is the Rayleigh quotient view: eigenvalues are not decorative facts; they are the extremal curvatures the optimizer can feel.
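A short sketch of this Rayleigh-quotient reading, assuming NumPy; the matrix below is an illustrative symmetric Hessian, not one derived in the text.

```python
import numpy as np

H = np.array([[4.0, 2.0],
              [2.0, 12.0]])           # illustrative symmetric Hessian

def directional_curvature(H, h):
    """Rayleigh quotient: curvature of the local quadratic model along h."""
    return h @ H @ h / (h @ h)

eigvals, eigvecs = np.linalg.eigh(H)  # symmetric: real eigenvalues, orthonormal eigenvectors
print(eigvals)                                           # the extremal directional curvatures
print(directional_curvature(H, eigvecs[:, 0]))           # equals the smallest eigenvalue
print(directional_curvature(H, np.array([1.0, 1.0])))    # strictly between the two
```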

Second-Order Taylor Expansion

The Hessian appears in the second-order Taylor expansion of f around a point x:

f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H_f(x) \, h

The three terms have clear meanings:

  • f(x): the value at the current point
  • \nabla f(x)^T h: the linear (first-order) change. The gradient tells you the slope
  • \frac{1}{2} h^T H_f(x) \, h: the quadratic (second-order) change. The Hessian tells you the curvature

This quadratic approximation is the basis for Newton's method and for classifying critical points.
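As a quick check of the quadratic model, here is a sketch assuming NumPy; the function, gradient, and Hessian are the hand-derived ones from the worked example later on this page, and the expansion point and step are arbitrary.

```python
import numpy as np

# f(x, y) = x^2 y + y^3, with its gradient and Hessian written out by hand
f = lambda v: v[0]**2 * v[1] + v[1]**3
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2 + 3 * v[1]**2])
hess = lambda v: np.array([[2 * v[1], 2 * v[0]],
                           [2 * v[0], 6 * v[1]]])

x = np.array([0.5, 1.0])        # arbitrary expansion point
h = np.array([0.01, -0.02])     # small step

quadratic_model = f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h
print(f(x + h), quadratic_model)   # agree up to an O(||h||^3) error
```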

The Second Derivative Test

Theorem

Second Derivative Test (Multivariate)

Statement

Let x^* be a critical point of f (i.e., \nabla f(x^*) = 0). Then:

  • If H_f(x^*) is positive definite (all eigenvalues > 0), then x^* is a strict local minimum.
  • If H_f(x^*) is negative definite (all eigenvalues < 0), then x^* is a strict local maximum.
  • If H_f(x^*) is indefinite (has both positive and negative eigenvalues), then x^* is a saddle point.
  • If H_f(x^*) is positive (or negative) semidefinite (some eigenvalue is zero), the test is inconclusive: higher-order terms are needed.

Intuition

At a critical point, \nabla f = 0, so the Taylor expansion becomes f(x^* + h) \approx f(x^*) + \frac{1}{2} h^T H h. If H is positive definite, the quadratic form h^T H h > 0 for all h \neq 0, so f increases in every direction from x^*. It is a minimum. If H is indefinite, f increases in some directions and decreases in others: a saddle point.

Proof Sketch

At the critical point x^*, Taylor's theorem with remainder gives:

f(x^* + h) = f(x^*) + \frac{1}{2} h^T H_f(x^*) h + o(\|h\|^2)

If H_f(x^*) is positive definite with minimum eigenvalue \lambda_{\min} > 0, then h^T H h \geq \lambda_{\min} \|h\|^2, so for sufficiently small h:

f(x^* + h) \geq f(x^*) + \frac{\lambda_{\min}}{2}\|h\|^2 + o(\|h\|^2) > f(x^*)

The o(\|h\|^2) term is dominated by the quadratic term for small \|h\|. The indefinite case follows by choosing h along eigenvectors with positive and negative eigenvalues.

Why It Matters

This test is how you verify that a critical point found by setting \nabla f = 0 is actually a minimum (or maximum or saddle). In optimization, you need to know that your converged solution is a local minimum, not a saddle point. In deep learning, this reveals the structure of the loss landscape: are we stuck at a saddle point, or have we found a genuine minimum?

Failure Mode

The test is inconclusive when H is semidefinite (has a zero eigenvalue). Example: f(x, y) = x^4 + y^2 at the origin has H = \text{diag}(0, 2), which is positive semidefinite. The origin is a minimum, but the second derivative test cannot confirm this: you need to examine the fourth-order term. Also, the test is local: a positive definite Hessian at x^* does not guarantee x^* is a global minimum.
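A compact way to apply the test in code is to compute the eigenvalues of the symmetric Hessian and branch on their signs, with a tolerance to flag the semidefinite, inconclusive case. This is a sketch assuming NumPy; classify_critical_point and the example matrices are hypothetical, chosen only to exercise each branch.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Second derivative test from the eigenvalues of a symmetric Hessian at a critical point."""
    eigvals = np.linalg.eigvalsh(H)
    if np.min(eigvals) > tol:
        return "strict local minimum"
    if np.max(eigvals) < -tol:
        return "strict local maximum"
    if np.min(eigvals) < -tol and np.max(eigvals) > tol:
        return "saddle point"
    return "inconclusive (semidefinite: a near-zero eigenvalue)"

print(classify_critical_point(np.diag([3.0, 1.0])))    # strict local minimum
print(classify_critical_point(np.diag([-3.0, 2.0])))   # saddle point
print(classify_critical_point(np.diag([0.0, 2.0])))    # inconclusive
```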

The Hessian in Optimization

Newton's Method

Newton's method for minimizing f uses the Hessian directly. At each iteration:

x_{t+1} = x_t - [H_f(x_t)]^{-1} \nabla f(x_t)

This is equivalent to minimizing the quadratic Taylor approximation at each step. When f is a quadratic with positive definite Hessian, Newton's method converges in one step. For general smooth functions near a minimum, Newton's method converges quadratically (doubling the number of correct digits per iteration).

The cost: computing and inverting the n \times n Hessian is O(n^2) storage and O(n^3) computation. For deep networks with millions of parameters, this is prohibitive.
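Here is a minimal sketch of one Newton step on a strictly convex quadratic, assuming NumPy; the matrix, vector, and starting point are illustrative. In practice one solves the linear system H d = \nabla f rather than forming the inverse.

```python
import numpy as np

# f(x) = 0.5 x^T A x + b^T x with A positive definite: gradient A x + b, Hessian A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([-1.0, 4.0])

x = np.array([10.0, -7.0])                # arbitrary starting point
step = np.linalg.solve(A, A @ x + b)      # solve H d = grad f instead of inverting H
x_next = x - step

print(x_next)                             # one step lands on the exact minimizer
print(np.linalg.solve(A, -b))             # minimizer: solution of A x = -b
```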

Quasi-Newton Methods

Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradient information. L-BFGS stores a low-rank approximation using the last m gradient differences, requiring only O(mn) storage. These methods can achieve superlinear convergence on smooth problems: faster than gradient descent, slower than Newton, and at a fraction of the storage cost.
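To see a quasi-Newton method in use, here is a sketch assuming SciPy is available; the Rosenbrock test function, its gradient helper, and the starting point are illustrative choices, not part of the discussion above.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS builds a low-rank curvature approximation from recent gradient differences;
# only gradients are supplied, never the Hessian.
x0 = np.full(10, 1.5)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(result.x)       # close to the all-ones minimizer of the Rosenbrock function
print(result.nit)     # number of iterations used
```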

Hessian-Vector Products

You often do not need the full Hessian, only its action on a vector v: H_f(x) v. A Hessian-vector product avoids storing the n \times n matrix. The finite-difference identity is:

H_f(x) \, v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}

In automatic differentiation, Pearlmutter's trick computes the exact Hessian-vector product as a Jacobian-vector product of the gradient, usually within a small constant factor of a gradient evaluation. Hessian-vector products enable Krylov methods, Newton-CG, influence-function approximations, and spectral diagnostics without materializing the full Hessian.
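A sketch of the finite-difference route, assuming NumPy; a central difference is used here because it is more accurate than the one-sided form of the identity above, and the gradient is the hand-derived one for the two-variable example on this page. The evaluation point and direction are arbitrary.

```python
import numpy as np

def hvp_finite_difference(grad, x, v, eps=1e-6):
    """Approximate H_f(x) v from two gradient evaluations (central difference)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Gradient of f(x, y) = x^2 y + y^3, written out by hand
grad = lambda p: np.array([2 * p[0] * p[1], p[0]**2 + 3 * p[1]**2])

x = np.array([2.0, -1.0])
v = np.array([1.0, -1.0])
print(hvp_finite_difference(grad, x, v))   # close to the exact H v = (-6, 10)
```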

Hessian Eigenvalues and the Loss Landscape

In deep learning, the Hessian of the loss with respect to the parameters reveals the geometry of the loss landscape:

  • Eigenvalue spectrum: the distribution of Hessian eigenvalues tells you about curvature. A few large eigenvalues with many near zero suggest low-dimensional structure in the loss landscape.
  • Sharp vs. flat minima: large eigenvalues mean the loss rises quickly in some directions. Small eigenvalues mean locally flat directions. The connection between flatness and generalization is useful but not invariant under all reparameterizations, so treat it as a diagnostic, not a theorem.
  • Saddle points: high-dimensional non-convex objectives contain many saddle-like regions. A random symmetric matrix is indefinite with high probability, which makes saddle behavior the natural baseline rather than an exotic failure case. The ratio of negative eigenvalues to total eigenvalues is the index of the saddle point.
| Hessian signal | What it suggests | ML consequence |
| --- | --- | --- |
| Large positive top eigenvalue | A steep direction | Step size may need clipping, damping, or normalization |
| Many near-zero eigenvalues | Flat or weakly identified directions | Parameters can move with little loss change |
| Negative eigenvalues | Local descent direction from a critical point | The point is not a local minimum |
| Large condition number | Curvature varies by direction | First-order methods zigzag; Newton systems need damping |
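One common diagnostic built from Hessian-vector products is power iteration for the top eigenvalue. Here is a sketch assuming NumPy, with a small explicit matrix standing in for an HVP that would normally come from automatic differentiation; note that plain power iteration converges to the eigenvalue of largest magnitude, which may be negative.

```python
import numpy as np

def top_eigenvalue_via_hvp(hvp, dim, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue using only Hessian-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v)   # Rayleigh quotient at the converged direction

# Illustrative stand-in: in practice hvp would be an autodiff Hessian-vector product
H = np.array([[4.0, 2.0],
              [2.0, 12.0]])
print(top_eigenvalue_via_hvp(lambda v: H @ v, dim=2))   # ~12.47, the top eigenvalue of H
```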

Canonical Examples

Example

Hessian of a quadratic form

Let f(x) = \frac{1}{2} x^T A x + b^T x + c where A is symmetric. The gradient is \nabla f(x) = Ax + b, and the Hessian is:

H_f(x) = A

The Hessian is constant, independent of x. The function is convex if and only if A is positive semidefinite. For quadratics, the curvature is the same everywhere, which is why Newton's method converges in one step.
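A quick check of that convexity criterion, assuming NumPy; the matrix A below is illustrative.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])           # symmetric, eigenvalues 1 and 3

eigvals = np.linalg.eigvalsh(A)       # the Hessian of the quadratic is A everywhere
print(eigvals, bool(np.all(eigvals >= 0)))   # all nonnegative, so the quadratic is convex
```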

Example

Hessian of a simple two-variable function

Let f(x, y) = x^2 y + y^3.

Gradient: \nabla f = (2xy, \; x^2 + 3y^2)^T.

Second partial derivatives:

  • \frac{\partial^2 f}{\partial x^2} = 2y
  • \frac{\partial^2 f}{\partial x \partial y} = 2x
  • \frac{\partial^2 f}{\partial y^2} = 6y

Hessian:

H_f(x, y) = \begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}

At the origin (0, 0): H_f = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, the zero matrix. The second derivative test is inconclusive. (Indeed, the origin is a degenerate critical point.)

At (0, 1): H_f = \begin{pmatrix} 2 & 0 \\ 0 & 6 \end{pmatrix} is positive definite, but this does not make (0, 1) a local minimum because \nabla f(0, 1) = (0, 3)^\top \neq 0. The second derivative test only classifies critical points. Away from a critical point, the gradient still dominates the local behavior.
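Verifying those two points numerically, in a short sketch assuming NumPy and using the hand-derived Hessian from above:

```python
import numpy as np

hess = lambda x, y: np.array([[2 * y, 2 * x],
                              [2 * x, 6 * y]])

print(np.linalg.eigvalsh(hess(0.0, 0.0)))   # [0, 0]: zero matrix, test inconclusive
print(np.linalg.eigvalsh(hess(0.0, 1.0)))   # [2, 6]: positive definite, but (0, 1) is not a critical point
```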

Common Confusions

Watch Out

The Hessian is NOT the same as the outer product of gradients

In deep learning, the Fisher information matrix

\mathbb{E}[\nabla \log p \cdot (\nabla \log p)^T]

is sometimes confused with the Hessian. They are different objects. The Hessian involves second derivatives of a single function; the Fisher involves first derivatives averaged over data. They coincide only under specific conditions (e.g., for the negative log-likelihood of an exponential family at the true parameters).

Watch Out

Positive definite Hessian at a point does not mean global convexity

H_f(x^*) \succ 0 means f is locally convex near x^*. The function could be non-convex elsewhere. Global convexity requires H_f(x) \succeq 0 for all x, which is a much stronger condition.

Watch Out

Positive definite Hessian does not classify a non-critical point

The second derivative test starts after \nabla f(x^*) = 0. If the gradient is nonzero, a positive definite Hessian only says the local quadratic model curves upward. The function can still decrease along the negative gradient direction, so there is no local-minimum conclusion.

Watch Out

The Hessian exists but may be useless to compute explicitly

For a function of n variables, the Hessian is an n \times n matrix. For a neural network with n = 10^8 parameters, the Hessian has 10^{16} entries. It cannot be stored, let alone inverted. This is why Hessian-vector products and low-rank approximations (L-BFGS, Kronecker-factored approximations) are standard in practice.

Summary

  • The Hessian H_f(x) is the n \times n matrix of second partial derivatives: [H]_{ij} = \partial^2 f / \partial x_i \partial x_j
  • Schwarz's theorem: if f \in C^2, the Hessian is symmetric
  • Second-order Taylor: f(x+h) \approx f(x) + \nabla f^T h + \frac{1}{2} h^T H h
  • Second derivative test: positive definite \Rightarrow local min, negative definite \Rightarrow local max, indefinite \Rightarrow saddle
  • Newton's method: x_{t+1} = x_t - H^{-1} \nabla f uses the Hessian directly, converges quadratically
  • Hessian-vector products can be computed without storing the full Hessian
  • Hessian eigenvalues reveal loss landscape geometry: sharp vs. flat minima, saddle point structure

Exercises

Exercise (Core)

Problem

Compute the Hessian of f(x, y) = x^2 y + y^3 at the point (1, 2). Determine whether the Hessian at this point is positive definite, negative definite, or indefinite.

Exercise (Advanced)

Problem

At a point x, suppose \nabla f(x) = 0 and the Hessian has eigenvalues (-3, 0.2, 5). What does the second derivative test conclude? What would change if the eigenvalues were (0, 0.2, 5)?

Exercise (Advanced)

Problem

Let f(x) = \|Ax - b\|^2 for A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m. Compute \nabla f(x) and H_f(x). Under what condition on A is the Hessian positive definite (guaranteeing a unique global minimum)?

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and 6
  • Boyd & Vandenberghe, Convex Optimization (2004), Sections 3.1-3.2 and Appendix A
  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 5-6
  • Pearlmutter, "Fast Exact Multiplication by the Hessian" (1994), Neural Computation 6(1)

Current:

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review 60(2)
  • Ghorbani, Krishnan, Xiao, "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" (2019), arXiv:1901.10159
  • Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" (2017), arXiv:1609.04836
  • Dinh et al., "Sharp Minima Can Generalize For Deep Nets" (2017), arXiv:1703.04933 (counterpoint to sharpness-based generalization claims)
  • Dauphin et al., "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization" (2014), arXiv:1406.2572
  • Martens & Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature" (2015), arXiv:1503.05671 (K-FAC)

Next Topics

The natural next steps from the Hessian:

  • Newton's method: using the Hessian for second-order optimization, convergence theory, and practical modifications
  • Convex optimization basics: where the Hessian being positive semidefinite everywhere guarantees a global minimum
  • Automatic differentiation: how modern frameworks compute gradients and Hessian-vector products without symbolic algebra
  • Neural network optimization landscape: how curvature, saddle points, and sharpness appear in deep learning

Last reviewed: April 26, 2026
