Mathematical Infrastructure
The Hessian Matrix
The matrix of second partial derivatives: encodes curvature, classifies nondegenerate critical points (and is inconclusive at degenerate ones), and is the central object in second-order optimization.
Prerequisites
Differentiation in Rⁿ, eigenvalues and eigenvectors, matrix operations and properties, the Jacobian matrix, and the vector calculus chain rule.
Why This Matters
The gradient tells you which way the function changes fastest. The Hessian tells you how that direction changes nearby: bowl, dome, ridge, valley, or saddle. Every second-order optimization method depends on the Hessian or an approximation to it. In deep learning, the eigenvalues of the Hessian help diagnose curvature, saddle regions, sharpness, and why a step size that works in one direction can explode in another.
The Hessian is the point where calculus, matrix operations, and optimization become the same object.
Mental Model
For a function of one variable, the second derivative $f''(x)$ tells you the curvature: positive means concave up (bowl), negative means concave down (dome). The Hessian is the multivariable generalization. But in multiple dimensions, the curvature can be different in different directions. For a scalar function the Hessian is symmetric (Schwarz's theorem, see below); under that symmetry the Rayleigh quotient turns the Hessian into a quadratic form, so its eigenvalues are the extremal directional curvatures and its eigenvectors are the principal-curvature directions. This eigenvalue / principal-curvature reading depends on the symmetric quadratic-form setting; do not transfer it verbatim to nonsymmetric Jacobians of vector fields, generalized Clarke / nonsmooth Hessians, or asymmetric finite-difference approximations.
Core Definitions
Hessian Matrix
For a twice-differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix at a point $x \in \mathbb{R}^n$ is the $n \times n$ matrix of second partial derivatives:

$$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x).$$

Explicitly:

$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.$$
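As a quick numerical check of the definition, here is a minimal finite-difference sketch; the helper name, test function, and step size are illustrative choices rather than anything from the section above.

```python
import numpy as np

def hessian_fd(f, x, eps=1e-4):
    """Approximate the Hessian of a scalar function f at x by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# Example: f(x, y) = x^2 * y + sin(y); analytic Hessian is [[2y, 2x], [2x, -sin(y)]].
f = lambda v: v[0]**2 * v[1] + np.sin(v[1])
print(hessian_fd(f, np.array([1.0, 2.0])))   # ~[[4.0, 2.0], [2.0, -0.909]]
```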
Symmetry of the Hessian (Schwarz's Theorem)
If the second partial derivatives of $f$ are continuous (i.e., $f \in C^2$), then the mixed partials are equal:

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j} = \frac{\partial^2 f}{\partial x_j \, \partial x_i}.$$
This means the Hessian is a symmetric matrix: $\nabla^2 f(x) = \nabla^2 f(x)^\top$. Consequently, the Hessian has real eigenvalues and orthogonal eigenvectors. All the machinery of symmetric matrix theory applies.
Directional Curvature
For a unit direction $d \in \mathbb{R}^n$, the scalar

$$d^\top \nabla^2 f(x)\, d$$

is the curvature of the second-order Taylor model in that direction. For a symmetric Hessian, the smallest and largest possible directional curvatures over unit directions are the smallest and largest eigenvalues of $\nabla^2 f(x)$. This is the Rayleigh quotient view: eigenvalues are not decorative facts; they are the extremal curvatures the optimizer can feel.
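A small numerical illustration of this Rayleigh-quotient view; the matrix, sample count, and seed below are arbitrary choices for the sketch.

```python
import numpy as np

# Sample directional curvatures d^T H d over random unit directions for a fixed
# symmetric H; the extremes approach the smallest and largest eigenvalues.
rng = np.random.default_rng(0)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
print(np.linalg.eigvalsh(H))                     # [1.381..., 3.618...]

ds = rng.normal(size=(10_000, 2))
ds /= np.linalg.norm(ds, axis=1, keepdims=True)  # unit directions
curv = np.einsum("ij,jk,ik->i", ds, H, ds)       # d^T H d for every row d
print(curv.min(), curv.max())                    # close to the two eigenvalues
```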
Second-Order Taylor Expansion
The Hessian appears in the second-order Taylor expansion of $f$ around a point $x_0$:

$$f(x_0 + \Delta x) \approx f(x_0) + \nabla f(x_0)^\top \Delta x + \tfrac{1}{2}\, \Delta x^\top \nabla^2 f(x_0)\, \Delta x$$
The three terms have clear meanings:
- $f(x_0)$: the value at the current point
- $\nabla f(x_0)^\top \Delta x$: the linear (first-order) change. The gradient tells you the slope
- $\tfrac{1}{2}\, \Delta x^\top \nabla^2 f(x_0)\, \Delta x$: the quadratic (second-order) change. The Hessian tells you the curvature
This quadratic approximation is the basis for Newton's method and for classifying critical points.
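To make the approximation error concrete, here is a small sketch comparing $f$ with its second-order Taylor model; the test function, expansion point, and displacement direction are illustrative choices.

```python
import numpy as np

# f, its gradient, and its Hessian for f(x, y) = exp(x) + x * y^2 (derived by hand).
f    = lambda v: np.exp(v[0]) + v[0] * v[1]**2
grad = lambda v: np.array([np.exp(v[0]) + v[1]**2, 2 * v[0] * v[1]])
hess = lambda v: np.array([[np.exp(v[0]), 2 * v[1]],
                           [2 * v[1],     2 * v[0]]])

x0 = np.array([0.0, 1.0])
for t in [0.5, 0.1, 0.01]:
    dx = t * np.array([1.0, -1.0])
    taylor2 = f(x0) + grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx
    print(t, abs(f(x0 + dx) - taylor2))   # error shrinks like O(t^3)
```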
The Second Derivative Test
Second Derivative Test (Multivariate)
Statement
Let $x^*$ be a critical point of $f$ (i.e., $\nabla f(x^*) = 0$). Then:
- If $\nabla^2 f(x^*)$ is positive definite (all eigenvalues $> 0$), then $x^*$ is a strict local minimum.
- If $\nabla^2 f(x^*)$ is negative definite (all eigenvalues $< 0$), then $x^*$ is a strict local maximum.
- If $\nabla^2 f(x^*)$ is indefinite (has both positive and negative eigenvalues), then $x^*$ is a saddle point.
- If $\nabla^2 f(x^*)$ is positive (or negative) semidefinite but not definite (some eigenvalue is zero), the test is inconclusive: higher-order terms are needed.
Intuition
At a critical point, $\nabla f(x^*) = 0$, so the Taylor expansion becomes $f(x^* + \Delta x) \approx f(x^*) + \tfrac{1}{2}\, \Delta x^\top \nabla^2 f(x^*)\, \Delta x$. If $\nabla^2 f(x^*)$ is positive definite, the quadratic form $\Delta x^\top \nabla^2 f(x^*)\, \Delta x > 0$ for all $\Delta x \neq 0$, so $f$ increases in every direction from $x^*$. It is a minimum. If $\nabla^2 f(x^*)$ is indefinite, $f$ increases in some directions and decreases in others: a saddle point.
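This eigenvalue-sign logic is mechanical enough to state as code; the helper below and the example Hessians are illustrative, not from the text.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalue signs of its symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "strict local minimum"
    if np.all(eigvals < -tol):
        return "strict local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (semidefinite: some eigenvalue is ~0)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 3.0]])))    # local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -3.0]])))   # saddle point
print(classify_critical_point(np.zeros((2, 2))))                      # inconclusive
```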
Proof Sketch
At the critical point $x^*$, Taylor's theorem with remainder gives:

$$f(x^* + \Delta x) = f(x^*) + \tfrac{1}{2}\, \Delta x^\top \nabla^2 f(x^*)\, \Delta x + o(\|\Delta x\|^2).$$

If $\nabla^2 f(x^*)$ is positive definite with minimum eigenvalue $\lambda_{\min} > 0$, then $\Delta x^\top \nabla^2 f(x^*)\, \Delta x \geq \lambda_{\min} \|\Delta x\|^2$, so for sufficiently small $\|\Delta x\|$:

$$f(x^* + \Delta x) \geq f(x^*) + \tfrac{1}{2}\lambda_{\min} \|\Delta x\|^2 + o(\|\Delta x\|^2) > f(x^*).$$

The $o(\|\Delta x\|^2)$ term is dominated by the quadratic term for small $\|\Delta x\|$. The indefinite case follows by choosing $\Delta x$ along eigenvectors with positive and negative eigenvalues.
Why It Matters
This test is how you verify that a critical point found by setting $\nabla f(x) = 0$ is actually a minimum (or maximum or saddle). In optimization, you need to know that your converged solution is a local minimum, not a saddle point. In deep learning, this reveals the structure of the loss landscape: are we stuck at a saddle point, or have we found a genuine minimum?
Failure Mode
The test is inconclusive when $\nabla^2 f$ is semidefinite (has a zero eigenvalue). Example: $f(x, y) = x^4 + y^4$ at the origin has $\nabla^2 f(0, 0) = 0$, the zero matrix, which is positive semidefinite. The origin is a minimum, but the second derivative test cannot confirm this: you need to examine the fourth-order term. Also, the test is local: a positive definite Hessian at $x^*$ does not guarantee $x^*$ is a global minimum.
The Hessian in Optimization
Newton's Method
Newton's method for minimizing $f$ uses the Hessian directly. At each iteration:

$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k).$$

This is equivalent to minimizing the quadratic Taylor approximation at each step. When $f$ is quadratic, Newton's method converges in one step. For general smooth functions near a minimum with positive definite Hessian, Newton's method converges quadratically (doubling the number of correct digits per iteration).
The cost: storing the Hessian takes $O(n^2)$ memory, and solving the Newton system (or inverting the Hessian) takes $O(n^3)$ computation. For deep networks with millions of parameters, this is prohibitive.
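Here is a minimal undamped Newton iteration on a hand-picked two-variable function; the function, starting point, and iteration count are illustrative, and practical implementations add damping or a trust region.

```python
import numpy as np

# Gradient and Hessian of f(x, y) = x^4 + y^4 - 4xy (derived by hand).
grad = lambda v: np.array([4 * v[0]**3 - 4 * v[1], 4 * v[1]**3 - 4 * v[0]])
hess = lambda v: np.array([[12 * v[0]**2, -4.0],
                           [-4.0,         12 * v[1]**2]])

x = np.array([1.5, 0.8])
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))   # solve the Newton system, never invert
print(x, grad(x))   # x ~ (1, 1), a local minimizer; gradient ~ 0
```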
Quasi-Newton Methods
Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradient information. L-BFGS builds a low-rank curvature approximation from the last $m$ gradient and iterate differences, requiring only $O(mn)$ storage. These methods can achieve superlinear convergence on smooth problems: faster than gradient descent, slower than Newton, and at a fraction of the storage cost.
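In practice you rarely implement this yourself; a typical usage sketch with SciPy's L-BFGS-B implementation is below (the Rosenbrock test problem and starting point are arbitrary choices).

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function and its hand-derived gradient.
f    = lambda v: (v[0] - 1)**2 + 100 * (v[1] - v[0]**2)**2
grad = lambda v: np.array([2 * (v[0] - 1) - 400 * v[0] * (v[1] - v[0]**2),
                           200 * (v[1] - v[0]**2)])

result = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
print(result.x)   # close to the minimizer [1, 1]
```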
Hessian-Vector Products
You often do not need the full Hessian, only its action on a vector $v$: the product $\nabla^2 f(x)\, v$. A Hessian-vector product avoids storing the matrix. The finite-difference approximation is:

$$\nabla^2 f(x)\, v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}$$

for small $\epsilon > 0$.
In automatic differentiation, Pearlmutter's trick computes the exact Hessian-vector product as a Jacobian-vector product of the gradient, usually within a small constant factor of a gradient evaluation. Hessian-vector products enable Krylov methods, Newton-CG, influence-function approximations, and spectral diagnostics without materializing the full Hessian.
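A sketch of the finite-difference identity above, checked against a quadratic whose Hessian is known exactly; the matrix, point, and direction are illustrative, and autodiff frameworks would compute the exact product via Pearlmutter's trick instead.

```python
import numpy as np

def hvp_fd(grad, x, v, eps=1e-6):
    """Hessian-vector product from the forward finite difference of the gradient."""
    return (grad(x + eps * v) - grad(x)) / eps

# For f(x) = 0.5 * x^T A x the gradient is A x and the Hessian is A,
# so the finite difference is exact up to roundoff.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda x: A @ x
x0, v = np.array([0.3, -0.7]), np.array([1.0, 2.0])
print(hvp_fd(grad, x0, v))   # ~[3.0, 2.5]
print(A @ v)                 # exact Hessian-vector product for comparison
```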
Hessian Eigenvalues and the Loss Landscape
In deep learning, the Hessian of the loss with respect to the parameters reveals the geometry of the loss landscape:
- Eigenvalue spectrum: the distribution of Hessian eigenvalues tells you about curvature. A few large eigenvalues with many near zero suggest low-dimensional structure in the loss landscape.
- Sharp vs. flat minima: large eigenvalues mean the loss rises quickly in some directions. Small eigenvalues mean locally flat directions. The connection between flatness and generalization is useful but not invariant under all reparameterizations, so treat it as a diagnostic, not a theorem.
- Saddle points: high-dimensional non-convex objectives contain many saddle-like regions. A random symmetric matrix is indefinite with high probability, which makes saddle behavior the natural baseline rather than an exotic failure case. The ratio of negative eigenvalues to total eigenvalues is the index of the saddle point.
| Hessian signal | What it suggests | ML consequence |
|---|---|---|
| Large positive top eigenvalue | A steep direction | Step size may need clipping, damping, or normalization |
| Many near-zero eigenvalues | Flat or weakly identified directions | Parameters can move with little loss change |
| Negative eigenvalues | Local descent direction from a critical point | The point is not a local minimum |
| Large condition number | Curvature varies by direction | First-order methods zigzag; Newton systems need damping |
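To make the table concrete, the sketch below computes these diagnostics for a small randomly generated symmetric matrix standing in for a Hessian; the size, seed, and near-zero threshold are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(50, 50))
H = (B + B.T) / 2                        # symmetric and, generically, indefinite

eigvals = np.linalg.eigvalsh(H)          # ascending order
print("top eigenvalue:           ", eigvals[-1])
print("negative eigenvalues:     ", int(np.sum(eigvals < 0)))
print("near-zero (|l| < 1e-2):   ", int(np.sum(np.abs(eigvals) < 1e-2)))
print("condition number (2-norm):", np.abs(eigvals).max() / np.abs(eigvals).min())
```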
Canonical Examples
Hessian of a quadratic form
Let $f(x) = \tfrac{1}{2} x^\top A x$ where $A$ is symmetric. The gradient is $\nabla f(x) = Ax$, and the Hessian is:

$$\nabla^2 f(x) = A.$$

The Hessian is constant, independent of $x$. The function is convex if and only if $A$ is positive semidefinite. For quadratics, the curvature is the same everywhere, which is why Newton's method converges in one step.
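A one-step check of that last claim; here a linear term $-b^\top x$ is added so the minimizer is nontrivial, and $A$, $b$, and the starting point are illustrative.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x with A positive definite has gradient A x - b,
# constant Hessian A, and unique minimizer A^{-1} b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x = np.array([10.0, -7.0])               # arbitrary starting point
x_new = x - np.linalg.solve(A, grad(x))  # a single Newton step
print(x_new)                             # [0.0909..., 0.6363...]
print(np.linalg.solve(A, b))             # the minimizer: identical
```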
Hessian of a simple two-variable function
Let $f(x, y) = x^4 + y^4$.
Gradient: $\nabla f(x, y) = (4x^3,\ 4y^3)$.
Second partial derivatives:

$$\frac{\partial^2 f}{\partial x^2} = 12x^2, \qquad \frac{\partial^2 f}{\partial y^2} = 12y^2, \qquad \frac{\partial^2 f}{\partial x \, \partial y} = 0.$$

Hessian:

$$\nabla^2 f(x, y) = \begin{bmatrix} 12x^2 & 0 \\ 0 & 12y^2 \end{bmatrix}.$$

At the origin $(0, 0)$: $\nabla^2 f(0, 0) = 0$, the zero matrix. The second derivative test is inconclusive. (Indeed, the origin is a degenerate critical point.)
At $(1, 1)$: $\nabla^2 f(1, 1) = \operatorname{diag}(12, 12)$ is positive definite, but this does not make $(1, 1)$ a local minimum because $\nabla f(1, 1) = (4, 4) \neq 0$. The second derivative test only classifies critical points. Away from a critical point, the gradient still dominates the local behavior.
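A symbolic spot-check of this worked example, assuming a recent SymPy is available; the substitution points mirror the two cases above.

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**4 + y**4
H = sp.hessian(f, (x, y))

print(H)                                          # Matrix([[12*x**2, 0], [0, 12*y**2]])
print(H.subs({x: 0, y: 0}))                       # zero matrix -> test inconclusive
print(H.subs({x: 1, y: 1}).is_positive_definite)  # True, but (1, 1) is not a critical point
```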
Common Confusions
The Hessian is NOT the same as the outer product of gradients
In deep learning, the Fisher information matrix

$$F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$$

is sometimes confused with the Hessian. They are different objects. The Hessian involves second derivatives of a single function; the Fisher involves outer products of first derivatives averaged over data. They coincide only under specific conditions (e.g., for the negative log-likelihood of an exponential family at the true parameters).
Positive definite Hessian at a point does not mean global convexity
$\nabla^2 f(x_0) \succ 0$ means $f$ is locally convex near $x_0$. The function could be non-convex elsewhere. Global convexity requires $\nabla^2 f(x) \succeq 0$ for all $x$, which is a much stronger condition.
Positive definite Hessian does not classify a non-critical point
The second derivative test applies only after establishing $\nabla f(x_0) = 0$. If the gradient is nonzero, a positive definite Hessian only says the local quadratic model curves upward. The function still decreases along the negative gradient direction, so there is no local-minimum conclusion.
The Hessian exists but may be useless to compute explicitly
For a function of $n$ variables, the Hessian is an $n \times n$ matrix. For a neural network with millions of parameters, the Hessian has trillions of entries. It cannot be stored, let alone inverted. This is why Hessian-vector products and low-rank approximations (L-BFGS, Kronecker-factored approximations) are standard in practice.
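The arithmetic behind that claim is worth seeing once; the parameter counts below and the float32 assumption are illustrative.

```python
# Dense-Hessian storage in float32 (4 bytes per entry) for illustrative model sizes.
for n in [10_000, 1_000_000, 100_000_000]:
    bytes_needed = n * n * 4
    print(f"n = {n:>11,d} parameters -> {bytes_needed / 1e9:,.1f} GB for the Hessian")
```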
Summary
- The Hessian is the matrix of second partial derivatives: $[\nabla^2 f(x)]_{ij} = \partial^2 f / \partial x_i \partial x_j$
- Schwarz's theorem: if $f \in C^2$, the Hessian is symmetric
- Second-order Taylor: $f(x_0 + \Delta x) \approx f(x_0) + \nabla f(x_0)^\top \Delta x + \tfrac{1}{2}\, \Delta x^\top \nabla^2 f(x_0)\, \Delta x$
- Second derivative test: positive definite → local min, negative definite → local max, indefinite → saddle
- Newton's method: uses the Hessian directly, converges quadratically
- Hessian-vector products can be computed without storing the full Hessian
- Hessian eigenvalues reveal loss landscape geometry: sharp vs. flat minima, saddle point structure
Exercises
Problem
Compute the Hessian of at the point . Determine whether the Hessian at this point is positive definite, negative definite, or indefinite.
Problem
At a point , suppose and the Hessian has eigenvalues . What does the second derivative test conclude? What would change if the eigenvalues were ?
Problem
Let for and . Compute and . Under what condition on is the Hessian positive definite (guaranteeing a unique global minimum)?
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and 6
- Boyd & Vandenberghe, Convex Optimization (2004), Sections 3.1-3.2 and Appendix A
- Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 5-6
- Pearlmutter, "Fast Exact Multiplication by the Hessian" (1994), Neural Computation 6(1)
Current:
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review 60(2)
- Ghorbani, Krishnan, Xiao, "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" (2019), arXiv:1901.10159
- Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" (2017), arXiv:1609.04836
- Dinh et al., "Sharp Minima Can Generalize For Deep Nets" (2017), arXiv:1703.04933 (counterpoint to sharpness-based generalization claims)
- Dauphin et al., "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization" (2014), arXiv:1406.2572
- Martens & Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature" (2015), arXiv:1503.05671 (K-FAC)
Next Topics
The natural next steps from the Hessian:
- Newton's method: using the Hessian for second-order optimization, convergence theory, and practical modifications
- Convex optimization basics: where the Hessian being positive semidefinite everywhere guarantees a global minimum
- Automatic differentiation: how modern frameworks compute gradients and Hessian-vector products without symbolic algebra
- Neural network optimization landscape: how curvature, saddle points, and sharpness appear in deep learning
Last reviewed: April 26, 2026