
Mathematical Infrastructure

The Hessian Matrix

The matrix of second partial derivatives: encodes curvature, classifies nondegenerate critical points (and is inconclusive at degenerate ones), and is the central object in second-order optimization.


Why This Matters

The gradient tells you which way the function changes fastest. The Hessian tells you how that direction changes nearby: bowl, dome, ridge, valley, or saddle. Every second-order optimization method depends on the Hessian or an approximation to it. In deep learning, the eigenvalues of the Hessian help diagnose curvature, saddle regions, sharpness, and why a step size that works in one direction can explode in another.

The Hessian is the point where calculus, matrix operations, and optimization become the same object.

Theorem visual: The Hessian turns curvature into linear algebra

Use it as a local model: classify critical points, read directional curvature, and understand why second-order methods rescale gradients. The Hessian enters the second-order Taylor model through the quadratic term of the local quadratic model. Diagonalizing the curvature lets you read each direction separately: eigenvectors give principal directions, eigenvalues give the directional curvature along them, and the Rayleigh quotient turns a step direction into a scalar curvature measurement.

Mental Model

For a function of one variable, f''(x) tells you the curvature: positive means concave up (bowl), negative means concave down (dome). The Hessian is the multivariable generalization. But in multiple dimensions, the curvature can be different in different directions. For a C^2 scalar function the Hessian is symmetric (Schwarz's theorem, see below); under that symmetry the Rayleigh quotient turns the Hessian into a quadratic form, so its eigenvalues are the extremal directional curvatures and its eigenvectors are the principal-curvature directions. This eigenvalue / principal-curvature reading depends on the symmetric quadratic-form setting; do not transfer it verbatim to nonsymmetric Jacobians of vector fields, generalized Clarke / nonsmooth Hessians, or asymmetric finite-difference approximations.

Core Definitions

Definition

Hessian Matrix

For a twice-differentiable function f: \mathbb{R}^n \to \mathbb{R}, the Hessian matrix at a point x is the n \times n matrix of second partial derivatives:

[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Explicitly:

H_f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}
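To make the definition concrete, here is a minimal numerical sketch (NumPy assumed) that approximates each second partial derivative with a central difference. The function is the two-variable example used later on this page; the evaluation point is arbitrary and chosen only for illustration.

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Approximate the Hessian of a scalar function f at x with central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = eps * np.eye(n)[i]
            e_j = eps * np.eye(n)[j]
            # central second difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

# f(x, y) = x^2 y + y^3 (also used in the examples below), at an arbitrary point
f = lambda v: v[0]**2 * v[1] + v[1]**3
print(numerical_hessian(f, np.array([2.0, -1.0])))   # close to [[-2, 4], [4, -6]]
```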

Definition

Symmetry of the Hessian (Schwarz's Theorem)

If the second partial derivatives of f are continuous (i.e., f \in C^2), then the mixed partials are equal:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

This means the Hessian is a symmetric matrix: H_f = H_f^T. Consequently, the Hessian has real eigenvalues and orthogonal eigenvectors. All the machinery of symmetric matrix theory applies.

Definition

Directional Curvature

For a direction h \neq 0, the scalar

\frac{h^\top H_f(x) h}{\|h\|_2^2}

is the curvature of the second-order Taylor model in that direction. For a symmetric Hessian, the smallest and largest possible directional curvatures are the smallest and largest eigenvalues of H_f(x). This is the Rayleigh quotient view: eigenvalues are not decorative facts; they are the extremal curvatures the optimizer can feel.
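A short sketch of this Rayleigh-quotient reading, assuming NumPy; the matrix below is an illustrative symmetric Hessian, not one derived in the text.

```python
import numpy as np

H = np.array([[4.0, 2.0],
              [2.0, 12.0]])           # illustrative symmetric Hessian

def directional_curvature(H, h):
    """Rayleigh quotient: curvature of the local quadratic model along h."""
    return h @ H @ h / (h @ h)

eigvals, eigvecs = np.linalg.eigh(H)  # symmetric: real eigenvalues, orthonormal eigenvectors
print(eigvals)                                           # the extremal directional curvatures
print(directional_curvature(H, eigvecs[:, 0]))           # equals the smallest eigenvalue
print(directional_curvature(H, np.array([1.0, 1.0])))    # strictly between the two
```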

Second-Order Taylor Expansion

The Hessian appears in the second-order Taylor expansion of f around a point x:

f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H_f(x) \, h

The three terms have clear meanings:

  • f(x): the value at the current point
  • \nabla f(x)^T h: the linear (first-order) change. The gradient tells you the slope
  • \frac{1}{2} h^T H_f(x) \, h: the quadratic (second-order) change. The Hessian tells you the curvature

This quadratic approximation is the basis for Newton's method and for classifying critical points.
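As a quick check of the quadratic model, here is a sketch assuming NumPy; the function, gradient, and Hessian are the hand-derived ones from the worked example later on this page, and the expansion point and step are arbitrary.

```python
import numpy as np

# f(x, y) = x^2 y + y^3, with its gradient and Hessian written out by hand
f = lambda v: v[0]**2 * v[1] + v[1]**3
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2 + 3 * v[1]**2])
hess = lambda v: np.array([[2 * v[1], 2 * v[0]],
                           [2 * v[0], 6 * v[1]]])

x = np.array([0.5, 1.0])        # arbitrary expansion point
h = np.array([0.01, -0.02])     # small step

quadratic_model = f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h
print(f(x + h), quadratic_model)   # agree up to an O(||h||^3) error
```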

The Second Derivative Test

Theorem

Second Derivative Test (Multivariate)

Statement

Let x^* be a critical point of f (i.e., \nabla f(x^*) = 0). Then:

  • If H_f(x^*) is positive definite (all eigenvalues > 0), then x^* is a strict local minimum.
  • If H_f(x^*) is negative definite (all eigenvalues < 0), then x^* is a strict local maximum.
  • If H_f(x^*) is indefinite (has both positive and negative eigenvalues), then x^* is a saddle point.
  • If H_f(x^*) is positive (or negative) semidefinite (some eigenvalue is zero), the test is inconclusive: higher-order terms are needed.

Intuition

At a critical point, \nabla f = 0, so the Taylor expansion becomes f(x^* + h) \approx f(x^*) + \frac{1}{2} h^T H h. If H is positive definite, the quadratic form h^T H h > 0 for all h \neq 0, so f increases in every direction from x^*. It is a minimum. If H is indefinite, f increases in some directions and decreases in others: a saddle point.

Proof Sketch

At the critical point x^*, Taylor's theorem with remainder gives:

f(x^* + h) = f(x^*) + \frac{1}{2} h^T H_f(x^*) h + o(\|h\|^2)

If H_f(x^*) is positive definite with minimum eigenvalue \lambda_{\min} > 0, then h^T H h \geq \lambda_{\min} \|h\|^2, so for sufficiently small h:

f(x^* + h) \geq f(x^*) + \frac{\lambda_{\min}}{2}\|h\|^2 + o(\|h\|^2) > f(x^*)

The o(\|h\|^2) term is dominated by the quadratic term for small \|h\|. The indefinite case follows by choosing h along eigenvectors with positive and negative eigenvalues.

Why It Matters

This test is how you verify that a critical point found by setting \nabla f = 0 is actually a minimum (or maximum or saddle). In optimization, you need to know that your converged solution is a local minimum, not a saddle point. In deep learning, this reveals the structure of the loss landscape: are we stuck at a saddle point, or have we found a genuine minimum?

Failure Mode

The test is inconclusive when H is semidefinite (has a zero eigenvalue). Example: f(x, y) = x^4 + y^2 at the origin has H = \text{diag}(0, 2), which is positive semidefinite. The origin is a minimum, but the second derivative test cannot confirm this: you need to examine the fourth-order term. Also, the test is local: a positive definite Hessian at x^* does not guarantee x^* is a global minimum.
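A compact way to apply the test in code is to compute the eigenvalues of the symmetric Hessian and branch on their signs, with a tolerance to flag the semidefinite, inconclusive case. This is a sketch assuming NumPy; classify_critical_point and the example matrices are hypothetical, chosen only to exercise each branch.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Second derivative test from the eigenvalues of a symmetric Hessian at a critical point."""
    eigvals = np.linalg.eigvalsh(H)
    if np.min(eigvals) > tol:
        return "strict local minimum"
    if np.max(eigvals) < -tol:
        return "strict local maximum"
    if np.min(eigvals) < -tol and np.max(eigvals) > tol:
        return "saddle point"
    return "inconclusive (semidefinite: a near-zero eigenvalue)"

print(classify_critical_point(np.diag([3.0, 1.0])))    # strict local minimum
print(classify_critical_point(np.diag([-3.0, 2.0])))   # saddle point
print(classify_critical_point(np.diag([0.0, 2.0])))    # inconclusive
```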

The Hessian in Optimization

Newton's Method

Newton's method for minimizing f uses the Hessian directly. At each iteration:

x_{t+1} = x_t - [H_f(x_t)]^{-1} \nabla f(x_t)

This is equivalent to minimizing the quadratic Taylor approximation at each step. When f is a quadratic with positive definite Hessian, Newton's method converges in one step. For general smooth functions near a minimum, Newton's method converges quadratically (doubling the number of correct digits per iteration).

The cost: computing and inverting the n \times n Hessian is O(n^2) storage and O(n^3) computation. For deep networks with millions of parameters, this is prohibitive.
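Here is a minimal sketch of one Newton step on a strictly convex quadratic, assuming NumPy; the matrix, vector, and starting point are illustrative. In practice one solves the linear system H d = \nabla f rather than forming the inverse.

```python
import numpy as np

# f(x) = 0.5 x^T A x + b^T x with A positive definite: gradient A x + b, Hessian A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([-1.0, 4.0])

x = np.array([10.0, -7.0])                # arbitrary starting point
step = np.linalg.solve(A, A @ x + b)      # solve H d = grad f instead of inverting H
x_next = x - step

print(x_next)                             # one step lands on the exact minimizer
print(np.linalg.solve(A, -b))             # minimizer: solution of A x = -b
```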

Quasi-Newton Methods

Quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian using only gradient information. L-BFGS stores a low-rank approximation using the last m gradient differences, requiring only O(mn) storage. These methods can achieve superlinear convergence on smooth problems: faster than gradient descent, slower than Newton, and at a fraction of the storage cost.
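To see a quasi-Newton method in use, here is a sketch assuming SciPy is available; the Rosenbrock test function, its gradient helper, and the starting point are illustrative choices, not part of the discussion above.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS builds a low-rank curvature approximation from recent gradient differences;
# only gradients are supplied, never the Hessian.
x0 = np.full(10, 1.5)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(result.x)       # close to the all-ones minimizer of the Rosenbrock function
print(result.nit)     # number of iterations used
```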

Hessian-Vector Products

You often do not need the full Hessian, only its action on a vector v: H_f(x) v. A Hessian-vector product avoids storing the n \times n matrix. The finite-difference identity is:

H_f(x) \, v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}

In automatic differentiation, Pearlmutter's trick computes the exact Hessian-vector product as a Jacobian-vector product of the gradient, usually within a small constant factor of a gradient evaluation. Hessian-vector products enable Krylov methods, Newton-CG, influence-function approximations, and spectral diagnostics without materializing the full Hessian.
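A sketch of the finite-difference route, assuming NumPy; a central difference is used here because it is more accurate than the one-sided form of the identity above, and the gradient is the hand-derived one for the two-variable example on this page. The evaluation point and direction are arbitrary.

```python
import numpy as np

def hvp_finite_difference(grad, x, v, eps=1e-6):
    """Approximate H_f(x) v from two gradient evaluations (central difference)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Gradient of f(x, y) = x^2 y + y^3, written out by hand
grad = lambda p: np.array([2 * p[0] * p[1], p[0]**2 + 3 * p[1]**2])

x = np.array([2.0, -1.0])
v = np.array([1.0, -1.0])
print(hvp_finite_difference(grad, x, v))   # close to the exact H v = (-6, 10)
```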

Hessian Eigenvalues and the Loss Landscape

In deep learning, the Hessian of the loss with respect to the parameters reveals the geometry of the loss landscape:

  • Eigenvalue spectrum: the distribution of Hessian eigenvalues tells you about curvature. A few large eigenvalues with many near zero suggest low-dimensional structure in the loss landscape.
  • Sharp vs. flat minima: large eigenvalues mean the loss rises quickly in some directions. Small eigenvalues mean locally flat directions. The connection between flatness and generalization is useful but not invariant under all reparameterizations, so treat it as a diagnostic, not a theorem.
  • Saddle points: high-dimensional non-convex objectives contain many saddle-like regions. A random symmetric matrix is indefinite with high probability, which makes saddle behavior the natural baseline rather than an exotic failure case. The ratio of negative eigenvalues to total eigenvalues is the index of the saddle point.
| Hessian signal | What it suggests | ML consequence |
| --- | --- | --- |
| Large positive top eigenvalue | A steep direction | Step size may need clipping, damping, or normalization |
| Many near-zero eigenvalues | Flat or weakly identified directions | Parameters can move with little loss change |
| Negative eigenvalues | Local descent direction from a critical point | The point is not a local minimum |
| Large condition number | Curvature varies by direction | First-order methods zigzag; Newton systems need damping |
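One common diagnostic built from Hessian-vector products is power iteration for the top eigenvalue. Here is a sketch assuming NumPy, with a small explicit matrix standing in for an HVP that would normally come from automatic differentiation; note that plain power iteration converges to the eigenvalue of largest magnitude, which may be negative.

```python
import numpy as np

def top_eigenvalue_via_hvp(hvp, dim, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue using only Hessian-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return v @ hvp(v)   # Rayleigh quotient at the converged direction

# Illustrative stand-in: in practice hvp would be an autodiff Hessian-vector product
H = np.array([[4.0, 2.0],
              [2.0, 12.0]])
print(top_eigenvalue_via_hvp(lambda v: H @ v, dim=2))   # ~12.47, the top eigenvalue of H
```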

Canonical Examples

Example

Hessian of a quadratic form

Let f(x) = \frac{1}{2} x^T A x + b^T x + c where A is symmetric. The gradient is \nabla f(x) = Ax + b, and the Hessian is:

H_f(x) = A

The Hessian is constant, independent of x. The function is convex if and only if A is positive semidefinite. For quadratics, the curvature is the same everywhere, which is why Newton's method converges in one step.
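A quick check of that convexity criterion, assuming NumPy; the matrix A below is illustrative.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])           # symmetric, eigenvalues 1 and 3

eigvals = np.linalg.eigvalsh(A)       # the Hessian of the quadratic is A everywhere
print(eigvals, bool(np.all(eigvals >= 0)))   # all nonnegative, so the quadratic is convex
```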

Example

Hessian of a simple two-variable function

Let f(x, y) = x^2 y + y^3.

Gradient: \nabla f = (2xy, \; x^2 + 3y^2)^T.

Second partial derivatives:

  • \frac{\partial^2 f}{\partial x^2} = 2y
  • \frac{\partial^2 f}{\partial x \partial y} = 2x
  • \frac{\partial^2 f}{\partial y^2} = 6y

Hessian:

H_f(x, y) = \begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}

At the origin (0, 0): H_f = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, the zero matrix. The second derivative test is inconclusive. (Indeed, the origin is a degenerate critical point.)

At (0, 1): H_f = \begin{pmatrix} 2 & 0 \\ 0 & 6 \end{pmatrix} is positive definite, but this does not make (0, 1) a local minimum because \nabla f(0, 1) = (0, 3)^\top \neq 0. The second derivative test only classifies critical points. Away from a critical point, the gradient still dominates the local behavior.
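Verifying those two points numerically, in a short sketch assuming NumPy and using the hand-derived Hessian from above:

```python
import numpy as np

hess = lambda x, y: np.array([[2 * y, 2 * x],
                              [2 * x, 6 * y]])

print(np.linalg.eigvalsh(hess(0.0, 0.0)))   # [0, 0]: zero matrix, test inconclusive
print(np.linalg.eigvalsh(hess(0.0, 1.0)))   # [2, 6]: positive definite, but (0, 1) is not a critical point
```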

Common Confusions

Watch Out

The Hessian is NOT the same as the outer product of gradients

In deep learning, the Fisher information matrix

\mathbb{E}[\nabla \log p \cdot (\nabla \log p)^T]

is sometimes confused with the Hessian. They are different objects. The Hessian involves second derivatives of a single function; the Fisher involves first derivatives averaged over data. They coincide only under specific conditions (e.g., for the negative log-likelihood of an exponential family at the true parameters).

Watch Out

Positive definite Hessian at a point does not mean global convexity

H_f(x^*) \succ 0 means f is locally convex near x^*. The function could be non-convex elsewhere. Global convexity requires H_f(x) \succeq 0 for all x, which is a much stronger condition.

Watch Out

Positive definite Hessian does not classify a non-critical point

The second derivative test starts after \nabla f(x^*) = 0. If the gradient is nonzero, a positive definite Hessian only says the local quadratic model curves upward. The function can still decrease along the negative gradient direction, so there is no local-minimum conclusion.

Watch Out

The Hessian exists but may be useless to compute explicitly

For a function of n variables, the Hessian is an n \times n matrix. For a neural network with n = 10^8 parameters, the Hessian has 10^{16} entries. It cannot be stored, let alone inverted. This is why Hessian-vector products and low-rank approximations (L-BFGS, Kronecker-factored approximations) are standard in practice.

Summary

  • The Hessian H_f(x) is the n \times n matrix of second partial derivatives: [H]_{ij} = \partial^2 f / \partial x_i \partial x_j
  • Schwarz's theorem: if f \in C^2, the Hessian is symmetric
  • Second-order Taylor: f(x+h) \approx f(x) + \nabla f^T h + \frac{1}{2} h^T H h
  • Second derivative test: positive definite \Rightarrow local min, negative definite \Rightarrow local max, indefinite \Rightarrow saddle
  • Newton's method: x_{t+1} = x_t - H^{-1} \nabla f uses the Hessian directly, converges quadratically
  • Hessian-vector products can be computed without storing the full Hessian
  • Hessian eigenvalues reveal loss landscape geometry: sharp vs. flat minima, saddle point structure

Exercises

Exercise (Core)

Problem

Compute the Hessian of f(x, y) = x^2 y + y^3 at the point (1, 2). Determine whether the Hessian at this point is positive definite, negative definite, or indefinite.

Exercise (Advanced)

Problem

At a point x, suppose \nabla f(x) = 0 and the Hessian has eigenvalues (-3, 0.2, 5). What does the second derivative test conclude? What would change if the eigenvalues were (0, 0.2, 5)?

Exercise (Advanced)

Problem

Let f(x) = \|Ax - b\|^2 for A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m. Compute \nabla f(x) and H_f(x). Under what condition on A is the Hessian positive definite (guaranteeing a unique global minimum)?

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and 6
  • Boyd & Vandenberghe, Convex Optimization (2004), Sections 3.1-3.2 and Appendix A
  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 5-6
  • Pearlmutter, "Fast Exact Multiplication by the Hessian" (1994), Neural Computation 6(1)

Current:

  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review 60(2)
  • Ghorbani, Krishnan, Xiao, "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" (2019), arXiv:1901.10159
  • Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" (2017), arXiv:1609.04836
  • Dinh et al., "Sharp Minima Can Generalize For Deep Nets" (2017), arXiv:1703.04933 (counterpoint to sharpness-based generalization claims)
  • Dauphin et al., "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization" (2014), arXiv:1406.2572
  • Martens & Grosse, "Optimizing Neural Networks with Kronecker-Factored Approximate Curvature" (2015), arXiv:1503.05671 (K-FAC)

Next Topics

The natural next steps from the Hessian:

  • Newton's method: using the Hessian for second-order optimization, convergence theory, and practical modifications
  • Convex optimization basics: where the Hessian being positive semidefinite everywhere guarantees a global minimum
  • Automatic differentiation: how modern frameworks compute gradients and Hessian-vector products without symbolic algebra
  • Neural network optimization landscape: how curvature, saddle points, and sharpness appear in deep learning

Last reviewed: April 26, 2026
