
Optimization Function Classes

Gradient Flow and Vanishing Gradients

Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.


Why This Matters

Training a neural network means computing gradients of the loss with respect to every parameter, then updating those parameters via gradient descent. In a deep network, gradients must propagate backward through many layers. If the gradient shrinks at each layer, it vanishes by the time it reaches the early layers. If it grows, it explodes. Both cases make training fail.

This is not a theoretical curiosity. Vanishing gradients blocked progress in deep learning for over a decade (roughly 1995 to 2010). The solutions — ReLU activations, skip connections, and normalization layers — are in every modern architecture. Understanding why these solutions work requires understanding the gradient flow problem they solve.

[Figure: gradient magnitude (log scale) vs. layers from output (backprop direction), for sigmoid, ReLU, and ResNet (skip connections). The sigmoid curve decays into the vanishing zone, reaching $0.25^{20} \approx 10^{-12}$ by layer 20.]

Mental Model

Consider a chain of $L$ multiplications: $g_1 \cdot g_2 \cdots g_L$. If each $g_i < 1$, the product goes to 0 exponentially fast. If each $g_i > 1$, the product goes to infinity. Only if each $g_i \approx 1$ does the product stay bounded and nonzero.

Backpropagation through an $L$-layer network is exactly this: a product of $L$ Jacobian matrices. The singular values of these Jacobians determine whether gradients vanish, explode, or flow stably.
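This compounding is easy to verify numerically. A minimal sketch (the factors 0.25, 1.5, and 1.0 are illustrative stand-ins for per-layer gradient scales):

```python
def chain_product(factor: float, depth: int) -> float:
    """Multiply `depth` copies of `factor`, mimicking a backprop chain."""
    return factor ** depth

# Each g_i < 1: the product vanishes exponentially.
print(chain_product(0.25, 20))   # ~9.1e-13
# Each g_i > 1: the product explodes exponentially.
print(chain_product(1.5, 20))    # ~3325
# Each g_i ~ 1: the product stays bounded.
print(chain_product(1.0, 20))    # 1.0
```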

Formal Setup

Consider an $L$-layer feedforward network:

$$x^{(l)} = \sigma(W^{(l)} x^{(l-1)} + b^{(l)}), \quad l = 1, \ldots, L$$

where $\sigma$ is the activation function applied elementwise. Let $z^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)}$ be the pre-activation.

Definition

Gradient Flow via Chain Rule

By the chain rule, the gradient of the loss $\mathcal{L}$ with respect to the parameters of layer $l$ involves:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial x^{(L)}} \cdot \prod_{k=l+1}^{L} \frac{\partial x^{(k)}}{\partial x^{(k-1)}} \cdot \frac{\partial x^{(l)}}{\partial W^{(l)}}$$

The middle product of $L - l$ Jacobian matrices is where gradients vanish or explode.

Definition

Layer Jacobian

The Jacobian of layer $l$ is:

$$J^{(l)} = \frac{\partial x^{(l)}}{\partial x^{(l-1)}} = \text{diag}(\sigma'(z^{(l)})) \cdot W^{(l)}$$

where $\text{diag}(\sigma'(z^{(l)}))$ is a diagonal matrix of activation derivatives. The gradient through $L - l$ layers is the product $J^{(L)} J^{(L-1)} \cdots J^{(l+1)}$.
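A quick numerical check of this formula, using a small sigmoid layer with arbitrary random weights (dimensions and seed are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))
b = rng.normal(size=d)
x = rng.normal(size=d)
z = W @ x + b

# Analytic layer Jacobian: diag(sigma'(z)) @ W
J = np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W

# Central finite differences of the map x -> sigmoid(W x + b)
eps = 1e-6
J_fd = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J_fd[:, j] = (sigmoid(W @ (x + e) + b) - sigmoid(W @ (x - e) + b)) / (2 * eps)

print(np.max(np.abs(J - J_fd)))  # tiny: the formula matches
```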

Main Theorems

Theorem

Jacobian Chain Gradient Bound

Statement

Let $\|\sigma'(z)\|_\infty \leq \gamma$ for all pre-activations $z$, and let $\|W^{(l)}\|_2 \leq \rho$ for all layers $l$. Then the gradient norm satisfies:

$$\left\| \prod_{k=l+1}^{L} J^{(k)} \right\|_2 \leq (\gamma \rho)^{L - l}$$

If $\gamma \rho < 1$, the gradient vanishes exponentially in $L - l$. If $\gamma \rho > 1$, the gradient can explode exponentially in $L - l$. Stable gradient flow requires $\gamma \rho \approx 1$.

Intuition

Each layer multiplies the gradient by a factor of approximately $\gamma \rho$. After $L - l$ layers, this compounds exponentially. For sigmoid activations, $\gamma = 1/4$ (the maximum of $\sigma'$), so even with well-conditioned weights ($\rho \approx 1$), the product $\gamma \rho \approx 0.25$. After 20 layers: $0.25^{20} \approx 10^{-12}$.

Proof Sketch

Each Jacobian $J^{(k)} = \text{diag}(\sigma'(z^{(k)})) W^{(k)}$ has spectral norm at most $\gamma \rho$ by the submultiplicativity of spectral norms. The product of $L - l$ such matrices has spectral norm at most $(\gamma \rho)^{L-l}$.

Why It Matters

This bound explains why sigmoid networks deeper than 5 to 10 layers are nearly impossible to train with standard gradient descent. The bound also prescribes the fix: choose $\sigma$ and initialize $W$ so that $\gamma \rho \approx 1$.

Failure Mode

This is a worst-case bound. In practice, the Jacobian matrices are not all at their worst-case spectral norm simultaneously. The actual gradient can be larger or smaller depending on the data distribution and the correlations between successive Jacobians. Tighter analysis uses random matrix theory (e.g., the mean field theory approach).

Proposition

Sigmoid Gradient Saturation

Statement

For the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$, the derivative is:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

The maximum value is $\sigma'(0) = 1/4$. For $|z| > 5$, the derivative is less than $0.007$. This means:

  1. Even at the best point, sigmoid shrinks gradients by a factor of 4 per layer.
  2. When neurons saturate ($|z|$ large), gradients effectively die.

Intuition

The sigmoid squashes all inputs to $(0, 1)$. At the extremes, the function is nearly flat, so the derivative is nearly zero. Since backpropagation multiplies by this derivative at each layer, saturated neurons block gradient flow completely.

Proof Sketch

Differentiate $\sigma(z) = (1 + e^{-z})^{-1}$ to get $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This is maximized when $\sigma(z) = 1/2$, i.e., $z = 0$, giving $\sigma'(0) = 1/4$. For $z = 5$: $\sigma(5) \approx 0.9933$, so $\sigma'(5) \approx 0.0066$.
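These values are easy to confirm numerically (the helper names here are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    """Sigmoid derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

print(dsigmoid(0.0))   # 0.25: the maximum
print(dsigmoid(5.0))   # ~0.0066: nearly dead gradient
print(dsigmoid(-5.0))  # same, by symmetry
```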

Why It Matters

This single property of the sigmoid function delayed deep learning by over a decade. The switch from sigmoid to ReLU (Glorot et al., 2011) was one of the key enablers of training networks with more than a few layers.

Failure Mode

Tanh has the same saturation problem, though its maximum derivative is 1 (at $z = 0$) instead of 1/4. This makes tanh better than sigmoid but still prone to saturation for large activations.

Activation Functions and Gradient Flow

ReLU ($\sigma(z) = \max(0, z)$) has derivative 1 for $z > 0$ and 0 for $z < 0$. This solves the shrinking problem: $\gamma = 1$ for active neurons. But it creates a new problem: neurons with $z < 0$ have zero gradient. If a neuron's pre-activation becomes permanently negative, it receives no gradient updates and is "dead." This is the dying ReLU problem. Proper weight initialization (He initialization) reduces the fraction of dead neurons at the start of training.
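A small experiment (width, depth, and seed are arbitrary choices) illustrating why the initialization scale matters for ReLU stacks: He-scaled weights ($\mathrm{Var} = 2/d_{\text{in}}$) keep activation magnitudes stable across depth, while a smaller scale drives them toward zero.

```python
import numpy as np

def forward_relu(depth, width, scale, seed=0):
    """Push a random input through `depth` ReLU layers; return final RMS activation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(width, width))
        x = np.maximum(0.0, W @ x)
    return np.sqrt(np.mean(x ** 2))

width = 256
print(forward_relu(20, width, np.sqrt(2.0 / width)))  # He init: stays O(1)
print(forward_relu(20, width, np.sqrt(1.0 / width)))  # too small: shrinks ~1000x
```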

Leaky ReLU ($\sigma(z) = \max(\alpha z, z)$ for small $\alpha > 0$) fixes dying neurons by allowing a small gradient for negative inputs.

GELU and SiLU (used in modern transformers) are smooth approximations of ReLU that avoid the non-differentiability at $z = 0$ while preserving the non-saturating property for large positive inputs.

Skip Connections

The most effective fix for vanishing gradients is the skip (residual) connection:

$$x^{(l)} = x^{(l-1)} + f^{(l)}(x^{(l-1)})$$

The Jacobian becomes:

$$J^{(l)} = I + \frac{\partial f^{(l)}}{\partial x^{(l-1)}}$$

The identity matrix $I$ ensures the gradient always has a component with magnitude 1, regardless of $\partial f / \partial x$. The product of such Jacobians across layers retains identity-like terms that prevent exponential decay.
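A sketch of this effect, comparing a plain chain of contractive Jacobians against a residual chain built from the same branch Jacobians (the dimension, depth, and ~0.2 branch norm are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 30
# Random branch Jacobians df/dx with small spectral norm (~0.2 here).
branches = [0.1 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]

plain = np.eye(d)
residual = np.eye(d)
for B in branches:
    plain = B @ plain                      # plain chain: product of contractions
    residual = (np.eye(d) + B) @ residual  # residual chain: identity term survives

print(np.linalg.norm(plain, 2))     # vanishes exponentially
print(np.linalg.norm(residual, 2))  # stays of order 1
```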

Normalization Layers

Batch normalization and layer normalization help gradient flow by keeping pre-activations in a range where activation derivatives are nonzero. By normalizing to zero mean and unit variance, they prevent the drift into saturation regions.

For sigmoid/tanh: normalization keeps $z$ near 0, where $\sigma'$ is maximal. For ReLU: normalization keeps approximately half of the neurons active.
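A minimal sketch of this mechanism for sigmoid, assuming pre-activations that have drifted into saturation (the mean and spread are illustrative):

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Normalize to zero mean, unit variance across features."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

rng = np.random.default_rng(0)
z = rng.normal(loc=4.0, scale=3.0, size=1000)  # drifted pre-activations

print(dsigmoid(z).mean())              # saturated: tiny average derivative
print(dsigmoid(layer_norm(z)).mean())  # normalized: derivatives near the 1/4 maximum
```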

Gradient Clipping

For exploding gradients, the standard fix is gradient clipping: if the gradient norm exceeds a threshold $c$, rescale it:

$$g \leftarrow \frac{c}{\|g\|} g \quad \text{when } \|g\| > c$$

This does not change the gradient direction, only its magnitude. It prevents parameter updates from being catastrophically large.
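The rule is a few lines of code. A sketch (the threshold and gradient values are illustrative):

```python
import numpy as np

def clip_grad_norm(g, c):
    """Rescale g to norm c if its norm exceeds c; direction is unchanged."""
    norm = np.linalg.norm(g)
    if norm > c:
        return (c / norm) * g
    return g

g = np.array([30.0, 40.0])             # norm 50
print(clip_grad_norm(g, 5.0))          # [3. 4.]: norm 5, same direction
print(clip_grad_norm(g / 100.0, 5.0))  # small gradient: returned unchanged
```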

Quantifying Gradient Pathology

The product-of-Jacobians formula makes gradient pathology precise. Define the end-to-end Jacobian from layer $l$ to layer $L$ as:

$$\mathcal{J}_{l:L} = \prod_{k=l+1}^{L} J^{(k)} = \prod_{k=l+1}^{L} \text{diag}(\sigma'(z^{(k)})) W^{(k)}$$

The gradient norm at layer $l$ satisfies $\|\partial \mathcal{L} / \partial x^{(l)}\|_2 \leq \|\partial \mathcal{L} / \partial x^{(L)}\|_2 \cdot \|\mathcal{J}_{l:L}\|_2$.

By the submultiplicativity of the spectral norm, this factors into a product of per-layer bounds:

$$\|\mathcal{J}_{l:L}\|_2 \leq \prod_{k=l+1}^{L} \|J^{(k)}\|_2$$

Each factor is bounded: $\|J^{(k)}\|_2 = \|\text{diag}(\sigma'(z^{(k)})) W^{(k)}\|_2 \leq \gamma_k \rho_k$, where $\gamma_k = \max_j |\sigma'(z^{(k)}_j)|$ and $\rho_k = \|W^{(k)}\|_2$ is the spectral norm of the weight matrix.

The condition for vanishing is: $\prod_{k} \gamma_k \rho_k \to 0$ as $L \to \infty$.

For sigmoid with $\gamma = 1/4$ and unit-spectral-norm weights ($\rho = 1$): $\prod \gamma_k \rho_k = (1/4)^{L-l}$, which falls below $10^{-10}$ by layer 17.

The condition for stability: $\gamma_k \rho_k \approx 1$ at every layer. This prescribes two simultaneous requirements:

  1. Activation derivatives near 1: use ReLU ($\gamma = 1$ for active neurons) or careful normalization to keep sigmoid/tanh out of saturation.
  2. Weight spectral norm near 1: achieved via weight initialization (He init sets the expected spectral norm close to 1) and spectral normalization.

The Jacobian $J^{(k)}$ is rectangular if the layer changes dimension; its spectral norm is $\sigma_{\max}(J^{(k)})$. Computing the exact spectral norm requires the full SVD, which is why approximations (power iteration) are used in spectral normalization.
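A sketch of the power-iteration estimate (the matrix here is an arbitrary stand-in for a layer Jacobian):

```python
import numpy as np

def spectral_norm(A, n_iters=100, seed=0):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = A @ v          # one step of A
        v = A.T @ u        # one step of A^T
        v /= np.linalg.norm(v)
    return np.linalg.norm(A @ v)

A = np.random.default_rng(1).normal(size=(8, 5))  # rectangular, as in a resizing layer
print(spectral_norm(A), np.linalg.svd(A, compute_uv=False)[0])  # should agree
```

In practice, spectral-normalization layers amortize this by running one or two iterations per training step and reusing the vector.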

One practical diagnostic: compute the ratio $\|\partial \mathcal{L} / \partial W^{(1)}\|_F / \|\partial \mathcal{L} / \partial W^{(L)}\|_F$ at initialization. If this ratio is below $10^{-3}$ for a 10-layer network, the architecture will not train without architectural changes.
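A sketch of this diagnostic on a hand-rolled 10-layer sigmoid network (the width, init scale, and sum-of-outputs loss are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 32

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Ws = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(L)]
x = rng.normal(size=d)

# Forward pass, caching inputs and pre-activations per layer.
acts, zs = [x], []
for W in Ws:
    zs.append(W @ acts[-1])
    acts.append(sigmoid(zs[-1]))

# Backward pass for a simple loss: L = sum of the final activations.
delta = np.ones(d)                # dL/dx^(L)
grads = [None] * L
for k in reversed(range(L)):
    delta = delta * sigmoid(zs[k]) * (1 - sigmoid(zs[k]))  # through activation
    grads[k] = np.outer(delta, acts[k])                    # dL/dW^(k)
    delta = Ws[k].T @ delta                                # through weights

ratio = np.linalg.norm(grads[0]) / np.linalg.norm(grads[-1])
print(ratio)  # far below 1e-3: this sigmoid stack will not train
```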

Common Confusions

Watch Out

Vanishing gradients are not the same as zero loss gradient

Vanishing gradients mean the gradient signal shrinks as it propagates backward through layers. The loss gradient (at the output) can be large, but by the time it reaches layer 1, it has been multiplied by many small factors. This is a propagation problem, not a signal problem.

Watch Out

ReLU does not fully solve vanishing gradients

ReLU sets $\gamma = 1$ for active neurons, but neurons that are inactive ($z < 0$) still have zero gradient. In a poorly initialized network, a large fraction of neurons can be dead. The vanishing gradient problem becomes a dead neuron problem. Proper initialization (He initialization: $W_{ij} \sim \mathcal{N}(0, 2/d_{\text{in}})$) is still necessary.

Watch Out

Gradient clipping is for exploding, not vanishing gradients

Gradient clipping caps the magnitude of large gradients. It does nothing for vanishing gradients. When gradients are too small, clipping has no effect. The fix for vanishing gradients is architectural: better activations, skip connections, and normalization.

Exercises

ExerciseCore

Problem

A 15-layer network uses sigmoid activations and has weight matrices with spectral norm 1. Compute an upper bound on the gradient magnitude ratio between layer 15 and layer 1.

ExerciseAdvanced

Problem

Show that the Jacobian of a residual block $x^{(l)} = x^{(l-1)} + f(x^{(l-1)})$ has minimum singular value bounded below by $1 - \|\partial f / \partial x\|_2$, which is strictly positive whenever $\|\partial f / \partial x\|_2 < 1$. What does this imply for gradient flow? (Note: the bound does not say the singular values are at least 1. Counterexample: if $\partial f / \partial x = -\tfrac{1}{2} I$, then $J = \tfrac{1}{2} I$ and every singular value is $1/2$.)


References

Canonical:

  • Hochreiter, "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions" (1998). Original characterization of the product-of-Jacobians problem
  • He et al., "Deep Residual Learning for Image Recognition" (CVPR 2016), Section 3. Empirical and theoretical case for skip connections
  • Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (AISTATS 2010), Sections 2-3. Xavier initialization derivation

Activation functions:

  • He et al., "Delving Deep into Rectifiers" (ICCV 2015), Sections 2-3. He initialization for ReLU, $\mathcal{N}(0, 2/d_{\text{in}})$
  • Nair & Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (ICML 2010)
  • Glorot, Bordes & Bengio, "Deep Sparse Rectifier Neural Networks" (AISTATS 2011), Section 3

Theory:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 8.2 (challenges in optimization) and Chapter 6.3 (gradient pathology)


Last reviewed: April 14, 2026
