Training Techniques
Weight Initialization
Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.
Prerequisites
Why This Matters
A neural network with bad initialization cannot train. If weights are too large, activations and gradients explode exponentially with depth. If weights are too small, they vanish exponentially. Proper initialization preserves gradient flow across layers. The entire field of deep learning was stuck on shallow networks for decades partly because of this problem. Xavier and He initialization solved it for standard architectures, and techniques like batch normalization further reduce sensitivity to the initial scale.
Mental Model
Consider a deep network as a chain of matrix multiplications. If each matrix has eigenvalues with magnitude greater than 1, the product grows exponentially. If less than 1, it shrinks exponentially. Good initialization sets the weight matrices so their effect on signal magnitude is approximately 1 per layer. The signal (activations in the forward pass, gradients in the backward pass) neither grows nor shrinks.
Formal Setup
Consider a feedforward network with $L$ layers and no activation function (for now):
$$x^{(l)} = W^{(l)} x^{(l-1)}, \qquad l = 1, \dots, L$$
where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $x^{(0)}$ is the input. The output is $x^{(L)} = W^{(L)} W^{(L-1)} \cdots W^{(1)} x^{(0)}$.
Variance Preservation Property
An initialization scheme satisfies variance preservation if, for each layer $l$:
$$\mathrm{Var}\big(x_i^{(l)}\big) = \mathrm{Var}\big(x_i^{(l-1)}\big)$$
where $x_i^{(l)}$ denotes the $i$-th component of the activation vector at layer $l$. Equivalently, the signal magnitude stays constant across layers at initialization.
The Problem with Naive Initialization
If $W^{(l)}$ has entries drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ and $x^{(l-1)}$ has zero-mean entries with variance $v$, then:
$$\mathrm{Var}\big(x_i^{(l)}\big) = n_{l-1}\,\sigma^2\, v$$
Each layer multiplies the variance by $n_{l-1}\sigma^2$. After $L$ layers:
$$\mathrm{Var}\big(x_i^{(L)}\big) = \left(\prod_{l=1}^{L} n_{l-1}\,\sigma^2\right) \mathrm{Var}\big(x_i^{(0)}\big)$$
If $n_{l-1}\sigma^2 > 1$ for all layers, activations explode. If $n_{l-1}\sigma^2 < 1$, they vanish. For a network with 50 layers and $n\sigma^2 = 1.1$, the activation variance grows by a factor of $1.1^{50} \approx 117$. At $n\sigma^2 = 0.9$, it shrinks by a factor of $0.9^{50} \approx 0.005$.
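A minimal sketch of this multiplier argument (the width, depth, and seed below are illustrative choices, not from the source): push a unit-variance vector through a deep linear network and watch the output variance under per-layer multipliers of 1.1, 0.9, and 1.0.

```python
import numpy as np

# Empirical check: in a deep linear network, the per-layer variance
# multiplier n * sigma^2 drives exponential growth or decay.
rng = np.random.default_rng(0)
n, depth = 256, 50

def final_variance(sigma2):
    x = rng.standard_normal(n)  # unit-variance input
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
        x = W @ x
    return x.var()

explode = final_variance(1.1 / n)  # multiplier 1.1 per layer -> ~1.1^50
vanish = final_variance(0.9 / n)   # multiplier 0.9 per layer -> ~0.9^50
stable = final_variance(1.0 / n)   # multiplier 1.0 per layer -> ~1

print(f"explode: {explode:.3g}, vanish: {vanish:.3g}, stable: {stable:.3g}")
```

Individual runs fluctuate (the product of random matrices is noisy), but the three regimes are clearly separated even at depth 50.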
Main Theorems
Xavier/Glorot Initialization
Statement
For a layer with $n_{in}$ input units and $n_{out}$ output units, initializing weights as:
$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{in} + n_{out}}\right)$$
preserves variance in both the forward pass (activations) and the backward pass (gradients) simultaneously, under the compromise that the forward pass condition requires $\sigma^2 = 1/n_{in}$ and the backward pass requires $\sigma^2 = 1/n_{out}$.
Intuition
Forward pass: each activation is a sum of $n_{in}$ terms. To keep variance at 1, each term should have variance $1/n_{in}$, so $\sigma^2 = 1/n_{in}$. Backward pass: each gradient component is a sum of $n_{out}$ terms, requiring $\sigma^2 = 1/n_{out}$. Xavier takes the harmonic mean of these two requirements: $\sigma^2 = \frac{2}{n_{in} + n_{out}}$.
Proof Sketch
For the forward pass, $y_i = \sum_{j=1}^{n_{in}} W_{ij} x_j$. Since $W_{ij}$ and $x_j$ are independent and zero-mean, $\mathrm{Var}(y_i) = n_{in}\,\sigma^2\,\mathrm{Var}(x_j)$. Setting this equal to $\mathrm{Var}(x_j)$ gives $\sigma^2 = 1/n_{in}$. The backward pass derivation is symmetric with $n_{out}$ in place of $n_{in}$.
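The two conditions in this sketch can be verified directly (layer sizes and seed below are illustrative): with the compromise variance $2/(n_{in}+n_{out})$, the forward variance ratio lands at $n_{in}\sigma^2$ and the backward ratio at $n_{out}\sigma^2$, both close to 1 when the fan-in and fan-out are similar.

```python
import numpy as np

# Check both Xavier conditions on one linear layer.
rng = np.random.default_rng(1)
n_in, n_out, batch = 1000, 800, 4096

sigma2 = 2.0 / (n_in + n_out)  # Xavier compromise variance
W = rng.normal(0.0, np.sqrt(sigma2), size=(n_in, n_out))

x = rng.standard_normal((batch, n_in))   # unit-variance activations
y = x @ W
fwd_ratio = y.var() / x.var()            # ideal: n_in * sigma2

g = rng.standard_normal((batch, n_out))  # unit-variance output gradients
gx = g @ W.T                             # gradient w.r.t. the layer input
bwd_ratio = gx.var() / g.var()           # ideal: n_out * sigma2

print(f"forward ratio {fwd_ratio:.3f}, backward ratio {bwd_ratio:.3f}")
```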
Why It Matters
Xavier initialization enabled training of deep networks with sigmoid and tanh activations. Before Xavier, networks deeper than ~5 layers were considered impractical to train. The paper (Glorot and Bengio, 2010) was a turning point for deep learning.
Failure Mode
Xavier assumes linear or tanh activations near zero. ReLU sets half of its inputs to zero, which halves the effective fan-in. Xavier underestimates the required variance for ReLU networks, leading to signal decay. He initialization fixes this.
He/Kaiming Initialization
Statement
For a layer with $n_{in}$ input units followed by a ReLU activation, initializing weights as:
$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)$$
preserves the variance of activations across layers.
Intuition
ReLU zeros out negative inputs, keeping only the positive half. For a zero-mean symmetric input $z$, this halves the second moment: $\mathbb{E}[\mathrm{ReLU}(z)^2] = \frac{1}{2}\mathbb{E}[z^2]$. The He derivation propagates second moments rather than variances (the pre-activations at the next layer are again zero-mean because weights are zero-mean independent). To compensate for this halving, we double the weight variance compared to Xavier (using $2/n_{in}$ instead of $1/n_{in}$).
Proof Sketch
Let $y_i = \sum_j W_{ij} x_j$, so $\mathbb{E}[y_i^2] = n_{in}\,\sigma^2\,\mathbb{E}[x_j^2]$ (since $y_i$ is zero-mean). After ReLU: $a_i = \max(0, y_i)$. For symmetric zero-mean $y_i$, $\mathbb{E}[a_i^2] = \frac{1}{2}\mathbb{E}[y_i^2]$ (half of the symmetric second moment; the variance is strictly smaller because the mean is no longer zero). Setting $\frac{1}{2}\, n_{in}\,\sigma^2 = 1$ gives $\sigma^2 = 2/n_{in}$.
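Both steps of this sketch can be checked numerically (sample sizes and seed are illustrative): ReLU halves the second moment of a symmetric input, and a single He-initialized ReLU layer maps unit second moment to unit second moment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: ReLU halves the second moment of a symmetric zero-mean input.
z = rng.standard_normal(1_000_000)
halving = np.mean(np.maximum(z, 0.0) ** 2) / np.mean(z**2)  # ~0.5

# Step 2: with sigma^2 = 2/n_in, a ReLU layer preserves the second moment.
n_in, batch = 1000, 4096
x = rng.standard_normal((batch, n_in))                      # E[x^2] = 1
W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_in))
a = np.maximum(x @ W, 0.0)
preserved = np.mean(a**2)                                   # ~1

print(f"halving: {halving:.4f}, layer second moment: {preserved:.4f}")
```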
Why It Matters
He initialization made it possible to train very deep ReLU networks (100+ layers). It was a key ingredient in the ResNet paper (He et al., 2015), which demonstrated that networks with 152 layers could train successfully.
Failure Mode
He initialization assumes standard ReLU. For leaky ReLU with slope $\alpha$ on the negative side, the correction factor is $\frac{2}{1 + \alpha^2}$ instead of 2. For GELU, SiLU, or other smooth activations, the exact correction differs but He initialization is still a reasonable starting point.
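A quick Monte Carlo check of the leaky-ReLU correction factor (the slope $\alpha = 0.2$ below is an arbitrary illustrative choice): the positive half contributes $\frac{1}{2}\mathbb{E}[z^2]$ and the negative half $\frac{\alpha^2}{2}\mathbb{E}[z^2]$, so the required gain is $2/(1+\alpha^2)$.

```python
import numpy as np

# Empirical second-moment gain for leaky ReLU vs the closed form.
rng = np.random.default_rng(3)
alpha = 0.2                                    # illustrative negative slope
z = rng.standard_normal(2_000_000)
lrelu = np.where(z > 0, z, alpha * z)
empirical_gain = np.mean(z**2) / np.mean(lrelu**2)  # ~2 / (1 + alpha^2)

print(f"empirical {empirical_gain:.4f} vs formula {2 / (1 + alpha**2):.4f}")
```

At $\alpha = 0$ this recovers the ReLU factor of 2, and at $\alpha = 1$ (a linear map) it recovers 1, consistent with Xavier.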
The Symmetry Breaking Argument
Zero Initialization Fails
Statement
If all weights in a layer are initialized to the same value (including zero), then all neurons in that layer compute the same function, receive the same gradient, and remain identical throughout training. The layer effectively has only one neuron regardless of its width.
Intuition
At initialization, every neuron in a layer computes $f(w^\top x + b)$ with identical $w$ and $b$. The outputs are identical, so the loss gradient with respect to each neuron's weights is identical. After the gradient update, all weights remain equal. This symmetry is never broken by gradient descent.
Proof Sketch
By induction on the training step. At step 0, all neurons in layer $l$ have identical weight vectors $w$ and bias $b$, so the activations $a_i = f(w^\top x + b)$ are equal for all $i$. The gradient for neuron $i$ depends on $a_i$ and the downstream error signal. Since all $a_i$ are equal and the downstream computation treats all neurons symmetrically, all gradients are equal. The update preserves equality.
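The base case of this induction can be shown concretely with a tiny hypothetical network (sizes, constant value, and the squared-error loss below are illustrative): under constant initialization, the per-neuron weight gradients are exactly identical, so one SGD step keeps all hidden neurons equal.

```python
import numpy as np

# One-hidden-layer network with constant init; manual backprop shows
# every hidden neuron receives the identical gradient.
rng = np.random.default_rng(4)
x = rng.standard_normal(8)        # one input example
W1 = np.full((4, 8), 0.01)        # constant init: all rows equal
w2 = np.full(4, 0.01)
target = 1.0

h = np.tanh(W1 @ x)               # all hidden activations identical
y = w2 @ h
dy = y - target                   # gradient of 0.5 * (y - target)^2
dh = dy * w2 * (1 - h**2)         # backprop through tanh
gW1 = np.outer(dh, x)             # per-neuron weight gradients (rows)

# All rows of gW1 are bitwise identical, so the rows of W1 stay equal
# after any gradient step: the symmetry is never broken.
row_spread = np.abs(gW1 - gW1[0]).max()
print(f"max difference between per-neuron gradients: {row_spread}")
```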
Why It Matters
This is why random initialization is necessary, not optional. The randomness serves one purpose: break the symmetry between neurons so they can specialize to different features during training. The magnitude of the random initialization then determines signal propagation (Xavier/He), but the randomness itself is the mechanism for expressivity.
Failure Mode
Biases can safely be initialized to zero. They do not cause symmetry problems because different neurons in the same layer share the same bias value but have different weight vectors (due to random weight initialization). The bias just shifts the activation; it does not contribute to the symmetry between neurons. The exception is LSTM forget gate biases, which are typically initialized to 1 to encourage gradient flow at the start of training.
What Happens with All-Zero Initialization
Setting all weights to zero is the most catastrophic initialization. Beyond the symmetry problem, zero weights mean zero activations at every layer (for networks without bias). The gradient of the loss with respect to the weights is also zero (since activations are zero), so gradient descent makes no progress. The network is stuck at its initial state permanently.
With biases but zero weights, the network produces constant output regardless of input. Gradients are nonzero but identical across neurons in each layer, so the symmetry is never broken.
Concrete Examples of Bad Initialization
Variance explosion with naive initialization
Consider a 20-layer ReLU network with width 512 at every layer. Initialize weights as $W_{ij} \sim \mathcal{N}(0, 0.01)$. The variance multiplier per layer is $\frac{512 \times 0.01}{2} = 2.56$ (the factor of 2 accounts for ReLU zeroing negative inputs). After 20 layers: $2.56^{20} \approx 1.5 \times 10^8$. An input with unit variance produces activations with variance around $10^8$ at the output layer, standard deviations around $10^4$. Training is unstable; gradients saturate and the loss blows up within a few steps. A worse setting (e.g., $\sigma^2 = 1$, giving multiplier 256) overflows float32 within about 16 layers.
Reducing to $\sigma^2 = 0.001$: the multiplier becomes $\frac{512 \times 0.001}{2} = 0.256$. After 20 layers: $0.256^{20} \approx 1.4 \times 10^{-12}$. Activations vanish to zero. Gradients are effectively zero. Training makes no progress.
He initialization sets $\sigma^2 = 2/512 \approx 0.0039$. The multiplier is $\frac{512 \times (2/512)}{2} = 1$. After 20 layers: $1^{20} = 1$. Variance is preserved exactly.
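These three regimes can be reproduced in a few lines (widths, depth, seed, and trial count are illustrative; the multiplier tracks the second moment, which is what the He derivation propagates):

```python
import numpy as np

# 20-layer, width-512 ReLU network: per-layer second-moment multiplier
# is (width * sigma^2) / 2.
rng = np.random.default_rng(5)
n, depth = 512, 20

def output_second_moment(sigma2, trials=5):
    out = []
    for _ in range(trials):
        x = rng.standard_normal(n)
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(sigma2), size=(n, n))
            x = np.maximum(W @ x, 0.0)
        out.append(np.mean(x**2))
    return float(np.mean(out))

explode = output_second_moment(0.01)     # multiplier 2.56 -> ~1e8
vanish = output_second_moment(0.001)     # multiplier 0.256 -> ~1e-12
he = output_second_moment(2.0 / n)       # multiplier 1.0 -> ~1

print(f"explode {explode:.3g}, vanish {vanish:.3g}, he {he:.3g}")
```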
GPT-style transformer initialization
Large transformer models use a scaled initialization for residual stream contributions. In a transformer with $L$ layers, each layer adds $f_l(x)$ to the residual stream. If each addition has variance $\sigma^2$, the residual stream variance grows as $L\sigma^2$ after $L$ layers. GPT-2 scales the output projection of each attention and MLP block by $1/\sqrt{N}$, where $N = 2L$ is the total number of residual additions, so the total variance contribution from all layers is $N \cdot \sigma^2 / N = \sigma^2$. This keeps the residual stream variance bounded regardless of depth. Without this scaling, a 96-layer GPT-3 would have activations growing by a factor of roughly $2 \times 96 = 192$ from residual accumulation alone.
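The accumulation argument can be sketched with random stand-ins for block outputs (the stream width, the number of additions, and treating each block output as an independent unit-variance vector are all simplifying assumptions, not the source's model):

```python
import numpy as np

# N residual additions of unit-variance noise grow the stream variance
# linearly; scaling each addition by 1/sqrt(N) keeps it O(1).
rng = np.random.default_rng(6)
d, N = 64, 192  # stream width; residual additions (2 per layer, 96 layers)

def stream_variance(scale):
    x = np.zeros(d)
    for _ in range(N):
        x = x + scale * rng.standard_normal(d)  # stand-in for a block output
    return x.var()

unscaled = stream_variance(1.0)               # ~N
scaled = stream_variance(1.0 / np.sqrt(N))    # ~1

print(f"unscaled {unscaled:.1f}, scaled {scaled:.2f}")
```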
Initialization interacts with residual connections
In a ResNet, the output of each block is $x + F(x)$. If $x$ has variance $v$ and $F(x)$ has variance $\sigma^2$, the output has variance $v + \sigma^2$ (assuming independence). After $L$ blocks: variance is $v_0 + L\sigma^2$. This linear growth (instead of the exponential growth without skip connections) makes ResNets much less sensitive to initialization. But for very deep ResNets (1000+ layers), even linear growth can cause problems. Fixup initialization (Zhang, Dauphin & Ma, 2019) rescales residual branches by $L^{-1/(2m-2)}$, where $m$ is the number of layers inside each block, removing the need for normalization. ReZero (Bachlechner et al., 2020) initializes a learnable scalar $\alpha = 0$ on each residual branch, so the network starts as the identity and gradients propagate unmodified. For transformers, DeepNorm (Wang et al., 2022) scales the residual branch by a depth-dependent constant $\alpha$ and the weight initialization by a constant $\beta$ (both powers of the depth $N$), enabling stable training of 1000-layer Transformers.
Xavier Derivation: Forward and Backward Conditions
The Xavier initialization variance is a compromise between two conflicting requirements.
Forward pass condition. For $y_i = \sum_{j=1}^{n_{in}} W_{ij} x_j$ with independent zero-mean weights and activations:
$$\mathrm{Var}(y_i) = n_{in}\,\sigma^2\,\mathrm{Var}(x_j)$$
Preserving variance requires $\sigma^2 = 1/n_{in}$.
Backward pass condition. The gradient flows as $\frac{\partial \mathcal{L}}{\partial x_j} = \sum_{i=1}^{n_{out}} W_{ij}\,\frac{\partial \mathcal{L}}{\partial y_i}$, where $y = Wx$. By the same argument:
$$\mathrm{Var}\!\left(\frac{\partial \mathcal{L}}{\partial x_j}\right) = n_{out}\,\sigma^2\,\mathrm{Var}\!\left(\frac{\partial \mathcal{L}}{\partial y_i}\right)$$
Preserving gradient variance requires $\sigma^2 = 1/n_{out}$.
These two conditions are incompatible unless $n_{in} = n_{out}$. Xavier's solution is the harmonic mean: $\sigma^2 = \frac{2}{n_{in} + n_{out}}$, which approximately preserves both forward and backward signal magnitudes. For layers where $n_{in} \approx n_{out}$ (common in practice), this is close to both $1/n_{in}$ and $1/n_{out}$.
The uniform variant samples from $\mathcal{U}(-a, a)$ where $a = \sqrt{\frac{6}{n_{in} + n_{out}}}$, since a uniform on $[-a, a]$ has variance $a^2/3$.
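The uniform bound can be verified in one line (layer sizes and seed are illustrative): $a^2/3 = \frac{6}{3(n_{in}+n_{out})} = \frac{2}{n_{in}+n_{out}}$, matching the Gaussian variant.

```python
import numpy as np

# Sampled variance of U(-a, a) with a = sqrt(6 / (n_in + n_out))
# matches the Gaussian Xavier variance 2 / (n_in + n_out).
rng = np.random.default_rng(7)
n_in, n_out = 300, 500
a = np.sqrt(6.0 / (n_in + n_out))
W = rng.uniform(-a, a, size=(n_in, n_out))
target = 2.0 / (n_in + n_out)

print(f"empirical {W.var():.6f} vs target {target:.6f}")
```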
Signal Propagation Theory
Signal propagation theory generalizes the Xavier/He analysis to arbitrary architectures and activation functions. The central question: for a random input $x^{(0)}$, what happens to the distribution of $x^{(l)}$ as $l$ grows?
Define the mean field quantities:
$$q^{(l)} = \frac{1}{n_l}\,\mathbb{E}\big[\|x^{(l)}\|^2\big], \qquad c^{(l)} = \frac{\mathbb{E}\big[x^{(l)} \cdot \tilde{x}^{(l)}\big]}{n_l\, q^{(l)}}$$
where $x^{(l)}$ and $\tilde{x}^{(l)}$ are activations from two different inputs. The quantity $q^{(l)}$ tracks signal magnitude, and $c^{(l)}$ tracks the correlation between representations of different inputs.
For stable training, we need:
- $q^{(l)}$ stays bounded and bounded away from zero (no explosion or collapse).
- $c^{(l)}$ does not converge to 1 (otherwise all inputs produce the same representation, and the network cannot distinguish them).
Both conditions place constraints on the weight variance $\sigma_w^2$ and bias variance $\sigma_b^2$. The critical line $\chi = \sigma_w^2\,\mathbb{E}\big[\phi'(z)^2\big] = 1$ (where $\phi$ is the activation function and $z \sim \mathcal{N}(0, q^*)$ at the fixed point $q^*$) separates the ordered phase (signals collapse) from the chaotic phase (signals explode). Xavier and He initialization place the network near this critical line.
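The phase diagram can be probed by Monte Carlo (a sketch under simplifying assumptions: tanh activation, $\sigma_b^2 = 0$, and $\sigma_w^2$ given in normalized units $n\,\mathrm{Var}(W_{ij})$; the specific values 0.5, 1.0, 2.0 are illustrative). Iterate the length map $q \mapsto \sigma_w^2\,\mathbb{E}[\phi(\sqrt{q}\,z)^2]$ to its fixed point, then evaluate $\chi$ there.

```python
import numpy as np

# Mean-field length map for a tanh network; chi < 1 is the ordered
# phase, chi > 1 the chaotic phase, chi = 1 the edge of chaos.
rng = np.random.default_rng(10)
z = rng.standard_normal(200_000)

def chi_at_fixed_point(sigma_w2, iters=100):
    q = 1.0
    for _ in range(iters):  # iterate the variance (length) map
        q = sigma_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2)
    phi_prime = 1.0 - np.tanh(np.sqrt(q) * z) ** 2  # tanh'(x) = sech^2(x)
    return sigma_w2 * np.mean(phi_prime**2)

ordered = chi_at_fixed_point(0.5)   # q* = 0, chi = 0.5: signals collapse
critical = chi_at_fixed_point(1.0)  # chi ~ 1: edge of chaos
chaotic = chi_at_fixed_point(2.0)   # chi > 1: perturbations grow

print(f"ordered {ordered:.3f}, critical {critical:.3f}, chaotic {chaotic:.3f}")
```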
Connection to NTK Parameterization
The Neural Tangent Kernel (NTK) parameterization scales the output of each layer by $1/\sqrt{n}$, where $n$ is the layer width:
$$x^{(l)} = \frac{1}{\sqrt{n_{l-1}}}\, W^{(l)}\, \phi\big(x^{(l-1)}\big)$$
with $W_{ij} \sim \mathcal{N}(0, 1)$ (unit variance, not scaled). The $1/\sqrt{n}$ factors are built into the architecture rather than the initialization.
Where the factor of 2 lives. The NTK parameterization with per-layer prefactor $1/\sqrt{n}$ preserves forward signal variance for linear activations only: the per-layer variance multiplier is $n \cdot \frac{1}{n} \cdot 1 = 1$. For ReLU, the same scaling halves the activation second moment each layer (since $\mathbb{E}[\mathrm{ReLU}(z)^2] = \frac{1}{2}\mathbb{E}[z^2]$ when $z$ is symmetric), so the network is on the ordered side of the edge of chaos. He initialization compensates by putting the factor of 2 into the weight variance ($W_{ij} \sim \mathcal{N}(0, 2)$ under the NTK parameterization, or equivalently $\sigma^2 = 2/n$ under standard parameterization). In short: the factor of 2 lives in either the weight variance (He/standard) or the per-layer NTK prefactor ($\sqrt{2/n}$ with unit-variance $W_{ij}$), but it must live somewhere whenever ReLU is used. The two conventions describe the same function class; they differ only in how learning-rate scaling interacts with width.
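A sketch of this equivalence at initialization (width, batch, and seed are illustrative): a standard He-initialized ReLU layer and an NTK-style layer with unit-variance weights and a $\sqrt{2/n}$ prefactor produce the same activation second moment.

```python
import numpy as np

# The factor of 2 can live in the weight variance or in the prefactor;
# forward statistics at initialization are identical.
rng = np.random.default_rng(8)
n, batch = 512, 2048
x = rng.standard_normal((batch, n))

W_std = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # He, standard param
h_std = np.maximum(x @ W_std, 0.0)

W_ntk = rng.standard_normal((n, n))                     # unit-variance weights
h_ntk = np.maximum(np.sqrt(2.0 / n) * (x @ W_ntk), 0.0) # prefactor carries the 2

m_std, m_ntk = np.mean(h_std**2), np.mean(h_ntk**2)
print(f"standard {m_std:.3f}, NTK-style {m_ntk:.3f}")   # both near 1
```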
The NTK parameterization is the standard in theoretical analyses of infinite-width networks, where the network's training dynamics converge to a kernel regression with a fixed kernel (the NTK). The practical consequence: understanding initialization in the NTK framework connects finite-width training dynamics to the well-understood theory of kernel methods.
NTK parameterization does not change the function class
The NTK parameterization and the standard parameterization with He-initialized weights represent the same set of functions. The difference is in how the learning rate scales with width. In the NTK parameterization, a gradient step changes the network function by $O(1)$ regardless of width, which makes the infinite-width limit well-defined. In the standard parameterization, you must scale the learning rate as $O(1/n)$ to get the same behavior.
Connection to Random Matrix Theory
At initialization, the weight matrices are random matrices. The product $W^{(L)} \cdots W^{(1)}$ determines how signals propagate. For $W^{(l)}$ with i.i.d. Gaussian entries of variance $1/n$, the expected squared singular value of each factor is 1, so the second moment of the activations is preserved on average; this is exactly the property Xavier/He initialization buys. Crucially, this is a statement about averages, not about the full singular value spectrum.
Variance preservation is not dynamical isometry. For Gaussian or He-initialized weights with ReLU activations, the full singular value spectrum of the input-output Jacobian does not concentrate at 1: even when the second moment is preserved, the singular values spread out and the spectrum has a long tail that grows with depth (Pennington, Schoenholz & Ganguli, 2017). Standard inits keep activations from blowing up on average, but individual input directions can still be amplified or attenuated exponentially. Dynamical isometry, where every singular value of the input-output Jacobian is close to 1, is a strictly stronger condition that requires orthogonal initialization combined with activations whose derivative spectrum is well-behaved (e.g. orthogonal init + tanh near the linear regime). For Gaussian + ReLU, dynamical isometry is unattainable at any variance.
The connection in one line: Xavier/He pick the variance so the mean squared singular value of each factor is 1, which is enough to prevent forward/backward signal blowup on average. Achieving the stronger dynamical isometry property requires a different initialization family.
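The contrast can be seen numerically in the linear case (matrix size, depth, and seed are illustrative): a deep product of variance-$1/n$ Gaussian factors keeps the mean squared singular value near 1 while its spectrum spreads over many orders of magnitude, whereas a product of orthogonal factors keeps every singular value exactly 1.

```python
import numpy as np

# Singular spectra of deep random matrix products:
# Gaussian (variance 1/n) vs orthogonal factors.
rng = np.random.default_rng(9)
n, depth = 200, 30

J_gauss = np.eye(n)
J_orth = np.eye(n)
for _ in range(depth):
    J_gauss = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n)) @ J_gauss
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # Haar orthogonal factor
    J_orth = Q @ J_orth

s_gauss = np.linalg.svd(J_gauss, compute_uv=False)
s_orth = np.linalg.svd(J_orth, compute_uv=False)

mean_sq = np.mean(s_gauss**2)  # ~1: average-case preservation
print(f"Gaussian: mean s^2 {mean_sq:.2f}, "
      f"max s {s_gauss.max():.2f}, min s {s_gauss.min():.2e}")
print(f"Orthogonal: max |s - 1| = {np.abs(s_orth - 1).max():.2e}")
```

The Gaussian product's largest singular value grows with depth and its smallest collapses toward zero, while the orthogonal product stays an exact isometry.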
Common Confusions
Xavier is not wrong for ReLU, it is suboptimal
Xavier initialization with ReLU networks does not cause immediate divergence. It causes gradual signal decay because each ReLU layer halves the variance that Xavier predicts will be preserved. For shallow networks (5-10 layers), the decay is mild. For deep networks (50+ layers), the effect compounds and training fails. He initialization is the correct fix.
Initialization is about the first step, not the entire training
Good initialization ensures stable gradients at step 0. Once training begins, the weights move away from their initial values. Batch normalization and residual connections help maintain stability throughout training, reducing (but not eliminating) the importance of initialization.
Summary
- Bad initialization causes exponential growth or decay of activations and gradients across layers
- Xavier: $\sigma^2 = \frac{2}{n_{in} + n_{out}}$, designed for linear/tanh activations
- He: $\sigma^2 = \frac{2}{n_{in}}$, designed for ReLU (doubles the forward-pass variance to compensate for zeroing negative inputs)
- Both derive from a single principle: preserve variance across layers
- Random matrix theory explains why these scalings keep the mean squared singular value near 1, preventing average-case blowup; the stronger dynamical isometry condition requires orthogonal initialization, not Gaussian/ReLU
Exercises
Problem
A network has 10 hidden layers, each with 512 units and ReLU activations. Compute the activation variance at layer 10 relative to the input variance under (a) Xavier initialization and (b) He initialization.
Problem
Derive the correct initialization variance for a layer using Leaky ReLU with negative slope $\alpha$. How much does it differ from standard He initialization?
Problem
A network has 3 hidden layers with widths 256, 128, 64 and uses tanh activations. Write the Xavier initialization variance for each layer's weight matrix.
Problem
Explain why initializing all weights to the same nonzero constant (e.g., $W_{ij} = c$ for all $i, j$) fails even though the weights are nonzero. How does this differ from initializing weights to a constant but biases randomly?
References
Canonical:
- Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010), AISTATS, Sections 1-4
- He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015), ICCV, Section 2 (initialization derivation)
- LeCun et al., "Efficient BackProp" (1998), in Neural Networks: Tricks of the Trade, Section 4.6 (weight initialization heuristics)
Current:
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (2018), NeurIPS (NTK parameterization)
- Pennington & Worah, "Nonlinear Random Matrix Theory for Deep Learning" (2017), NeurIPS (signal propagation analysis)
- Schoenholz et al., "Deep Information Propagation" (2017), ICLR (mean field theory for deep networks, edge of chaos)
- Huang et al., "Improving Transformer Training with Orthogonal Initialization" (2020)
- Zhang, Dauphin & Ma, "Fixup Initialization: Residual Learning Without Normalization" (2019), ICLR (rescaling residual branches by depth)
- Bachlechner et al., "ReZero is All You Need: Fast Convergence at Large Depth" (2020), arXiv:2003.04887 (learnable zero-init scalar on residual branches)
- Wang et al., "DeepNet: Scaling Transformers to 1,000 Layers" (2022), arXiv:2203.00555 (DeepNorm residual and weight scaling)
- Noci et al., "Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse" (2022), NeurIPS (rank-collapse analysis of transformer init)
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Eigenvalues and Eigenvectors (layer 0 · tier 1)
- Activation Functions (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
Derived topics
- Batch Normalization (layer 2 · tier 1)