Vanishing Gradient Problem
You train a 5-layer sigmoid network on a classification task and watch the loss barely move for the first hundred epochs. The model isn't broken — gradient descent is running, weights are updating. But the first two layers are learning almost nothing. The reason is not a bug you can fix in your code. It is a structural consequence of how sigmoid compresses values, and it plagued deep learning for nearly two decades.
This post makes the problem concrete with numbers. By the end you will know exactly why gradients vanish, how fast they vanish, and what the solutions are.
The Anchor Network
A 5-layer network: one input → four sigmoid hidden layers → one output. All weights initialized to 0.1, all biases to 0. Input x = 1.0, true label y = 1.
    x = 1.0
    y = 1
    # 5-layer network: input → sigmoid → sigmoid → sigmoid → sigmoid → output
    # weights: w = 0.1 at every connection, bias = 0 everywhere

This is deliberately simple: every layer has one neuron, so the gradient arithmetic stays readable. The exponential decay still happens in full force.
Why Sigmoid Causes Gradients to Vanish
The sigmoid function maps any real number to (0, 1):
σ(z) = 1 / (1 + e⁻ᶻ)
Its derivative is:
σ'(z) = σ(z) · (1 − σ(z))
The critical fact: σ'(z) has a maximum value of 0.25, achieved when z = 0. For any z with |z| > 2, σ'(z) is below about 0.105 (σ'(2) ≈ 0.105). For |z| > 4, it drops below 0.018.
To see why 0.25 is the max, take the derivative of σ(z)(1−σ(z)) with respect to z and set it to zero. The maximum occurs at σ(z) = 0.5, which means z = 0. Plugging in: σ'(0) = 0.5 × 0.5 = 0.25.
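Written out, the calculus step above looks like the following. This is a short derivation that uses only the identity σ'(z) = σ(z)(1 − σ(z)) from the text; because σ'(z) is strictly positive, the only way the product can be zero is through the factor 1 − 2σ(z).

```latex
\begin{aligned}
\frac{d}{dz}\,\sigma'(z)
  &= \frac{d}{dz}\Big[\sigma(z)\big(1-\sigma(z)\big)\Big] \\
  &= \sigma'(z)\big(1-\sigma(z)\big) - \sigma(z)\,\sigma'(z) \\
  &= \sigma'(z)\big(1-2\sigma(z)\big) = 0
  \;\Longrightarrow\; \sigma(z) = \tfrac{1}{2}
  \;\Longrightarrow\; z = 0, \\
\sigma'(0) &= \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.
\end{aligned}
```

The sign of 1 − 2σ(z) flips from positive to negative as z crosses 0, so this stationary point is a maximum.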
The flat regions at both ends are the saturation zones. When a neuron's pre-activation z sits in these regions, its gradient contribution is nearly zero. Gradients passing through saturated neurons are multiplied by a tiny number and arrive at earlier layers shrunken.
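A quick numerical check of the peak and the saturation zones can make these numbers tangible. The sketch below is illustrative only: the probe points and the grid range are arbitrary choices, not values from the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Evaluate the derivative at the centre and at increasingly saturated points.
probe_points = [0.0, 2.0, 4.0, 6.0]   # arbitrary probes chosen for illustration
saturation_table = {z: float(sigmoid_grad(z)) for z in probe_points}
for z, g in saturation_table.items():
    print(f"z = {z:>4.1f}   sigma'(z) = {g:.4f}")

# Locate the maximum on a dense grid as a numerical sanity check.
grid = np.linspace(-8.0, 8.0, 1601)
z_at_max = float(grid[np.argmax(sigmoid_grad(grid))])
print(f"grid maximum at z = {z_at_max:.2f}")
```

Because σ'(z) is symmetric around z = 0, the negative side of the curve behaves identically.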
The Forward Pass
Run the anchor network forward with w = 0.1, b = 0 at each layer:
Layer 1: z₁ = 0.1 × 1.0 = 0.1, a₁ = σ(0.1) = 1/(1 + e⁻⁰·¹) = 0.525, σ'(z₁) = 0.525 × 0.475 = 0.249
Layer 2: z₂ = 0.1 × 0.525 = 0.0525, a₂ = σ(0.0525) = 0.513, σ'(z₂) = 0.513 × 0.487 = 0.250
Layer 3: z₃ = 0.1 × 0.513 = 0.0513, a₃ = σ(0.0513) = 0.513, σ'(z₃) = 0.513 × 0.487 = 0.250
Layer 4: z₄ = 0.1 × 0.513 = 0.0513, a₄ = σ(0.0513) = 0.513, σ'(z₄) = 0.513 × 0.487 = 0.250
All four hidden layers produce activations between 0.513 and 0.525 and derivatives around 0.250. The weights are in the safe zone: the activations aren't saturated. This is actually the best-case scenario for sigmoid, because every derivative is near its maximum of 0.25. Even here, the gradient will vanish.
The Backward Pass — Where Gradients Disappear
Backpropagation multiplies local gradients together as it moves backward through the network. At each layer, the gradient shrinks by a factor of (w × σ'(z)):
At layer 4: this factor is 0.1 × 0.250 = 0.025
Starting with an output gradient δ₄ = 1.0 (simplified) and working backward:
δ₃ = δ₄ × w × σ'(z₄) = 1.0 × 0.1 × 0.250 = 0.025
δ₂ = δ₃ × w × σ'(z₃) = 0.025 × 0.1 × 0.250 = 0.000625
δ₁ = δ₂ × w × σ'(z₂) = 0.000625 × 0.1 × 0.250 = 0.0000156
The gradient at layer 1 is:
∂L/∂W₁ = δ₁ × x = 0.0000156 × 1.0 = 1.56 × 10⁻⁵
With learning rate η = 0.01, the first-layer weight update is:
ΔW₁ = −0.01 × 1.56 × 10⁻⁵ = −1.56 × 10⁻⁷
With the same learning rate and δ₄ = 1.0, the weight at layer 4 moves about 0.01 per step. The weight at layer 1 moves 0.000000156 per step, which is 64,000 times slower. If the gradients stayed at these values for 100 updates, the layer-4 weight would move about 1.0 in total, while the first-layer weight would move 0.0000156. The first layer has effectively not trained.
Trace Table
| Layer | z | σ(z) | σ'(z) | Gradient δ |
|---|---|---|---|---|
| 4 (closest to output) | 0.0513 | 0.513 | 0.250 | 1.000 |
| 3 | 0.0513 | 0.513 | 0.250 | 0.025 |
| 2 | 0.0525 | 0.513 | 0.250 | 6.25 × 10⁻⁴ |
| 1 (closest to input) | 0.1 | 0.525 | 0.249 | 1.56 × 10⁻⁵ |
The gradient shrinks by a factor of 40 at each step backward (0.1 × 0.25 = 0.025, and 1/0.025 = 40). Getting from layer 4 back to layer 1 takes three such steps, so the total factor is 0.025³, not 0.025⁴: 1.0 → 0.025 → 0.000625 → 1.5625 × 10⁻⁵. That is a 64,000× reduction across four layers.
Code: Measuring the Decay
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1 - s)

    x = 1.0
    w = 0.1

    # Forward pass: cache z and the local derivative σ'(z) for each layer
    activations = [x]
    z_values = []
    grad_values = []
    for _ in range(4):
        z = w * activations[-1]
        z_values.append(z)
        activations.append(sigmoid(z))
        grad_values.append(sigmoid_grad(z))

    # Backprop, simplified: no loss, just trace gradient flow.
    # Start from δ₄ = 1.0 at layer 4, then multiply by w × σ'(z) per step back,
    # so the list ends up ordered [δ₁, δ₂, δ₃, δ₄].
    delta = 1.0
    deltas = [delta]
    for g in reversed(grad_values[1:]):
        delta = delta * w * g
        deltas.insert(0, delta)

    print("Forward pass:")
    for i, (z, a) in enumerate(zip(z_values, activations[1:]), 1):
        print(f"  Layer {i}: z={z:.4f} σ(z)={a:.4f} σ'(z)={sigmoid_grad(z):.4f}")

    print("\nBackward pass (gradient at each layer):")
    for i, d in enumerate(deltas, 1):
        print(f"  Layer {i}: δ = {d:.2e}")

    print(f"\nWeight update at layer 1 (lr=0.01): ΔW = {0.01 * deltas[0]:.2e}")
    print(f"Weight update at layer 4 (lr=0.01): ΔW = {0.01 * deltas[3]:.2e}")
    print(f"Ratio layer4/layer1: {deltas[3]/deltas[0]:,.0f}x faster")

Output:

    Forward pass:
      Layer 1: z=0.1000 σ(z)=0.5250 σ'(z)=0.2494
      Layer 2: z=0.0525 σ(z)=0.5131 σ'(z)=0.2498
      Layer 3: z=0.0513 σ(z)=0.5128 σ'(z)=0.2498
      Layer 4: z=0.0513 σ(z)=0.5128 σ'(z)=0.2498

    Backward pass (gradient at each layer):
      Layer 1: δ = 1.56e-05
      Layer 2: δ = 6.24e-04
      Layer 3: δ = 2.50e-02
      Layer 4: δ = 1.00e+00

    Weight update at layer 1 (lr=0.01): ΔW = 1.56e-07
    Weight update at layer 4 (lr=0.01): ΔW = 1.00e-02
    Ratio layer4/layer1: 64,128x faster

Layer 4 trains roughly 64,000× faster than layer 1 in a single update. The measured ratio is slightly above 64,000 because each σ'(z) is slightly below 0.25. Scale this to 100 epochs: the output layer has effectively trained; the first layer has barely moved.
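To see how the decay scales with depth, the per-step factor can be raised to the number of backward steps. The helper below is a hypothetical sketch, not part of the original code: it assumes every weight equals 0.1 and every sigmoid derivative sits at its maximum of 0.25, so the result is a best-case upper bound rather than a measurement.

```python
def best_case_attenuation(num_hidden_layers, w=0.1, max_sigmoid_grad=0.25):
    """Upper bound on delta_first / delta_last for a chain of sigmoid neurons.

    Going from the last hidden layer back to the first takes
    (num_hidden_layers - 1) steps, each multiplying by at most w * 0.25.
    """
    steps = num_hidden_layers - 1
    return (w * max_sigmoid_grad) ** steps

# Depths chosen for illustration; 4 matches the anchor network.
for depth in (4, 8, 16):
    factor = best_case_attenuation(depth)
    print(f"{depth:>2} hidden layers: delta_1 <= {factor:.2e} x delta_last")
```

Doubling the depth does far more than double the damage, because the factor sits in the exponent.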
What This Looks Like During Training
In practice, the symptom is a loss curve that stalls immediately and refuses to drop below a certain value for many epochs. The output layer converges quickly; the hidden layers near the input contribute nothing meaningful.
Plot the loss of a 5-layer sigmoid network next to that of a ReLU network trained on the same binary classification task and the pattern is clear. The sigmoid network's loss plateaus near its initialization value and barely moves. The ReLU network's loss drops steadily. Both networks have identical architecture; only the activation function differs.
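The stall can also be reproduced on the anchor network by actually running gradient descent and recording how far each weight travels. This is a minimal sketch under stated assumptions: the squared-error loss L = 0.5 (a₄ − y)², the function name `train_chain`, and the 100-step budget are illustrative choices, not something the post specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_chain(steps=100, lr=0.01, x=1.0, y=1.0, w_init=0.1, n_layers=4):
    """Gradient descent on a chain of single sigmoid neurons (no biases).

    Assumed loss: L = 0.5 * (a_last - y)**2.
    Returns the initial and final weight vectors.
    """
    w = np.full(n_layers, w_init)
    w_start = w.copy()
    for _ in range(steps):
        # Forward pass, caching activations (a[0] is the input).
        a = [x]
        for l in range(n_layers):
            a.append(sigmoid(w[l] * a[-1]))

        # Backward pass: dL/dz for the last neuron, then walk backward.
        grad_w = np.zeros(n_layers)
        dz = (a[-1] - y) * a[-1] * (1.0 - a[-1])
        for l in reversed(range(n_layers)):
            grad_w[l] = dz * a[l]  # dL/dw_l = dL/dz_l * input to layer l
            if l > 0:
                # Propagate to the previous neuron's pre-activation.
                dz = dz * w[l] * a[l] * (1.0 - a[l])
        w -= lr * grad_w
    return w_start, w

w_start, w_end = train_chain()
moved = np.abs(w_end - w_start)
for i, m in enumerate(moved, 1):
    print(f"Layer {i}: weight moved by {m:.2e}")
```

Unlike the simplified trace earlier, this version includes the loss derivative and each layer's input activation, so the exact ratio between layers differs, but the ordering is the same: the closer a weight is to the input, the less it moves.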
The Four Solutions (Brief Preview)
Understanding why gradients vanish points directly to the solutions. Each fix targets a specific cause:
ReLU. Replace σ with max(0, z). For z > 0, the derivative of ReLU is exactly 1, so the activation itself adds no attenuation. Gradients pass through active ReLU neurons unchanged. Covered in depth in the ReLU post.
Xavier / He weight initialization. With weights near zero (like w = 0.1), activations cluster around 0.5 and derivatives are near their max, but the weight magnitude itself (0.1) is the real culprit here, since 0.1 × 0.25 = 0.025 compounds quickly. Scaling the initial weight variance to the layer's size, using both fan-in and fan-out for Xavier or fan-in alone for He, keeps activation variance stable and gradients from collapsing. Covered in the regularization section.
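As a rough illustration of what "scaling to the layer's size" means, the sketch below uses the commonly cited normal-distribution variants: variance 2 / (fan_in + fan_out) for Xavier (also called Glorot) and 2 / fan_in for He. The layer size of 256 and the helper names are assumptions made for the example.

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot normal: Var(w) = 2 / (fan_in + fan_out)
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He normal: Var(w) = 2 / fan_in
    return np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 256  # hypothetical layer size, for illustration only

W_xavier = rng.normal(0.0, xavier_std(fan_in, fan_out), size=(fan_in, fan_out))
W_he = rng.normal(0.0, he_std(fan_in), size=(fan_in, fan_out))

print(f"Xavier target std: {xavier_std(fan_in, fan_out):.4f}, sampled: {W_xavier.std():.4f}")
print(f"He target std:     {he_std(fan_in):.4f}, sampled: {W_he.std():.4f}")
```

The point of both schemes is that the scale depends on the layer's width rather than being a fixed constant like 0.1.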
Batch normalization. Re-centers and rescales the pre-activations within each mini-batch to zero mean and unit variance before the nonlinearity. This pulls pre-activations away from the saturation regions (|z| >> 0) even when weights push z toward the tails. The result: sigmoid's derivative stays closer to 0.25 rather than collapsing toward 0.
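The effect on the sigmoid derivative can be sketched with a single neuron and a small batch. This example covers only the normalization step; the learnable scale and shift that real batch normalization layers apply afterwards are left out, and the batch values are made up to sit deep in the saturated tail.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def normalize_batch(z, eps=1e-5):
    # Core of batch normalization: zero mean, unit variance over the batch.
    # The learnable scale and shift (gamma, beta) are omitted in this sketch.
    return (z - z.mean()) / np.sqrt(z.var() + eps)

# Hypothetical mini-batch of pre-activations for one neuron, all saturated.
z_batch = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
z_normed = normalize_batch(z_batch)

mean_grad_before = float(sigmoid_grad(z_batch).mean())
mean_grad_after = float(sigmoid_grad(z_normed).mean())

print(f"mean sigma'(z) before normalization: {mean_grad_before:.4f}")
print(f"mean sigma'(z) after normalization:  {mean_grad_after:.4f}")
```

After normalization the batch is centred on z = 0, which is exactly where the derivative peaks.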
Residual connections (skip connections). Add the input directly to the output: output = F(x) + x. The + x term creates a gradient highway: ∂L/∂x = ∂L/∂output · (F'(x) + 1). Even if F'(x) vanishes, the gradient from the skip path (+1) flows unimpeded. This is the architectural fix behind ResNets, which train at depths of 152 layers, far beyond the 5 layers at which the plain sigmoid network above already stalls.
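The F'(x) + 1 factor can be checked on the same kind of scalar chain used throughout this post. The sketch below is an assumption-laden toy: each block is F(a) = sigmoid(w · a) with w = 0.1, and the function name `chain_gradient` is made up for the example. It differentiates the final output with respect to the input, so it multiplies four factors rather than the three in the trace table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(x, w=0.1, n_layers=4, skip=False):
    """d(output)/d(input) through n_layers scalar blocks.

    Plain block:    a_next = sigmoid(w * a)
    Residual block: a_next = sigmoid(w * a) + a
    """
    a, grad = x, 1.0
    for _ in range(n_layers):
        s = sigmoid(w * a)
        local = w * s * (1.0 - s)  # F'(a) for F(a) = sigmoid(w * a)
        if skip:
            local += 1.0           # the identity path contributes +1
            a = s + a
        else:
            a = s
        grad *= local
    return grad

plain = chain_gradient(1.0, skip=False)
residual = chain_gradient(1.0, skip=True)
print(f"plain chain:    d(out)/d(in) = {plain:.2e}")
print(f"residual chain: d(out)/d(in) = {residual:.2e}")
```

In the residual version every per-block factor is at least 1, so the product cannot shrink toward zero no matter how many blocks are stacked.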
Related Concepts
Where this builds from: The vanishing gradient is a consequence of backpropagation (covered in the previous section) and the chain rule. Each layer's local gradient multiplies into the running product; sigmoid's derivative is the weak link. You also need to understand sigmoid's saturation behavior — covered next in the sigmoid post.
Where this leads: ReLU (next post) is the primary fix. Weight initialization and batch normalization are more surgical fixes covered in the regularization section. The vanishing gradient problem is also the main reason LSTMs were invented — they use gating mechanisms to carry gradients over many timesteps without compounding decay.
Honest Limitations
With fewer than 3 hidden layers and well-scaled weights, sigmoid networks often work fine — the 0.25 factor doesn't compound enough times to cause serious damage, and ReLU is not strictly necessary.
The vanishing gradient problem is sometimes confused with the exploding gradient problem. Both arise from the same compounding mechanism. Vanishing happens when the per-layer factor |w × σ'(z)| stays below 1, so the product shrinks toward zero; exploding happens when that factor stays above 1, so the product grows without bound. With sigmoid, exploding requires |w| > 4, because σ'(z) never exceeds 0.25. The two problems require different fixes.
ReLU solves the vanishing gradient for positive activations, but introduces "dead ReLU" neurons — units where z ≤ 0 for all training inputs, causing their gradient to be exactly zero permanently. This is a different but related stability problem, covered in the ReLU post.
Test Your Understanding
- The sigmoid derivative σ'(z) has a maximum of 0.25. Derive this maximum algebraically from σ'(z) = σ(z)(1−σ(z)) without using calculus — use the AM-GM inequality or complete the square instead. At what value of σ(z) is the maximum achieved?
- In the anchor network, the gradient at layer 1 is 1.56 × 10⁻⁵ and at layer 4 is 1.0. If you added a 5th hidden layer between the input and layer 1, what would the gradient at the new first layer be? What about a 6th layer?
- A 4-layer network uses sigmoid but initializes weights to w = 0.5 instead of 0.1. Compute the gradient at layer 1 (assume σ'(z) ≈ 0.20 at every layer). Is the vanishing problem better or worse? What if w = 2.0?
- Batch normalization is described as keeping activations "away from saturation." Why does the saturation region (|z| > 3) produce near-zero gradients even when the weight is large (e.g., w = 10)? Show this numerically with σ'(5) and compare to σ'(0.5).
- You replace sigmoid with ReLU in the anchor network (same weights, same input). Compute the gradient at layer 1. Then replace all but the output layer with ReLU (keep sigmoid at output for probabilistic output). Does the vanishing problem disappear entirely?