A plain recurrent neural network carries information forward one hidden state at a time — each step blends the current input with whatever the previous step remembered. That works for short sequences. It quietly breaks for long ones, because the same mechanism that lets information flow forward also forces gradients to flow backward through a long chain of multiplications during training — and multiplying many numbers smaller than 1 together drives the product toward zero. LSTM exists specifically to fix that chain.
Anchor: 5 days of stock closing prices, predicting day 6.

    sequence = [100, 102, 105, 103, 108]  # closing prices

The task needs memory across all 5 past values: an early market shock on day 1 might still matter for day 6's prediction.
Vanilla RNN: What It Does
A vanilla RNN's hidden state update: hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b)
Each step takes the current input xₜ and the previous hidden state hₜ₋₁, combines them linearly, and squashes the result through tanh. Using Wₕ = 0.5, Wₓ = 0.01, b = 0, h₀ = 0:

    t=1: z₁ = 0.5(0)      + 0.01(100) = 1.0000 → h₁ = tanh(1.0000) = 0.7616
    t=2: z₂ = 0.5(0.7616) + 0.01(102) = 1.4008 → h₂ = tanh(1.4008) = 0.8855
    t=3: z₃ = 0.5(0.8855) + 0.01(105) = 1.4928 → h₃ = tanh(1.4928) = 0.9038
    t=4: z₄ = 0.5(0.9038) + 0.01(103) = 1.4819 → h₄ = tanh(1.4819) = 0.9018
    t=5: z₅ = 0.5(0.9018) + 0.01(108) = 1.5309 → h₅ = tanh(1.5309) = 0.9106
h₅ is what a prediction head would read to forecast day 6 — it's supposed to summarize everything from x₁ through x₅.
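The hand calculation above can be reproduced in a few lines. This is a minimal sketch using the scalar weights from the worked example; the function name `rnn_forward` is a hypothetical helper, not something defined elsewhere in this series.

```python
import math

# Scalar weights taken from the worked example above.
W_h, W_x, b = 0.5, 0.01, 0.0
sequence = [100, 102, 105, 103, 108]  # closing prices

def rnn_forward(xs, h0=0.0):
    """Return the hidden state after each timestep of a scalar tanh RNN."""
    h = h0
    states = []
    for x in xs:
        z = W_h * h + W_x * x + b  # linear blend of memory and current input
        h = math.tanh(z)           # squash into (-1, 1)
        states.append(h)
    return states

hidden_states = rnn_forward(sequence)
for t, h in enumerate(hidden_states, start=1):
    print(f"t={t}: h={h:.4f}")
```

The last entry of `hidden_states` is h₅, the value a prediction head would read.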
The Vanishing Gradient in RNNs (BPTT)
Training an RNN uses Backpropagation Through Time (BPTT): unroll the recurrence into T copies, then backpropagate as if it were a T-layer feedforward network with shared weights. The gradient reaching the earliest hidden state is a product of T local gradients:
∂L/∂h₀ = ∂L/∂h_T × Π₍ₜ₌₁₎^T ∂hₜ/∂hₜ₋₁
Each factor: ∂hₜ/∂hₜ₋₁ = Wₕ × tanh'(zₜ), and tanh' is always ≤ 1.0 — so every factor in that product is at most |Wₕ|, typically well under 1.
If each factor happens to equal 0.7: with T=5, the cumulative product is 0.7⁵ ≈ 0.168, already a 6× shrinkage after just 5 steps. Extrapolate to T=100 (a sequence 100 steps long): 0.7¹⁰⁰ ≈ 3×10⁻¹⁶, so the gradient reaching the first timestep is practically indistinguishable from zero and that timestep receives essentially no training signal.
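To get a feel for how quickly the product collapses, the extrapolation can be tabulated for a few sequence lengths. The constant factor of 0.7 is the same hypothetical value used in the text; real local gradients vary from step to step.

```python
# Hypothetical constant per-step factor; real factors differ at every step.
factor = 0.7

def surviving_gradient(per_step_factor, steps):
    """Fraction of the gradient left after `steps` identical local factors."""
    return per_step_factor ** steps

for T in (5, 20, 50, 100):
    print(f"T={T:>3}: {factor}^T = {surviving_gradient(factor, T):.3e}")
```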
The anchor sequence shows the same effect at smaller scale. Using the actual local gradients from the forward pass above (Wₕ × tanh'(zₜ) at each step), the cumulative BPTT product from t=5 back to t=1 is:

    t=5: local=0.0854, cumulative=0.085421
    t=4: local=0.0934, cumulative=0.007974
    t=3: local=0.0915, cumulative=0.000730
    t=2: local=0.1079, cumulative=0.000079
    t=1: local=0.2100, cumulative=0.000017

By t=1, the cumulative gradient has shrunk to 0.000017, roughly 5,000 times smaller than at t=5 (between 3 and 4 orders of magnitude). Day 1's price barely influences the weight update at all, even though the task explicitly needs it.
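One way to sanity-check the cumulative product is to compare it against a finite-difference estimate of how much h₅ moves when h₀ is nudged. The sketch below assumes the same scalar weights as the worked example; the step size `eps` and the helper names are arbitrary choices made for illustration.

```python
import math

W_h, W_x = 0.5, 0.01
sequence = [100, 102, 105, 103, 108]

def final_state(h0):
    """Run the scalar tanh RNN and return h5 for a given initial state h0."""
    h = h0
    for x in sequence:
        h = math.tanh(W_h * h + W_x * x)
    return h

def analytic_product(h0=0.0):
    """Product of the five local gradients W_h * tanh'(z_t)."""
    h, product = h0, 1.0
    for x in sequence:
        h = math.tanh(W_h * h + W_x * x)
        product *= W_h * (1.0 - h * h)  # tanh'(z) = 1 - tanh(z)^2
    return product

eps = 1e-4  # central-difference step size (an arbitrary small value)
numeric = (final_state(eps) - final_state(-eps)) / (2 * eps)
analytic = analytic_product()

print(f"analytic product : {analytic:.8f}")
print(f"finite difference: {numeric:.8f}")
```

If the two numbers agree, the chain of local gradients really is the sensitivity of the final hidden state to the initial one.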
Concrete Failure Case
Imagine predicting today's stock price where the critical signal is a market crash that happened 90 days ago, a genuine long-range dependency. A vanilla RNN with tanh activations, following the same math above, has a gradient at lag 90 that has been multiplied by about 90 factors, each under 1, which leaves it effectively zero. Trained on data with this dependency, a vanilla RNN tends to produce predictions that look close to random with respect to any signal beyond a short lag (on the order of 10 steps): it can pick up short-range correlations (yesterday predicts today reasonably well), but the crash 90 days back is unlikely to be learned, because its gradient does not survive the backward pass. An LSTM trained on the same data is far better placed to learn the dependency, because its gradient path (below) doesn't shrink the same way.
What LSTM Solves
LSTM introduces a second recurrent path — the cell state Cₜ — that runs alongside the hidden state like a conveyor belt. Information can flow along the cell state across many timesteps with only minor, additive modifications, controlled by learned gates that decide what to keep, what to add, and what to expose as output.
The critical difference: the cell state's update is dominated by addition (Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ), not repeated multiplication by a weight matrix and a squashing derivative. Along this direct path the local gradient ∂Cₜ/∂Cₜ₋₁ is just the forget gate fₜ, which the network can learn to hold close to 1, so gradients flowing backward don't shrink multiplicatively the way they do through the vanilla RNN's tanh(Wₕhₜ₋₁ + ...) chain. That's the whole fix: not a bigger network, a different gradient path.
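A toy calculation makes the contrast concrete. The sketch below applies the cell state update Cₜ = fₜ·Cₜ₋₁ + iₜ·C̃ₜ for 90 steps with hand-picked, fixed gate values; these are assumptions for illustration only, since a trained LSTM computes its gates from xₜ and hₜ₋₁ at every step, and the indirect dependence of the gates on earlier states is ignored here.

```python
import math

T = 90  # lag from the market-crash example

# Hand-picked gate values held constant for illustration (not learned).
forget = [0.99] * T     # forget gate f_t, close to 1
inp = [0.10] * T        # input gate i_t
candidate = [0.50] * T  # candidate value C~_t

def run_cell_state(c0):
    """Apply C_t = f_t * C_{t-1} + i_t * C~_t for T steps."""
    c = c0
    for f_t, i_t, c_tilde in zip(forget, inp, candidate):
        c = f_t * c + i_t * c_tilde
    return c

# With fixed gates, C_T is linear in C_0, so this difference is the exact slope.
slope = run_cell_state(1.0) - run_cell_state(0.0)
product_of_forget_gates = math.prod(forget)
vanilla = 0.7 ** T  # same lag through a vanilla path with factor 0.7

print(f"dC_T/dC_0 along the cell state : {slope:.4f}")
print(f"product of forget gates        : {product_of_forget_gates:.4f}")
print(f"vanilla RNN factor 0.7^{T}      : {vanilla:.3e}")
```

The additive terms iₜ·C̃ₜ do not appear in the slope at all; only the forget gates do, which is why keeping them near 1 preserves the gradient over long lags.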
When to Use LSTM vs Other Sequence Models
| Sequence length | Pattern type | Recommended |
|---|---|---|
| Short (<20) | Simple | Vanilla RNN |
| Medium (20–200) | Complex long-range | LSTM |
| Long (200+) | Complex | LSTM + attention or Transformer |
| Variable, parallel | Any | Transformer |
Hyperparameter Sensitivity: Recurrent Weight Wₕ
Wₕ controls how strongly the previous hidden state feeds into the next one — it's the single number that determines whether a vanilla RNN vanishes, and it's the natural knob to sweep on the anchor sequence.
    import numpy as np

    sequence = [100, 102, 105, 103, 108]

    def tanh_grad(z):
        # Derivative of tanh: 1 - tanh(z)^2
        return 1 - np.tanh(z)**2

    def cumulative_grad(Wh):
        # Forward pass: record the local gradient Wh * tanh'(z_t) at each step
        h = 0.0
        grads = []
        for x in sequence:
            z = Wh * h + 0.01 * x
            h = np.tanh(z)
            grads.append(Wh * tanh_grad(z))
        # Backward pass: multiply the local gradients from t=5 back to t=1
        cumulative = 1.0
        for g in reversed(grads):
            cumulative *= g
        return cumulative, h

    for Wh in [0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5, 2.0]:
        cum, h5 = cumulative_grad(Wh)
        print(f"Wh={Wh}: cumulative_grad_at_t1={cum:.8f}, h5={h5:.4f}")

Output:

    Wh=0.1: cumulative_grad_at_t1=0.00000006, h5=0.8212
    Wh=0.3: cumulative_grad_at_t1=0.00000464, h5=0.8711
    Wh=0.5: cumulative_grad_at_t1=0.00001654, h5=0.9106
    Wh=0.7: cumulative_grad_at_t1=0.00002231, h5=0.9395
    Wh=0.9: cumulative_grad_at_t1=0.00001845, h5=0.9595
    Wh=1.2: cumulative_grad_at_t1=0.00000833, h5=0.9781
    Wh=1.5: cumulative_grad_at_t1=0.00000263, h5=0.9881
    Wh=2.0: cumulative_grad_at_t1=0.00000025, h5=0.9957

The gradient at t=1 stays vanishingly small across the entire range: it never climbs above about 2.2×10⁻⁵, whether Wₕ is 0.1 or 2.0. That's not a coincidence. As Wₕ grows, the pre-activation zₜ grows too, which pushes hₜ deeper into tanh's saturated region, where tanh'(zₜ) approaches 0. The larger weight is offset by the smaller derivative, and in this sweep more than offset once Wₕ passes roughly 0.7. This is why "just increase the weights" isn't a fix for vanishing gradients in a tanh-based RNN: the same saturation that causes vanishing in the first place absorbs any increase in Wₕ. Genuine exploding gradients from large recurrent weights show up mainly in networks without a bounded activation, or when the input scale keeps zₜ away from saturation; neither is the case here.
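The closing point about input scale can be illustrated with a contrived comparison. The sketch below runs the same scalar RNN on the anchor prices and on an all-zero sequence, a hypothetical input chosen only because it keeps zₜ at exactly 0, where tanh' is 1. The helper name `bptt_product` is made up for this example.

```python
import math

def bptt_product(Wh, xs, Wx=0.01, h0=0.0):
    """Product of local gradients Wh * tanh'(z_t) over the whole sequence."""
    h, product = h0, 1.0
    for x in xs:
        h = math.tanh(Wh * h + Wx * x)
        product *= Wh * (1.0 - h * h)  # tanh'(z) = 1 - tanh(z)^2
    return product

prices = [100, 102, 105, 103, 108]  # anchor sequence: drives tanh into saturation
zeros = [0, 0, 0, 0, 0]             # hypothetical input: z_t stays at 0, no saturation

for Wh in (0.5, 2.0):
    print(f"Wh={Wh}: prices -> {bptt_product(Wh, prices):.8f}, "
          f"zeros -> {bptt_product(Wh, zeros):.4f}")
```

With the zero input and Wₕ = 2.0, every local factor equals Wₕ and the product grows to 2⁵ = 32, whereas the same weight on the price sequence still vanishes.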
Code
    import numpy as np

    # Simulate vanishing gradient in vanilla RNN
    Wh = 0.5  # single-dim for illustration

    def tanh_grad(z):
        return 1 - np.tanh(z)**2

    sequence = [100, 102, 105, 103, 108]
    h = 0.0
    grads = []
    for x in sequence:
        z = Wh * h + 0.01 * x
        h = np.tanh(z)
        grads.append(Wh * tanh_grad(z))

    # Cumulative gradient (BPTT)
    cumulative = 1.0
    print("Gradient per step and cumulative (BPTT):")
    for i, g in enumerate(reversed(grads)):
        cumulative *= g
        print(f"  t={len(sequence)-i}: local={g:.4f}, cumulative={cumulative:.6f}")

Output:

    Gradient per step and cumulative (BPTT):
      t=5: local=0.0854, cumulative=0.085421
      t=4: local=0.0934, cumulative=0.007974
      t=3: local=0.0915, cumulative=0.000730
      t=2: local=0.1079, cumulative=0.000079
      t=1: local=0.2100, cumulative=0.000017

Related Concepts
Where this builds from: The vanishing gradient problem was introduced in the activation functions section (post 01) in the context of deep feedforward networks and tanh/sigmoid saturation — BPTT is the same phenomenon applied along the time dimension instead of the layer dimension.
Where this leads: The next post breaks open the LSTM cell itself — the forget gate, input gate, cell state update, and output gate that together create the addition-based gradient path described here. GRU (post 08 of this section) offers a simpler, cheaper alternative once the LSTM mechanics are understood.
Honest Limitations
With sequences longer than roughly 100 timesteps, a vanilla RNN's gradient at the first timestep shrinks to effectively zero (0.7¹⁰⁰ ≈ 3×10⁻¹⁶ in the example above), so that timestep contributes essentially nothing to learning, regardless of how important it is to the task. This isn't a training-hyperparameter problem: the recurrent weights are shared across timesteps, so a larger learning rate scales the much larger gradients from recent steps by the same factor, and the vanished contribution stays negligible by comparison.
Vanilla RNNs (and LSTMs) only see the past — they cannot use future context in the sequence unless run bidirectionally. For tasks like part-of-speech tagging, where the correct label for a word can depend on words that come after it, a unidirectional RNN or LSTM is structurally blind to information it needs; a bidirectional RNN or a Transformer with full self-attention is required.
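The bidirectional idea can be sketched with the same scalar RNN: run one pass left to right, a second pass right to left, and pair the two states at each position. Reusing the same weights for both directions is a simplifying assumption made here; bidirectional networks normally learn separate weights per direction.

```python
import math

# Weights reused from the worked example for both directions (an assumption).
W_h, W_x = 0.5, 0.01
sequence = [100, 102, 105, 103, 108]

def run_rnn(xs):
    """Scalar tanh RNN; returns the hidden state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(W_h * h + W_x * x)
        states.append(h)
    return states

def run_bidirectional(xs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # process right to left, then realign
    return list(zip(forward, backward))

states = run_bidirectional(sequence)
for t, (fwd, bwd) in enumerate(states, start=1):
    print(f"t={t}: forward={fwd:.4f}, backward={bwd:.4f}")
```

At position 1 the forward state has seen only x₁, while the backward state has seen all five inputs, which is the future context a unidirectional model lacks.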
Test Your Understanding
- Why does replacing repeated multiplication with addition in the cell state update path prevent the gradient from vanishing, even though the LSTM still uses tanh and sigmoid activations internally?
- Using the local gradient values from the anchor's code output (0.0854, 0.0934, 0.0915, 0.1079, 0.2100 at t=5 through t=1), verify by hand that the cumulative product at t=1 matches 0.000017. What would the cumulative value be if the sequence were extended to 20 timesteps with the same average local gradient (~0.1)?
- A vanilla RNN trained on the 90-day-lag stock example produces near-random predictions for the crash signal but reasonable one-step-ahead predictions. Why does the same vanishing-gradient mechanism that breaks the 90-day dependency not equally break the 1-day dependency?
- Suppose Wₕ = 1.5 instead of 0.5 in the vanilla RNN forward pass. Would the vanishing gradient problem still occur on the anchor sequence, or would exploding gradients emerge instead? Justify your answer using the Wₕ sweep results.
- A colleague argues that simply increasing the learning rate should compensate for a vanishing gradient, since a smaller gradient times a larger learning rate could produce the same weight update magnitude. Why does this reasoning fail once the gradient has shrunk to something like 0.000017 or smaller?