Adagrad accumulates squared gradients forever. After enough training steps, the effective learning rate shrinks to zero for all parameters — the model freezes before reaching the minimum. RMSProp (Root Mean Square Propagation) fixes this with a single change: replace the running sum G_t with an exponentially weighted moving average E[g²]_t. Old gradients fade out; only recent gradient history matters.
Same anchor as before: w₁ (dense, a gradient at every step) and w₂ (sparse, a nonzero gradient on only some steps; 2 of the 5 used below). The gradient sequence is the same as in the Adagrad post.
Adagrad's Problem in One Paragraph
Adagrad: G_t = G_{t-1} + g_t². G grows monotonically. After t steps: G ≈ t × (mean g²), so the effective lr = η/√G shrinks like η/√t. With mean g² ≈ 1, at t=10,000 the lr is down to 1% of η; at t=100,000, 0.3% of η. Adagrad was designed for convex optimization, where this decay toward zero is acceptable. Deep neural networks are trained for hundreds of thousands of steps on non-convex, non-stationary loss surfaces, so learning must continue throughout training. RMSProp solves this.
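The decay figures above can be checked with a few lines. This is a minimal sketch that assumes every squared gradient equals `mean_sq_grad` (so G = t × mean g² exactly); the function name is illustrative, not from any library.

```python
import math

def adagrad_lr_fraction(t, mean_sq_grad=1.0):
    """Fraction of the base learning rate left after t Adagrad steps.

    Assumes G_t = t * mean_sq_grad, i.e. a constant squared gradient.
    This is an idealization, not a property of real training runs.
    """
    return 1.0 / math.sqrt(t * mean_sq_grad)

for t in (100, 10_000, 100_000):
    print(f"t={t:>7}: effective lr = {adagrad_lr_fraction(t):.2%} of eta")
```

Changing `mean_sq_grad` rescales every fraction by the same factor; the 1/√t shape of the decay stays the same.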
The Fix — Exponential Moving Average
Instead of summing all squared gradients, RMSProp maintains a decaying average:
E[g²]_t = β · E[g²]_{t−1} + (1−β) · g_t²
θ ← θ − (η / √(E[g²]_t + ε)) · g_t
β = 0.9 (default). Compare:
- Adagrad: G_t = G_{t-1} + g_t² (running sum, never forgets)
- RMSProp: E_t = β·E_{t-1} + (1−β)·g_t² (exponential average, old info fades)
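The two update equations translate almost line for line into code. The following is a minimal sketch, not a library implementation: `rmsprop_step` is a hypothetical helper name, ε is placed inside the square root as in the update rule above, and η=0.1 is just an illustrative value.

```python
import numpy as np

def rmsprop_step(w, E, g, eta=0.1, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new (w, E) without mutating the inputs."""
    E = beta * E + (1.0 - beta) * g ** 2   # decaying average of squared gradients
    w = w - eta / np.sqrt(E + eps) * g     # per-parameter scaled gradient step
    return w, E

# Usage: two parameters, one step, starting from zero state.
w = np.zeros(2)
E = np.zeros(2)
w, E = rmsprop_step(w, E, np.array([0.5, 0.0]))
print("w =", w)
print("E =", E)
```

Returning new arrays instead of updating in place keeps the function easy to test; an in-place version would behave the same numerically.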
The weight given to a gradient k steps in the past is (1−β)·β^k. For β=0.9, the gradient 10 steps back has weight 0.1 × 0.9¹⁰ ≈ 0.035, and the gradient 50 steps back has weight 0.1 × 0.9⁵⁰ ≈ 0.0005, effectively zero. RMSProp therefore behaves as if it had a window of ~1/(1−β) = 10 steps.
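To see how quickly the weights fall off, they can be tabulated directly. This sketch only evaluates (1−β)·β^k for β=0.9; the cutoff of 200 steps is an arbitrary choice for illustration.

```python
import numpy as np

beta = 0.9
k = np.arange(200)                  # age of each gradient, in steps
weights = (1 - beta) * beta ** k    # weight on the gradient k steps back

print(f"weight at k=10: {weights[10]:.4f}")
print(f"weight at k=50: {weights[50]:.6f}")
print(f"share held by the 10 most recent gradients: {weights[:10].sum():.3f}")
print(f"total weight over 200 steps: {weights.sum():.6f}")
```

The 10 most recent gradients hold 1 − 0.9¹⁰ ≈ 65% of the total weight, so the ~10-step window is a rule of thumb rather than a hard cutoff.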
Why Exponential Average Prevents lr Decay
With a constant gradient g at every step, E[g²] converges to a steady-state value:
E_steady = (1−β)·g² × (1 + β + β² + ...) = (1−β)·g² × (1/(1−β)) = g²
Effective lr at steady state: η / √(g² + ε) ≈ η/|g|, a constant.
Adagrad at step t: η/√(t·g²) = η/(|g|√t) → 0. RMSProp at step t: η/|g|, stable.
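The steady-state argument can be checked numerically. This is a small sketch under the same assumption as the derivation (one constant gradient, here g=0.5, chosen for illustration); it tracks both accumulators side by side.

```python
import math

eta, beta, eps = 0.1, 0.9, 1e-8
g = 0.5                 # constant gradient (assumed for this check)
E, G = 0.0, 0.0         # RMSProp average and Adagrad sum

for t in range(1, 1001):
    E = beta * E + (1 - beta) * g ** 2   # converges toward g**2
    G = G + g ** 2                       # grows linearly with t
    if t in (1, 10, 100, 1000):
        lr_rms = eta / math.sqrt(E + eps)
        lr_ada = eta / math.sqrt(G + eps)
        print(f"t={t:>4}: RMSProp lr={lr_rms:.4f}  Adagrad lr={lr_ada:.4f}")
```

RMSProp's effective lr settles at η/|g| while Adagrad's keeps falling like 1/√t.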
Numerical Walkthrough (3 Steps, 2 Parameters)
Initial: w₁=0, w₂=0, E₁=0, E₂=0, η=0.1, β=0.9, ε=1e-8.
Step 1: g₁=0.5, g₂=0.0
E₁ = 0.9×0 + 0.1×0.5² = 0.1×0.25 = 0.025
E₂ = 0.9×0 + 0.1×0² = 0.000
lr₁ = 0.1 / √(0.025 + 1e-8) = 0.1 / 0.158 = 0.632
w₁ = 0 − 0.632 × 0.5 = −0.316 (Adagrad Step 1 gave −0.100 — RMSProp step is 3× larger)
Step 2: g₁=0.4, g₂=0.3
E₁ = 0.9×0.025 + 0.1×0.4² = 0.0225 + 0.016 = 0.0385
E₂ = 0.9×0.000 + 0.1×0.3² = 0 + 0.009 = 0.009
lr₁ = 0.1 / √0.0385 = 0.1 / 0.196 = 0.510
lr₂ = 0.1 / √0.009 = 0.1 / 0.095 = 1.054
w₁ = −0.316 − 0.510×0.4 = −0.316 − 0.204 = −0.520
w₂ = 0 − 1.054×0.3 = −0.316
Step 3: g₁=0.3, g₂=0.0
E₁ = 0.9×0.0385 + 0.1×0.09 = 0.0347 + 0.009 = 0.0437
Compare with Adagrad Step 3 G₁=0.500 vs RMSProp E₁=0.0437: RMSProp's accumulator is 11× smaller because each new g² enters with weight (1−β) = 0.1 and older terms decay, instead of every term being summed at full weight.
lr₁ = 0.1 / √0.0437 = 0.1 / 0.209 = 0.479
w₁ = −0.520 − 0.479×0.3 = −0.520 − 0.144 = −0.664
Compare: Adagrad Step 3 lr₁ = 0.141. RMSProp Step 3 lr₁ = 0.479, which is 3.4× larger, because E₁ is an average of recent g² rather than a growing sum.
Trace Table
| Step | g₁ | g₂ | E₁ | E₂ | lr₁ | lr₂ | w₁ | w₂ |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.0 | 0.0250 | 0.0000 | 0.632 | ~1000 | −0.316 | 0.000 |
| 2 | 0.4 | 0.3 | 0.0385 | 0.0090 | 0.510 | 1.054 | −0.520 | −0.316 |
| 3 | 0.3 | 0.0 | 0.0437 | 0.0081 | 0.479 | 1.111 | −0.664 | −0.316 |
Compare the trend of lr₁: 0.632 → 0.510 → 0.479. Still decreasing, but stabilizing. Adagrad's lr₁: 0.200 → 0.156 → 0.141 → ... → 0. RMSProp's lr stabilizes around η/√(steady-state E) rather than going to zero.
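The lr₁ comparison can be reproduced with a short loop. This sketch uses the three w₁ gradients from the walkthrough and assumes Adagrad's denominator is √(G + ε), with the same η and ε for both methods.

```python
import math

eta, beta, eps = 0.1, 0.9, 1e-8
g1_sequence = [0.5, 0.4, 0.3]   # w1 gradients from Steps 1-3

G = 0.0   # Adagrad running sum
E = 0.0   # RMSProp moving average
adagrad_lrs, rmsprop_lrs = [], []
for g in g1_sequence:
    G += g ** 2
    E = beta * E + (1 - beta) * g ** 2
    adagrad_lrs.append(eta / math.sqrt(G + eps))
    rmsprop_lrs.append(eta / math.sqrt(E + eps))

for step, (a, r) in enumerate(zip(adagrad_lrs, rmsprop_lrs), start=1):
    print(f"step {step}: Adagrad lr1={a:.3f}  RMSProp lr1={r:.3f}  ratio={r / a:.2f}")
```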
RMSProp vs Adagrad
| Property | Adagrad | RMSProp |
|---|---|---|
| Gradient history | Full sum: G = Σ g² | Exp. avg: E = β·E + (1−β)·g² |
| lr over time | → 0 always | Stabilizes around η/√(E_steady) |
| Memory of old gradients | Permanent | Fades by factor β per step |
| Best for | Sparse, convex, short runs | Non-convex, deep nets, long runs |
| Hyperparameters | η | η, β |
| Typical defaults | η=0.01 | η=0.001, β=0.9 |
β Sensitivity
```python
import numpy as np

gradients = [(0.5, 0.0), (0.4, 0.3), (0.3, 0.0), (0.5, 0.2), (0.4, 0.0)]

def train_rmsprop(beta):
    """Run 5 RMSProp steps from zero state and return the final weights."""
    w = np.array([0.0, 0.0])
    E = np.array([0.0, 0.0])
    eta, eps = 0.1, 1e-8
    for g1, g2 in gradients:
        g = np.array([g1, g2])
        E = beta * E + (1 - beta) * g ** 2   # decaying average of squared gradients
        lr = eta / np.sqrt(E + eps)          # eps inside the root, matching the update rule
        w -= lr * g
    return w

print(f"{'β':>5} | {'w1 after 5 steps':>18} | {'w2 after 5 steps':>18} | Notes")
print("-" * 70)
for beta, note in [(0.5, "fast reaction (noisy)"), (0.9, "default (balanced)"), (0.99, "slow reaction to changes")]:
    w = train_rmsprop(beta)
    print(f"{beta:>5.2f} | {w[0]:>18.4f} | {w[1]:>18.4f} | {note}")
```

Output:

```
    β |   w1 after 5 steps |   w2 after 5 steps | Notes
----------------------------------------------------------------------
 0.50 |            -0.5488 |            -0.2546 | fast reaction (noisy)
 0.90 |            -1.0081 |            -0.5045 | default (balanced)
 0.99 |            -3.0593 |            -1.5586 | slow reaction to changes
```

Over these five steps, a larger β moves the weights further. That is a start-up effect, not a sign of faster adaptation: E starts at 0 and each update adds only (1−β)·g², so a large β underestimates E early on and inflates the effective lr (see the bias-correction point under Honest Limitations).
β=0.5 reacts aggressively to new gradients: the recent history is short (~2 steps), so the effective lr adjusts quickly. This can cause oscillation if gradient magnitudes change frequently.
β=0.99 reacts very slowly — the ~100-step window means a new pattern takes 100 steps to fully register. This is stable but will underreact to distribution shifts in the loss landscape.
β=0.9 (default) maintains a ~10-step window — enough to smooth noise, fast enough to adapt to changes.
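The window sizes above can be turned into a rough adaptation lag. The sketch below is illustrative only: it starts E at an old steady state, switches the gradient magnitude abruptly, and counts the steps until E is within 10% of its new steady state. The helper name and the values of `g_old`, `g_new`, and `tol` are assumptions, not taken from the post.

```python
def steps_to_adapt(beta, g_old=0.1, g_new=1.0, tol=0.1):
    """Steps until E is within `tol` (relative) of g_new**2 after a jump.

    E starts at its old steady state g_old**2; from then on every
    gradient has magnitude g_new. All default values are illustrative.
    """
    E = g_old ** 2
    target = g_new ** 2
    steps = 0
    while abs(E - target) > tol * target:
        E = beta * E + (1 - beta) * target
        steps += 1
    return steps

for beta in (0.5, 0.9, 0.99):
    print(f"beta={beta}: {steps_to_adapt(beta)} steps to adapt")
```

The lag grows roughly in proportion to 1/(1−β), which is the trade-off behind the default: a longer window smooths noise but takes longer to notice a change in gradient scale.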
Code
```python
import numpy as np

gradients = [(0.5, 0.0), (0.4, 0.3), (0.3, 0.0), (0.5, 0.2), (0.4, 0.0)]
w = np.array([0.0, 0.0])   # parameters w1, w2
E = np.array([0.0, 0.0])   # moving average of squared gradients
eta, beta, eps = 0.1, 0.9, 1e-8

print(f"{'Step':>4} | {'g1':>5} {'g2':>5} | {'E1':>7} {'E2':>7} | {'lr1':>6} {'lr2':>6} | {'w1':>7} {'w2':>7}")
print("-" * 72)
for i, (g1, g2) in enumerate(gradients):
    g = np.array([g1, g2])
    E = beta * E + (1 - beta) * g ** 2   # update the moving average
    eff_lr = eta / np.sqrt(E + eps)      # eps inside the root, matching the update rule
    w -= eff_lr * g                      # per-parameter step
    print(f"{i+1:>4} | {g1:>5.2f} {g2:>5.2f} | {E[0]:>7.4f} {E[1]:>7.4f} | {eff_lr[0]:>6.4f} {eff_lr[1]:>6.4f} | {w[0]:>7.4f} {w[1]:>7.4f}")
```

Output (4 decimal places):

```
Step |    g1    g2 |      E1      E2 |    lr1    lr2 |      w1      w2
------------------------------------------------------------------------
   1 |  0.50  0.00 |  0.0250  0.0000 | 0.6325 1000.0000 | -0.3162  0.0000
   2 |  0.40  0.30 |  0.0385  0.0090 | 0.5096 1.0541 | -0.5201 -0.3162
   3 |  0.30  0.00 |  0.0437  0.0081 | 0.4786 1.1111 | -0.6637 -0.3162
   4 |  0.50  0.20 |  0.0643  0.0113 | 0.3944 0.9411 | -0.8609 -0.5045
   5 |  0.40  0.00 |  0.0739  0.0102 | 0.3680 0.9920 | -1.0081 -0.5045
```

Compare lr₁ at Step 5: RMSProp = 0.368, Adagrad = 0.105 (from the previous post, Step 5). RMSProp's lr is about 3.5× larger because E doesn't grow as fast as G.
Related Concepts
Where this builds from: Adagrad (05) introduced the per-parameter adaptive learning rate. RMSProp is a direct modification of Adagrad: replace the running sum G with an exponential moving average E. Geoffrey Hinton introduced RMSProp in a 2012 Coursera lecture rather than in a published paper; it spread rapidly because it worked on recurrent neural networks, where Adagrad's lr decay was particularly damaging.
Where this leads: Adam (07) = RMSProp (this post, providing v_t the second moment) + Momentum (04, providing m_t the first moment). The second moment v_t in Adam's equations follows the same recursion as E[g²]_t here, with the decay rate named β₂; Adam then applies bias correction to it.
Honest Limitations
β too high → slow adaptation. With β=0.99, a sudden increase in gradient magnitude (e.g., from reaching a steep region of the loss surface) takes on the order of 100 steps to register. During those steps, the effective lr is still calibrated to the old, smaller gradient scale, so it is too large and can cause oscillation or divergence.
RMSProp has no bias correction. At t=1, E₁ = (1−β)·g₁², much smaller than the true mean squared gradient because E is initialized at 0. The first-step effective lr is therefore about η/(√(1−β)·|g₁|), roughly 3.2× the steady-state value for β=0.9. ε only guards against division by zero and does not correct this. Adam adds bias correction to address it; RMSProp does not.
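The size of the start-up bias is easy to see by dividing E by (1 − βᵗ), the correction Adam applies to its second moment. The sketch below is for illustration only: the correction is not part of RMSProp, and the gradients are the w₁ values from the walkthrough.

```python
beta = 0.9
g1_sequence = [0.5, 0.4, 0.3]   # w1 gradients from the walkthrough

E = 0.0
raw, corrected = [], []
for t, g in enumerate(g1_sequence, start=1):
    E = beta * E + (1 - beta) * g ** 2
    E_hat = E / (1 - beta ** t)   # Adam-style bias correction, NOT part of RMSProp
    raw.append(E)
    corrected.append(E_hat)
    print(f"t={t}: raw E={E:.5f}  corrected E={E_hat:.5f}  g^2={g ** 2:.5f}")
```

At t=1 the corrected value equals g₁² exactly, while the raw E is ten times smaller; the gap narrows as βᵗ shrinks.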
RMSProp is less popular than Adam. In practice, Adam is preferred for most deep learning tasks because the momentum term (which RMSProp lacks) provides additional trajectory smoothing. RMSProp is still used in RNNs and RL (where Adam's bias correction can cause issues in non-stationary settings), but for feedforward and CNN training, Adam is the default.
Test Your Understanding
- When β=0, what does RMSProp's E[g²]_t reduce to, and what is the effective learning rate at each step? When β=1, E_t = E_{t−1}, so what happens to the update? Explain why taking β → 1 lengthens the averaging window but never turns RMSProp into Adagrad's running sum G_t = G_{t−1} + g_t².
- At Step 3, E₁=0.0437 for RMSProp vs G₁=0.500 for Adagrad. The gradients at Steps 1, 2, 3 were 0.5, 0.4, 0.3. Verify the RMSProp E₁ value by hand using β=0.9. Then compute what E₁ would be if β=0.5. What does a smaller β do to the contribution of the Step 1 gradient?
- RMSProp uses E[g²] where Adagrad uses G. Both are used in the denominator √(E+ε) or √(G+ε). If E stabilizes at steady state (E → g²_steady), what is the steady-state effective lr? Compare this to the effective lr of Adam (which also uses v̂_t = E[g²] at steady state but with bias correction). Are they the same?
- Geoffrey Hinton introduced RMSProp in a 2012 Coursera lecture without a paper. It became widely used through word-of-mouth. Why might an algorithm spread through informal channels in machine learning, and what does this suggest about the relationship between theoretical guarantees and practical performance in DL optimization?
- In reinforcement learning, the loss landscape changes as the policy improves, so the gradient distribution is non-stationary. A low β (β=0.5) might adapt too quickly to local gradient noise, while β=0.99 is too slow. Propose a schedule for β over training: how would you change β over time to balance adaptation speed with gradient smoothing, and what would trigger a β change?