RMSProp

Jul 1, 2026 • 8 min read • By Mohammed Vasim

deep-learning, neural-networks, machine-learning, representation-learning

Adagrad accumulates squared gradients forever. After enough training steps, the effective learning rate shrinks toward zero for every parameter that keeps receiving gradients, and the model can stall before reaching a minimum. RMSProp (Root Mean Square Propagation) fixes this with a single change: replace the running sum G_t with an exponentially weighted moving average E[g²]_t. Old gradients fade out; only recent gradient history matters.

Same running example as the Adagrad post: w₁ (dense, nonzero gradient every step) and w₂ (sparse, nonzero gradient on only 2 of the 5 steps used here), with the same gradient sequence.

Adagrad's Problem in One Paragraph

Adagrad: G_t = G_{t-1} + g_t². G grows monotonically. After t steps, G ≈ t × (mean g²), so the effective lr = η/√G shrinks in proportion to 1/√t. At t=10,000 it is 1% of its step-1 value; at t=100,000, about 0.3%. Adagrad was designed for convex optimization, where this decay toward zero is acceptable. Deep neural networks are trained for hundreds of thousands of steps on non-convex, non-stationary loss surfaces, so learning must continue throughout training. RMSProp solves this.
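The 1/√t decay can be checked with a few lines of Python. This is a minimal sketch that assumes every gradient has the same magnitude; the values of `eta` and `g` are illustrative, and `adagrad_effective_lr` is a hypothetical helper, not something from the Adagrad post.

```python
import math

eta = 0.1   # base learning rate (illustrative value)
g = 0.5     # assumed constant gradient magnitude


def adagrad_effective_lr(t, g=g, eta=eta):
    """Effective lr after t steps when every gradient equals g."""
    G = t * g ** 2              # running sum of squared gradients
    return eta / math.sqrt(G)


lr_first = adagrad_effective_lr(1)
for t in (1, 100, 10_000, 100_000):
    lr_t = adagrad_effective_lr(t)
    print(f"t={t:>7,}  effective lr = {lr_t:.6f}  ({lr_t / lr_first:.2%} of step 1)")
```

The fraction depends only on t, not on `eta` or `g`, which is why the decay hits every frequently updated parameter regardless of its gradient scale.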

The Fix — Exponential Moving Average

Instead of summing all squared gradients, RMSProp maintains a decaying average:

E[g²]_t = β · E[g²]_{t-1} + (1−β) · g_t²

θ ← θ − (η / √(E[g²]_t + ε)) · g_t

β = 0.9 (default). Compare:

Adagrad: G_t = G_{t-1} + g_t² (running sum, never forgets)
RMSProp: E_t = β·E_{t-1} + (1−β)·g_t² (exponential average, old info fades)
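Written as code, the two rules differ in a single line. The sketch below is illustrative only: the function names `adagrad_step` and `rmsprop_step` are hypothetical, and ε is placed inside the square root to match the update rule written above.

```python
import numpy as np


def adagrad_step(w, G, g, eta=0.1, eps=1e-8):
    """One Adagrad update: G is a running sum of squared gradients."""
    G = G + g ** 2
    w = w - eta / np.sqrt(G + eps) * g
    return w, G


def rmsprop_step(w, E, g, eta=0.1, beta=0.9, eps=1e-8):
    """One RMSProp update: E is an exponential moving average of squared gradients."""
    E = beta * E + (1 - beta) * g ** 2
    w = w - eta / np.sqrt(E + eps) * g
    return w, E


# First step of the running example: g1 = 0.5, g2 = 0.0
g = np.array([0.5, 0.0])
w_ada, G = adagrad_step(np.zeros(2), np.zeros(2), g)
w_rms, E = rmsprop_step(np.zeros(2), np.zeros(2), g)
print("Adagrad  w:", w_ada, " G:", G)
print("RMSProp  w:", w_rms, " E:", E)
```

Both functions return new arrays instead of mutating their inputs, so the same starting state can be fed to each rule for a side-by-side comparison.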

The weight given to a squared gradient k steps in the past is (1−β)·β^k. For β=0.9, the gradient 10 steps back has weight 0.1 × 0.9¹⁰ ≈ 0.035; the one 50 steps back has weight 0.1 × 0.9⁵⁰ ≈ 0.0005, effectively zero. RMSProp effectively has a window of ~1/(1−β) = 10 steps.
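A short sketch of these weights, assuming β=0.9. The helper `ema_weight` is hypothetical and simply evaluates (1−β)·β^k; the last lines also report how much of the total weight falls inside the ~10-step window.

```python
beta = 0.9


def ema_weight(k, beta=beta):
    """Weight on the squared gradient from k steps ago: (1 - beta) * beta**k."""
    return (1 - beta) * beta ** k


for k in (0, 1, 10, 50):
    print(f"k={k:>2}: weight = {ema_weight(k):.4f}")

# Share of the total weight carried by the most recent `window` gradients
window = round(1 / (1 - beta))          # 10 for beta = 0.9
recent_share = sum(ema_weight(k) for k in range(window))
print(f"last {window} steps carry {recent_share:.1%} of the weight")
```

The window is a soft one: gradients older than 10 steps still contribute, just with geometrically shrinking weight.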

Why Exponential Average Prevents lr Decay

With a constant gradient g per step, the steady-state value of E[g²]:

E_steady = (1−β)·g² × (1 + β + β² + ...) = (1−β)·g² × (1/(1−β)) = g²

Effective lr at steady state: η / √(g² + ε) ≈ η/|g|, a constant.

Adagrad at step t: η/√(t·g²) = η/(|g|√t) → 0. RMSProp at steady state: η/|g|, stable.
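The steady-state argument can be checked by simulation. This sketch assumes a constant gradient g=0.5 (an illustrative value) and tracks both accumulators for 200 steps.

```python
import math

eta, beta, eps = 0.1, 0.9, 1e-8
g = 0.5                      # assumed constant gradient

E, G = 0.0, 0.0
for t in range(1, 201):
    E = beta * E + (1 - beta) * g ** 2   # RMSProp moving average
    G = G + g ** 2                        # Adagrad running sum
    if t in (1, 10, 50, 200):
        lr_rms = eta / math.sqrt(E + eps)
        lr_ada = eta / math.sqrt(G + eps)
        print(f"t={t:>3}  E={E:.4f}  RMSProp lr={lr_rms:.4f}  Adagrad lr={lr_ada:.4f}")
```

E converges to g² = 0.25, so the RMSProp lr settles near η/|g| = 0.2, while the Adagrad lr keeps shrinking as 1/√t.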

Numerical Walkthrough (3 Steps, 2 Parameters)

Initial: w₁=0, w₂=0, E₁=0, E₂=0, η=0.1, β=0.9, ε=1e-8.

Step 1: g₁=0.5, g₂=0.0

E₁ = 0.9×0 + 0.1×0.5² = 0.1×0.25 = 0.025

E₂ = 0.9×0 + 0.1×0² = 0.000

lr₁ = 0.1 / √(0.025 + 1e-8) = 0.1 / 0.158 = 0.632

w₁ = 0 − 0.632 × 0.5 = −0.316 (Adagrad Step 1 gave −0.100 — RMSProp step is 3× larger)

Step 2: g₁=0.4, g₂=0.3

E₁ = 0.9×0.025 + 0.1×0.4² = 0.0225 + 0.016 = 0.0385

E₂ = 0.9×0.000 + 0.1×0.3² = 0 + 0.009 = 0.009

lr₁ = 0.1 / √0.0385 = 0.1 / 0.196 = 0.510

lr₂ = 0.1 / √0.009 = 0.1 / 0.095 = 1.054

w₁ = −0.316 − 0.510×0.4 = −0.316 − 0.204 = −0.520

w₂ = 0 − 1.054×0.3 = −0.316

Step 3: g₁=0.3, g₂=0.0

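E₁ = 0.9×0.0385 + 0.1×0.3² = 0.03465 + 0.009 = 0.04365 ≈ 0.0437

Compare Adagrad's Step 3 G₁=0.500 with RMSProp's E₁=0.0437: RMSProp's accumulator is about 11× smaller. Each new squared gradient enters E with weight (1−β)=0.1 instead of 1, and older terms decay by β per step, so E tracks a weighted average over a ~10-step window rather than the full sum.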

lr₁ = 0.1 / √0.0437 = 0.1 / 0.209 = 0.479

Compare: Adagrad Step 3 lr₁ = 0.141. RMSProp Step 3 lr₁ = 0.479 — 3.4× larger because the effective learning rate stabilizes rather than shrinking to zero.

Trace Table

Step	g₁	g₂	E₁	E₂	lr₁	lr₂	w₁	w₂
1	0.5	0.0	0.0250	0.0000	0.632	1000	−0.316	0.000
2	0.4	0.3	0.0385	0.0090	0.510	1.054	−0.520	−0.316
3	0.3	0.0	0.0437	0.0081	0.479	1.111	−0.664	−0.316

Compare the trend of lr₁: 0.632 → 0.510 → 0.479. Still decreasing, but stabilizing. Adagrad's lr₁: 0.200 → 0.156 → 0.141 → ... → 0. RMSProp's lr stabilizes around η/√(steady-state E) rather than going to zero. The Step 1 value lr₂ = 1000 is just η/√ε, because E₂ is still 0; it has no effect on w₂ since g₂ = 0.
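The two lr₁ trends can be reproduced side by side. This is a sketch over the five g₁ values used in this post, assuming Adagrad's effective lr is η/√(G+ε) as described earlier; only the accumulators are tracked, not the weights.

```python
import math

g1_sequence = [0.5, 0.4, 0.3, 0.5, 0.4]   # dense-parameter gradients from this post
eta, beta, eps = 0.1, 0.9, 1e-8

G = E = 0.0
adagrad_lrs, rmsprop_lrs = [], []
for g in g1_sequence:
    G += g ** 2                              # Adagrad: running sum
    E = beta * E + (1 - beta) * g ** 2       # RMSProp: moving average
    adagrad_lrs.append(eta / math.sqrt(G + eps))
    rmsprop_lrs.append(eta / math.sqrt(E + eps))

for step, (a, r) in enumerate(zip(adagrad_lrs, rmsprop_lrs), start=1):
    print(f"step {step}: Adagrad lr1 = {a:.3f}   RMSProp lr1 = {r:.3f}   ratio = {r / a:.2f}")
```

Over so few steps both learning rates are still falling; the difference is that RMSProp's has a floor set by the recent gradient scale, while Adagrad's does not.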

RMSProp vs Adagrad

Property	Adagrad	RMSProp
Gradient history	Full sum: G = Σ g²	Exp. avg: E = β·E + (1−β)·g²
lr over time	→ 0 always	Stabilizes around η/√(E_steady)
Memory of old gradients	Permanent	Fades by factor β per step
Best for	Sparse, convex, short runs	Non-convex, deep nets, long runs
Hyperparameters	η	η, β
Typical defaults	η=0.01	η=0.001, β=0.9

β Sensitivity

python

import numpy as np

gradients = [(0.5, 0.0), (0.4, 0.3), (0.3, 0.0), (0.5, 0.2), (0.4, 0.0)]

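def train_rmsprop(beta):
    """Run RMSProp over the 5 gradient pairs and return the final weights."""
    w = np.array([0.0, 0.0])
    E = np.array([0.0, 0.0])          # moving average of squared gradients
    eta, eps = 0.1, 1e-8
    for g1, g2 in gradients:
        g  = np.array([g1, g2])
        E  = beta * E + (1 - beta) * g ** 2
        lr = eta / np.sqrt(E + eps)   # ε inside the root, matching the update rule
        w -= lr * g
    return w

print(f"{'β':>5} | {'w1 after 5 steps':>18} | {'w2 after 5 steps':>18} | Notes")
print("-" * 70)
for beta, note in [(0.5, "fast reaction (noisy)"), (0.9, "default, balanced"), (0.99, "slow reaction to changes")]:
    w = train_rmsprop(beta)
    print(f"{beta:>5.2f} | {w[0]:>18.4f} | {w[1]:>18.4f} | {note}")

text

    β |   w1 after 5 steps |   w2 after 5 steps | Notes
----------------------------------------------------------------------
 0.50 |            -0.5488 |            -0.2546 | fast reaction (noisy)
 0.90 |            -1.0081 |            -0.5045 | default, balanced
 0.99 |            -3.0593 |            -1.5586 | slow reaction to changes

Over only 5 steps starting from E=0, the spread between these rows is driven mostly by initialization rather than adaptation speed. Each update adds only (1−β)·g² to E, so with β=0.99 the average is still far below g² and the effective lr is correspondingly large; with β=0.5, E is already close to the recent g² and the steps are the smallest. The differences in reaction speed described next show up over longer runs.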

β=0.5 reacts aggressively to new gradients — the recent history is short (~2 steps), so the effective lr adjusts quickly. This can cause oscillation if gradient magnitudes change frequently.

β=0.99 reacts very slowly: with a ~100-step window, a new gradient scale takes on the order of 100 steps or more to register. This is stable but will underreact to distribution shifts in the loss landscape.

β=0.9 (default) maintains a ~10-step window — enough to smooth noise, fast enough to adapt to changes.
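Reaction speed can be measured directly. The sketch below is an assumed scenario, not taken from the post: E sits at its steady state for gradients of magnitude 0.1, the magnitude then jumps to 1.0, and the hypothetical helper `steps_to_adapt` counts the steps until E reaches 90% of the new g².

```python
def steps_to_adapt(beta, g_old=0.1, g_new=1.0, fraction=0.9, max_steps=10_000):
    """Steps until E reaches `fraction` of g_new**2 after a jump in gradient scale."""
    E = g_old ** 2                       # steady state for the old gradient scale
    target = fraction * g_new ** 2
    for step in range(1, max_steps + 1):
        E = beta * E + (1 - beta) * g_new ** 2
        if E >= target:
            return step
    return None                          # did not adapt within max_steps


for beta in (0.5, 0.9, 0.99):
    window = 1 / (1 - beta)
    print(f"beta={beta:<4}  window ~ {window:>5.0f} steps  "
          f"steps to reach 90% of new g^2: {steps_to_adapt(beta)}")
```

Reaching 90% takes a little over two windows in each case, so 1/(1−β) is better read as a time constant than as a hard cutoff.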

Code

python

import numpy as np

gradients = [(0.5, 0.0), (0.4, 0.3), (0.3, 0.0), (0.5, 0.2), (0.4, 0.0)]
w   = np.array([0.0, 0.0])
E   = np.array([0.0, 0.0])          # moving average of squared gradients
eta, beta, eps = 0.1, 0.9, 1e-8

# E is printed to 7 decimals so values such as 0.04365 appear exactly
print(f"{'Step':>4} | {'g1':>5} {'g2':>5} | {'E1':>9} {'E2':>9} | {'lr1':>9} {'lr2':>9} | {'w1':>7} {'w2':>7}")
print("-" * 80)
for i, (g1, g2) in enumerate(gradients):
    g       = np.array([g1, g2])
    E       = beta * E + (1 - beta) * g ** 2
    eff_lr  = eta / np.sqrt(E + eps)     # ε inside the root, matching the update rule
    w      -= eff_lr * g
    print(f"{i+1:>4} | {g1:>5.2f} {g2:>5.2f} | {E[0]:>9.7f} {E[1]:>9.7f} | {eff_lr[0]:>9.4f} {eff_lr[1]:>9.4f} | {w[0]:>7.4f} {w[1]:>7.4f}")

text

Step |    g1    g2 |        E1        E2 |       lr1       lr2 |      w1      w2
--------------------------------------------------------------------------------
   1 |  0.50  0.00 | 0.0250000 0.0000000 |    0.6325 1000.0000 | -0.3162  0.0000
   2 |  0.40  0.30 | 0.0385000 0.0090000 |    0.5096    1.0541 | -0.5201 -0.3162
   3 |  0.30  0.00 | 0.0436500 0.0081000 |    0.4786    1.1111 | -0.6637 -0.3162
   4 |  0.50  0.20 | 0.0642850 0.0112900 |    0.3944    0.9411 | -0.8609 -0.5045
   5 |  0.40  0.00 | 0.0738565 0.0101610 |    0.3680    0.9920 | -1.0081 -0.5045

Compare lr₁ at Step 5: RMSProp = 0.368, Adagrad = 0.105 (from previous post, Step 5). RMSProp's lr is about 3.5× larger because E doesn't grow as fast as G.

Related Concepts

Where this builds from: Adagrad (05) introduced the per-parameter adaptive learning rate. RMSProp is a direct modification of Adagrad: replace the running sum G with an exponential moving average E. Geoffrey Hinton introduced RMSProp in a 2012 Coursera lecture, without an accompanying paper; it spread rapidly because it worked on recurrent neural networks, where Adagrad's lr decay was particularly damaging.

Where this leads: Adam (07) = RMSProp (this post, providing v_t the second moment) + Momentum (04, providing m_t the first moment). The second moment v_t in Adam's equations follows the same recurrence as E[g²]_t here, with the decay rate named β₂; Adam then applies bias correction to it.

Honest Limitations

β too high → slow adaptation. With β=0.99, a sudden increase in gradient magnitude (e.g., from reaching a steep region of the loss surface) takes 100 steps to register. During those 100 steps, the effective lr is still calibrated to the old gradient scale — too large — causing oscillation or divergence.

RMSProp has no bias correction. At t=1, E₁ = (1−β)·g₁², much smaller than the true mean squared gradient because E is initialized at 0. The first-step effective lr is therefore η/(√(1−β)·|g₁|), about 3.2× larger than η/|g₁| for β=0.9; ε only guards against division by zero and does not correct this. Adam adds bias correction to address this; RMSProp does not.

RMSProp is less popular than Adam. In practice, Adam is preferred for most deep learning tasks because the momentum term (which RMSProp lacks) provides additional trajectory smoothing. RMSProp is still used in RNNs and RL (where Adam's bias correction can cause issues in non-stationary settings), but for feedforward and CNN training, Adam is the default.
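The missing bias correction can be seen numerically. In this sketch the corrected estimate Ê_t = E_t / (1 − β^t) is the Adam-style correction, applied here for illustration only; it is not part of RMSProp as described in this post. The gradients are the g₁ values from the running example.

```python
import math

g1_sequence = [0.5, 0.4, 0.3, 0.5, 0.4]
eta, beta, eps = 0.1, 0.9, 1e-8

E = 0.0
raw, corrected = [], []
for t, g in enumerate(g1_sequence, start=1):
    E = beta * E + (1 - beta) * g ** 2
    E_hat = E / (1 - beta ** t)          # Adam-style bias correction (not in RMSProp)
    raw.append(E)
    corrected.append(E_hat)
    lr_raw = eta / math.sqrt(E + eps)
    lr_hat = eta / math.sqrt(E_hat + eps)
    print(f"t={t}  g^2={g ** 2:.3f}  E={E:.4f}  E_hat={E_hat:.4f}  "
          f"lr(raw)={lr_raw:.3f}  lr(corrected)={lr_hat:.3f}")
```

At t=1 the corrected estimate equals g₁² exactly, so the corrected lr starts at η/|g₁| = 0.2 instead of 0.632. The gap narrows as β^t shrinks.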

Test Your Understanding

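When β=0, what does RMSProp's E[g²]_t reduce to, and what is the effective learning rate at each step? When β=1 exactly, what happens to E[g²]_t? Explain why RMSProp with β close to 1 behaves like Adagrad with its accumulator scaled by (1−β), rather than reducing to Adagrad itself.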
At Step 3, E₁=0.0437 for RMSProp vs G₁=0.500 for Adagrad. The gradients at Steps 1, 2, 3 were 0.5, 0.4, 0.3. Verify the RMSProp E₁ value by hand using β=0.9. Then compute what E₁ would be if β=0.5. What does a smaller β do to the contribution of the Step 1 gradient?
RMSProp uses E[g²] where Adagrad uses G. Both are used in the denominator √(E+ε) or √(G+ε). If E stabilizes at steady state (E → g²_steady), what is the steady-state effective lr? Compare this to the effective lr of Adam (which also uses v̂_t = E[g²] at steady state but with bias correction). Are they the same?
Geoffrey Hinton introduced RMSProp in a 2012 Coursera lecture without a paper. It became widely used through word-of-mouth. Why might an algorithm spread through informal channels in machine learning, and what does this suggest about the relationship between theoretical guarantees and practical performance in DL optimization?
In reinforcement learning, the loss landscape changes as the policy improves — the gradient distribution is non-stationary. A low β (β=0.5) might adapt too quickly to local gradient noise, while β=0.99 is too slow. Propose a schedule for β over training: how would you change β over time to balance adaptation speed with gradient smoothing, and what would trigger a β change?
