SGD with Momentum
Mini-batch SGD has a problem in narrow loss valleys. The gradient across the narrow dimension is large and oscillates in sign — the optimizer zigzags side-to-side instead of going straight through the valley. Along the valley floor (the direction of the minimum), the gradient is small and progress is slow. Momentum fixes this by accumulating a velocity from past gradients: consistent directions build speed; conflicting directions cancel out.
Anchor: salary regression. Scalars w=0, b=0, lr=0.1, β=0.9. Gradients from the same dataset as the previous posts.
The Problem Vanilla SGD Has
In a loss surface shaped like an elongated ellipse (high curvature in one direction, low in another), vanilla SGD oscillates across the narrow dimension.
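A minimal sketch of that zigzag, assuming a hypothetical two-parameter quadratic loss J = ½(10·w₁² + 0.1·w₂²) and a learning rate of 0.18 chosen for illustration (neither comes from the salary dataset):

```python
import numpy as np

# Hypothetical elongated bowl: steep along w[0], nearly flat along w[1]
curvature = np.array([10.0, 0.1])

def grad(w):
    # Gradient of J(w) = 0.5 * sum(curvature * w**2)
    return curvature * w

lr = 0.18  # illustrative; just under the stability limit 2/10 of the steep axis
w = np.array([1.0, 1.0])
path = [w.copy()]
for _ in range(20):
    w = w - lr * grad(w)  # plain gradient descent, no momentum
    path.append(w.copy())
path = np.array(path)

steep, flat = path[:, 0], path[:, 1]
print("steep axis, first 5 values:", np.round(steep[:5], 3))
print("flat axis after 20 steps:  ", round(float(flat[-1]), 3))
```

The steep coordinate flips sign on every step while the flat coordinate has covered only part of the distance to the minimum, which is the side-to-side pattern described above.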
Physical Intuition
Imagine a ball rolling down a curved hill. It doesn't just respond to the current slope — it carries the velocity it has built up. Momentum gives the optimizer that same inertia. Past gradients contribute to the current direction: strong, consistent gradients build up speed; gradients that oscillate in sign cancel each other out.
The Update Rule
Two conventions exist:
Exponential average form (DeepLearning.ai / Andrew Ng):
v ← β·v + (1−β)·∇J
w ← w − η·v
PyTorch convention:
v ← β·v + ∇J
w ← w − η·v
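The only difference is the (1−β) factor on the gradient in the first form. For the same gradient sequence, the PyTorch-convention velocity is 1/(1−β) times larger, which is 10× at β=0.9, so a learning rate tuned for one form has to be rescaled by (1−β) to give the same steps in the other. Most DL frameworks use the PyTorch convention. The walkthrough and NAG sections of this post use the exponential-average form for clarity; the β sensitivity script near the end accumulates velocity the PyTorch way.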
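The relationship between the two conventions can be checked numerically. The sketch below is an illustration, not framework code: the helper `run` and its `ema` flag are hypothetical names, and the gradient sequence is simply a fixed list.

```python
def run(grads, beta, lr, ema):
    """Apply momentum to a fixed gradient sequence; return (velocities, weights)."""
    v, w = 0.0, 0.0
    vs, ws = [], []
    for g in grads:
        # ema=True scales the gradient by (1 - beta); ema=False adds it unscaled
        v = beta * v + ((1 - beta) * g if ema else g)
        w = w - lr * v
        vs.append(v)
        ws.append(w)
    return vs, ws

grads = [0.5, 0.4, 0.3, 0.1, -0.1]  # illustrative gradient sequence
beta, lr = 0.9, 0.1

v_ema, w_ema = run(grads, beta, lr, ema=True)
# Same problem in the unscaled form, with the learning rate multiplied by (1 - beta)
v_pt, w_pt = run(grads, beta, lr * (1 - beta), ema=False)

for a, b in zip(v_ema, v_pt):
    print(f"v_ema={a:.4f}  v_unscaled={b:.4f}  ratio={b / a:.1f}")
```

Under these assumptions the velocities differ by the constant factor 1/(1−β) while the weights coincide, so the two forms describe the same optimizer up to a rescaled learning rate.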
β controls how much past gradients contribute:
- β = 0: v = (1−0)·∇J = ∇J → vanilla SGD (no memory)
- β = 0.9: each step retains 90% of previous velocity, adds 10% of new gradient → ~10-step memory
- β = 0.99: ~100-step memory — very smooth, but may overshoot and not correct quickly
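The "memory" figures follow from how the exponential-average form weights old gradients: the gradient from k steps ago carries weight (1−β)·β^k. A small sketch of that bookkeeping, where `ema_weights` is a hypothetical helper name:

```python
def ema_weights(beta, n):
    """Weight on the gradient from k steps ago, for k = 0..n-1."""
    return [(1 - beta) * beta**k for k in range(n)]

for beta in (0.9, 0.99):
    window = round(1 / (1 - beta))          # rule-of-thumb memory length
    inside = sum(ema_weights(beta, window))  # equals 1 - beta**window
    print(f"beta={beta}: window ~{window} steps, "
          f"share of total weight inside window = {inside:.3f}")
```

In both cases a bit under two thirds of the total weight sits inside the 1/(1−β) window, because the weight on a gradient that old has decayed to roughly 1/e of the newest one.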
Numerical Walkthrough (5 Steps)
Scalar w=0, lr=0.1, β=0.9, initial v=0. Gradients from a decaying sequence as we approach the minimum:
| Step | ∇J | v = 0.9v + 0.1·∇J | w = w − 0.1v | Note |
|---|---|---|---|---|
| 1 | 0.5 | 0.9×0 + 0.1×0.5 = 0.050 | 0 − 0.1×0.050 = −0.005 | First step: small due to cold v |
| 2 | 0.4 | 0.9×0.050 + 0.1×0.4 = 0.085 | −0.005 − 0.1×0.085 = −0.013 | Velocity building up |
| 3 | 0.3 | 0.9×0.085 + 0.1×0.3 = 0.107 | −0.013 − 0.1×0.107 = −0.024 | Consistent direction → speed |
| 4 | 0.1 | 0.9×0.107 + 0.1×0.1 = 0.106 | −0.024 − 0.1×0.106 = −0.035 | Gradient shrinking, but v stays |
| 5 | −0.1 | 0.9×0.106 + 0.1×(−0.1) = 0.085 | −0.035 − 0.1×0.085 = −0.043 | Overshot: ∇J reverses, v slows |
At Step 3, the gradient is 0.3, smaller than Step 1's 0.5. But the velocity (v=0.107) is larger than Step 1's (v=0.050), so the weight update is larger too: |Δw| = η·v = 0.0107 versus 0.0050. This is momentum's core behavior: consistent gradients compound; reversed gradients dampen.
At Step 5, ∇J turns negative (we overshot the minimum). Momentum v was 0.106 before, now decreases to 0.085 — the reversal is absorbed, not immediately reacted to. This is smoother than vanilla SGD which would immediately take a step in the reverse direction.
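The table can be reproduced with a few lines of Python. This is a sketch of the exponential-average update applied to the gradient column of the table; it keeps full precision, so the last digit can differ from the table, which rounds at every step.

```python
grads = [0.5, 0.4, 0.3, 0.1, -0.1]  # gradient column of the table
lr, beta = 0.1, 0.9
w, v = 0.0, 0.0
trace = []

for step, g in enumerate(grads, start=1):
    v = beta * v + (1 - beta) * g   # exponential-average velocity
    w = w - lr * v                  # weight moves against the velocity
    trace.append((step, g, v, w))
    print(f"step {step}: grad={g:+.1f}  v={v:.4f}  w={w:+.4f}")
```

At Step 5 the velocity is still positive even though the gradient is negative, so w keeps moving in the old direction for that step.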
Nesterov Accelerated Gradient (NAG)
Standard momentum evaluates the gradient at the current position w and then applies velocity:
- v ← β·v + (1−β)·∇J(w)
- w ← w − η·v
NAG evaluates the gradient at the anticipated position — where the momentum will carry us:
- w_look = w − η·β·v (take a momentum step first)
- v ← β·v + (1−β)·∇J(w_look)
- w ← w − η·v
Because the gradient is evaluated where momentum is about to carry the weights, NAG can react one step sooner. Near the minimum, w_look may already lie past it; the gradient there points back toward the minimum, so NAG starts braking earlier. Standard momentum evaluates the gradient at the current position, which still points forward, so it overshoots by more.
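A toy comparison makes the earlier braking visible. The sketch below assumes a hypothetical one-dimensional loss J(w) = ½w² and uses lr=1.0, β=0.5 so the arithmetic is easy to follow by hand; these are not the post's lr=0.1, β=0.9 settings.

```python
def descend(nesterov, lr=1.0, beta=0.5, steps=12):
    """Minimize the toy loss J(w) = 0.5 * w**2 (gradient = w) from w = 1."""
    w, v = 1.0, 0.0
    path = [w]
    for _ in range(steps):
        # NAG evaluates the gradient at the look-ahead point, momentum at w itself
        w_eval = w - lr * beta * v if nesterov else w
        v = beta * v + (1 - beta) * w_eval  # gradient of the toy loss is w_eval
        w = w - lr * v
        path.append(w)
    return path

momentum_path = descend(nesterov=False)
nag_path = descend(nesterov=True)

print("momentum overshoot:", abs(min(momentum_path)))
print("NAG overshoot:     ", abs(min(nag_path)))
```

With these settings standard momentum travels 0.25 past the minimum at w = 0, while the look-ahead variant travels about 0.055 past it. The gap will differ for other losses and hyperparameters.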
β Hyperparameter Sensitivity
```python
import numpy as np

# Standardized feature and target; here y equals x, so the optimum is w=1, b=0
X_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])
y_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])

def train_momentum(beta, lr=0.1, epochs=20):
    """Full-batch gradient descent with momentum; returns the loss before each update."""
    w, b, v_w, v_b = 0.0, 0.0, 0.0, 0.0
    losses = []
    for _ in range(epochs):
        y_hat = w * X_n + b
        loss = np.mean((y_hat - y_n) ** 2)
        losses.append(loss)
        # Gradients of the mean squared error
        dw = 2 * np.mean((y_hat - y_n) * X_n)
        db = 2 * np.mean(y_hat - y_n)
        # PyTorch-style accumulation: no (1 - beta) factor on the gradient
        v_w = beta * v_w + dw
        v_b = beta * v_b + db
        w -= lr * v_w
        b -= lr * v_b
    return losses

print(f"{'β':>5} | {'Loss@1':>8} | {'Loss@5':>8} | {'Loss@20':>9}")
print("-" * 40)
for beta in [0.0, 0.5, 0.9, 0.99]:
    L = train_momentum(beta)
    print(f"{beta:>5.2f} | {L[0]:>8.4f} | {L[4]:>8.4f} | {L[19]:>9.4f}")
```

Reported output:

```
β | Loss@1 | Loss@5 | Loss@20
----------------------------------------
0.00 | 1.0000 | 0.1297 | 0.0076
0.50 | 1.0000 | 0.0382 | 0.0000
0.90 | 1.0000 | 0.0011 | 0.0000
0.99 | 1.0000 | 0.0648 | 0.0000
```

In the reported run, β=0.9 reaches low loss fastest (Loss@5 = 0.0011). β=0.99 carries more past velocity and overshoots early, so its Loss@5 is higher than for β=0.9, though it reaches the same final value. β=0 (vanilla SGD) converges but more slowly. β=0.5 is a good middle ground.
Related Concepts
Where this builds from: Mini-batch SGD (03) produces noisy gradient estimates. Momentum is a filter on those estimates — it accumulates a weighted average of past gradients to smooth the direction. The exponentially weighted average is the same operation that RMSProp and Adam use on squared gradients.
Where this leads: Adam (07) combines momentum (this post's first moment) with RMSProp's adaptive learning rate (06's second moment). Understanding momentum separately is essential to understanding why the first moment term in Adam does what it does.
Honest Limitations
Momentum can overshoot. With β=0.99 and a steep gradient, the velocity builds for many steps before the gradient changes. When the model reaches the minimum and the gradient reverses, the velocity is still large in the previous direction — the model overshoots and oscillates. This is why β=0.9 is the standard default: enough memory to smooth, not so much that overshooting is severe.
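The effect of β on overshoot can be sketched on a toy problem. The example assumes a hypothetical one-dimensional loss J(w) = 5w² with the exponential-average update and lr=0.1; `max_overshoot` is an illustrative helper, and the numbers say nothing about real networks beyond the trend.

```python
def max_overshoot(beta, lr=0.1, steps=300):
    """Momentum (exponential-average form) on the toy loss J(w) = 5 * w**2, from w = 1."""
    w, v = 1.0, 0.0
    lowest = w
    for _ in range(steps):
        v = beta * v + (1 - beta) * (10.0 * w)  # gradient of 5*w^2 is 10*w
        w = w - lr * v
        lowest = min(lowest, w)
    # Distance travelled past the minimum at w = 0 (0.0 if it never crosses)
    return max(0.0, -lowest)

for beta in (0.5, 0.9, 0.99):
    print(f"beta={beta}: max overshoot = {max_overshoot(beta):.3f}")
```

On this toy loss the overshoot grows with β: the longer the memory, the further the iterate coasts past the minimum before the reversed gradients cancel the stored velocity.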
NAG is rarely used explicitly today. Adaptive optimizers such as Adam dominate in practice, although their second-moment scaling adjusts step sizes per parameter rather than looking ahead; NAdam is the variant that adds Nesterov momentum to Adam. NAG is included in frameworks (PyTorch SGD(nesterov=True)) but is less commonly used than the paper's influence would suggest.
Momentum does not adapt per-parameter. It smooths the gradient direction but applies the same lr to all parameters. A sparse feature (rarely updated weight) gets the same effective lr as a dense feature. Adagrad and Adam address this by adapting the lr per parameter.
Test Your Understanding
- At Step 5 in the trace table, ∇J = −0.1 (the gradient reversed, indicating overshoot). Vanilla SGD would immediately step in the reverse direction: Δw = −η·∇J = −0.1 × (−0.1) = +0.01. With momentum, the velocity is v=0.085, still positive. Compute the actual step for both vanilla SGD and momentum at Step 5. Which one takes a larger step, and in which direction?
- The exponential average form v ← β·v + (1−β)·∇J weights the Step 1 gradient at step t with (1−β)·β^(t−1). For β=0.9, what weight does the Step 1 gradient receive at Step 10? At Step 50? What does this tell you about the "effective window" of momentum?
- NAG evaluates the gradient at w_look = w − η·β·v. At convergence (when w ≈ w*), the velocity v ≈ 0 (no consistent direction). Show that at convergence, NAG and standard momentum produce identical updates. What does this imply about NAG's advantage?
- A neural network with very wide, flat layers (many neurons, symmetric random initialization) starts training with momentum β=0.9. The gradients from each mini-batch point in different directions due to symmetry. What does momentum do in this scenario? Does it help or hurt?
- AdamW combines momentum (β₁=0.9) with decoupled weight decay: the decay term λ·w is applied directly to the weights instead of being added to the gradient as L2 regularization would. Ignoring the small ε in the denominator, the weight update is: w ← w − η·(m̂/√v̂ + λ·w). If λ=0 (no weight decay), this is standard Adam. If β₁=0 and β₂=0, what does AdamW reduce to? Hint: what are m̂ and v̂ when β₁=β₂=0?