Adam (Adaptive Moment Estimation) is the default optimizer for most deep learning tasks. It combines two ideas from the previous posts: momentum's exponentially weighted average of gradients (to smooth the direction), and RMSProp's exponentially weighted average of squared gradients (to adapt the learning rate per parameter). It adds one more piece that neither predecessor has: bias correction for the first few steps.
Anchor: same 2-parameter model. w₁=0, w₂=0. η=0.001, β₁=0.9, β₂=0.999, ε=1e-8.
What Adam Combines
| Component | Source | What it does |
|---|---|---|
| m_t (first moment) | Momentum | Exponentially weighted avg of gradients |
| v_t (second moment) | RMSProp | Exponentially weighted avg of squared gradients |
| Bias correction | Adam-specific | Prevents underestimation at small t |
| η·m̂/(√v̂ + ε) | Combined | Per-parameter adaptive lr with momentum |
SGD: θ ← θ − η·g. Adam: θ ← θ − η·(m̂/√v̂).
The 5 Update Equations
1. m_t = β₁·m_{t-1} + (1−β₁)·g_t (first moment — like momentum)
2. v_t = β₂·v_{t-1} + (1−β₂)·g_t² (second moment — like RMSProp)
3. m̂_t = m_t / (1 − β₁^t) (bias correction for first moment)
4. v̂_t = v_t / (1 − β₂^t) (bias correction for second moment)
5. θ_t ← θ_{t-1} − η · m̂_t / (√v̂_t + ε)
Why Bias Correction?
At t=1, with m₀=v₀=0:
m₁ = β₁·0 + (1−β₁)·g₁ = (1−β₁)·g₁
With β₁=0.9 and g₁=0.5: m₁ = 0.1 × 0.5 = 0.05
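If we used m₁ directly as the gradient estimate, we'd be using 0.05 as an estimate of g₁=0.5, a 10× underestimate. Looking at the first moment alone, the step would be 10× smaller than it should be.
Bias correction: m̂₁ = m₁ / (1 − β₁¹) = 0.05 / (1 − 0.9) = 0.05 / 0.1 = 0.5, restored to the scale of g₁.
At t=1: divisor = (1−0.9¹) = 0.1. At t=2: divisor = (1−0.81) = 0.19. At t=10: divisor = (1−0.9¹⁰) ≈ 0.65. At t=100: divisor ≈ 1.0. After ~50 steps the first-moment correction is negligible (0.9⁵⁰ ≈ 0.005). The second-moment correction fades far more slowly because β₂=0.999: at t=100 its divisor is still (1−0.999¹⁰⁰) ≈ 0.095, and it takes a few thousand steps to approach 1. Only after that does Adam behave like uncorrected momentum + RMSProp.
For v̂: at t=1 with β₂=0.999: divisor = (1−0.999) = 0.001, so v̂₁ = v₁ / 0.001 = g₁², a 1000× amplification of v₁. With both corrections applied, m̂₁/√v̂₁ = g₁/|g₁| = ±1, so Adam's first step has size η even though m and v are both initialized at 0. With neither correction the ratio would be 0.1·g₁ / √(0.001·g₁²) ≈ ±3.2, so the first step would be about 3× too large rather than too small.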
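To see how quickly each correction fades, the divisors can be tabulated directly. This is a small sketch rather than part of the optimizer itself; the helper names and the 1% tolerance used to call a correction "negligible" are assumptions chosen for illustration.

```python
import math


def correction_divisor(beta: float, t: int) -> float:
    """Bias-correction divisor 1 - beta**t, used for both moments."""
    return 1.0 - beta ** t


def steps_until_negligible(beta: float, tol: float = 0.01) -> int:
    """Smallest t with beta**t <= tol (tol is an assumed threshold)."""
    return math.ceil(math.log(tol) / math.log(beta))


for t in (1, 2, 10, 100, 1000):
    d1 = correction_divisor(0.9, t)    # first moment, beta1 = 0.9
    d2 = correction_divisor(0.999, t)  # second moment, beta2 = 0.999
    print(f"t={t:>4}: 1-beta1^t={d1:.4f}  1-beta2^t={d2:.4f}")

print("first moment :", steps_until_negligible(0.9), "steps")
print("second moment:", steps_until_negligible(0.999), "steps")
```

The first-moment divisor reaches 1 within a few dozen steps, while the second-moment divisor needs thousands, which is why the correction on v̂ matters for much longer.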
Numerical Walkthrough (3 Steps for w₁)
Starting: m=0, v=0, w=0. Gradient sequence g₁=0.5, g₂=0.4, g₃=0.3.
t=1, g=0.5:
m₁ = 0.9×0 + 0.1×0.5 = 0.0500
v₁ = 0.999×0 + 0.001×0.5² = 0.001×0.25 = 0.000250
m̂₁ = 0.0500 / (1 − 0.9¹) = 0.0500 / 0.100 = 0.5000
v̂₁ = 0.000250 / (1 − 0.999¹) = 0.000250 / 0.001 = 0.2500
w₁ = 0 − 0.001 × 0.5000 / (√0.2500 + 1e-8) = 0 − 0.001 × 0.5000 / 0.5000 = −0.001000
t=2, g=0.4:
m₂ = 0.9×0.0500 + 0.1×0.4 = 0.0450 + 0.0400 = 0.0850
v₂ = 0.999×0.000250 + 0.001×0.4² = 0.00024975 + 0.00016000 = 0.00040975
m̂₂ = 0.0850 / (1 − 0.9²) = 0.0850 / (1 − 0.81) = 0.0850 / 0.19 = 0.4474
v̂₂ = 0.00040975 / (1 − 0.999²) = 0.00040975 / 0.001999 = 0.2050
w₁ = −0.001000 − 0.001 × 0.4474 / (√0.2050 + 1e-8) = −0.001000 − 0.001 × 0.4474 / 0.4528 = −0.001000 − 0.000988 = −0.001988
t=3, g=0.3:
m₃ = 0.9×0.0850 + 0.1×0.3 = 0.0765 + 0.0300 = 0.1065
v₃ = 0.999×0.00040975 + 0.001×0.3² = 0.00040934 + 0.00009000 = 0.00049934
m̂₃ = 0.1065 / (1 − 0.9³) = 0.1065 / (1 − 0.729) = 0.1065 / 0.271 = 0.3930
v̂₃ = 0.00049934 / (1 − 0.999³) = 0.00049934 / 0.002997 = 0.1666
w₁ = −0.001988 − 0.001 × 0.3930 / (√0.1666 + 1e-8) = −0.001988 − 0.001 × 0.3930 / 0.4082 = −0.001988 − 0.000963 = −0.002951
Trace Table
| t | g | m_t | v_t | m̂_t | v̂_t | w_t |
|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.0500 | 0.000250 | 0.5000 | 0.2500 | −0.001000 |
| 2 | 0.4 | 0.0850 | 0.000410 | 0.4474 | 0.2050 | −0.001988 |
| 3 | 0.3 | 0.1065 | 0.000499 | 0.3930 | 0.1666 | −0.002951 |
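Each step moves by roughly η = 0.001 (0.001000, 0.000988, 0.000963). This is by design: the m̂/√v̂ ratio rescales the gradient to roughly unit magnitude, so Adam moves by approximately η per step regardless of gradient scale. The steps shrink slightly here because the gradient is decreasing, so m̂ falls a little faster than √v̂.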
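The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch for a single scalar weight, using the same hyperparameters and gradient sequence as the walkthrough; the function name `adam_trace` is just an illustrative choice.

```python
import math


def adam_trace(grads, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam from m = v = w = 0 over a fixed gradient sequence."""
    m = v = w = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
        rows.append((t, g, m, v, m_hat, v_hat, w))
    return rows


rows = adam_trace([0.5, 0.4, 0.3])
for t, g, m, v, m_hat, v_hat, w in rows:
    print(f"t={t} g={g} m={m:.4f} v={v:.6f} "
          f"m_hat={m_hat:.4f} v_hat={v_hat:.4f} w={w:.6f}")
```

Each printed row corresponds to one row of the trace table, so any arithmetic slip in a manual calculation shows up immediately.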
Why Adam is the Default
Adam often converges faster than plain SGD, momentum, or RMSProp alone because it combines their advantages in a single update:
- Per-parameter lr (from RMSProp): parameters with large, consistent gradients get smaller steps; sparse parameters get larger steps
- Momentum (from first moment): consistent directions accumulate; oscillating directions cancel
- Bias correction: the optimizer takes full-sized steps from the very first update
Adam is also robust to noisy gradients, sparse features, non-stationary objectives, and a wide range of learning rates (η=0.001 works across many tasks without tuning).
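The per-parameter scaling is easiest to see side by side with SGD. The sketch below applies three updates to two parameters whose gradients differ by four orders of magnitude; the gradient values are hypothetical and held constant purely for illustration, and `sgd_step` / `adam_step` are helper names introduced here.

```python
import numpy as np


def sgd_step(theta, g, eta):
    """Plain SGD: the step is proportional to the raw gradient."""
    return theta - eta * g


def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v


g = np.array([100.0, 0.01])  # hypothetical gradients on very different scales
theta_sgd = np.zeros(2)
theta_adam = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 4):
    theta_sgd = sgd_step(theta_sgd, g, eta=0.001)
    theta_adam, m, v = adam_step(theta_adam, g, m, v, t)

print("SGD :", theta_sgd)   # the two coordinates move by very different amounts
print("Adam:", theta_adam)  # both coordinates move by about 3 * eta
```

With SGD the first parameter moves 10,000 times further than the second. With Adam both move by about η per step, which is what "per-parameter learning rate" means in practice.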
When Adam Might Not Be Best
Generalization gap on image classification. Models trained with SGD + momentum sometimes generalize better than Adam on tasks like CIFAR-10 and ImageNet. A commonly proposed explanation parallels the large-batch SGD problem: Adam's adaptive lr may steer it toward sharper minima that fit the training data closely, while SGD + momentum, with a single global lr and noisier effective updates, tends to find wider, flatter minima that generalize better. This is a working hypothesis rather than a settled result.
Practical rule: Use Adam for NLP, RNNs, transformers, and when you don't want to tune η carefully. Use SGD + momentum for CNNs on image classification when you have time to tune the lr schedule (often gives 1–2% better test accuracy than Adam).
Hyperparameter Guide
| Parameter | Default | When to change | Notes |
|---|---|---|---|
| η | 0.001 | Most important to tune | Start here; try 3e-4, 1e-3, 3e-3 |
| β₁ | 0.9 | Rarely | Reduces to bias-corrected RMSProp if β₁=0 |
| β₂ | 0.999 | Rarely | 0.99 for noisy objectives (RL) |
| ε | 1e-8 | If NaN | 1e-4 if numerical instability |
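A learning-rate sweep over the three suggested values can be sketched as follows. The toy objective f(θ) = θ², the 200-step budget, and the helper names are assumptions made for illustration; on a real model the comparison would use validation loss rather than distance to a known minimum.

```python
import numpy as np


def run_adam(grad_fn, theta0, eta, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam loop without logging; returns the final parameters."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta


def toy_grad(theta):
    """Gradient of f(θ) = θ², a stand-in for a real loss."""
    return 2 * theta


finals = {
    eta: float(run_adam(toy_grad, [1.0], eta, steps=200)[0])
    for eta in (3e-4, 1e-3, 3e-3)
}
for eta, theta in finals.items():
    print(f"eta={eta:g}: θ after 200 steps = {theta:.4f}, loss = {theta ** 2:.4f}")
```

Because Adam moves by roughly η per step, the distance covered in a fixed budget scales almost linearly with η, which is why η is the first hyperparameter worth tuning.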
Code
```python
import numpy as np


def adam(grad_fn, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5):
    """Run Adam for `steps` updates from theta0 and return the final parameters."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)  # first moment: EWMA of gradients
    v = np.zeros_like(theta)  # second moment: EWMA of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
        print(f"t={t}: g={g.round(4)}, m={m.round(4)}, v={v.round(6)}, θ={theta.round(4)}")
    return theta


# Minimize f(θ) = θ²; the true minimum is at θ=0
grad_fn = lambda theta: 2 * theta
print("Minimizing f(θ) = θ²:")
adam(grad_fn, [1.0])
```

```
Minimizing f(θ) = θ²:
t=1: g=[2.], m=[0.2], v=[0.004], θ=[0.999]
t=2: g=[1.998], m=[0.3798], v=[0.007988], θ=[0.998]
t=3: g=[1.996], m=[0.5414], v=[0.011964], θ=[0.997]
t=4: g=[1.994], m=[0.6867], v=[0.015928], θ=[0.996]
t=5: g=[1.992], m=[0.8172], v=[0.01988], θ=[0.995]
```

Each step moves θ by ~0.001 (= η). The gradient is 2θ ≈ 2.0 throughout the first few steps; Adam normalizes this to approximately unit magnitude and takes a step of size η. The step does not shrink in proportion to the gradient the way SGD's does. It shrinks near the minimum only because √v̂ still remembers the older, larger gradients while m̂ tracks the recent, smaller ones, and because sign flips around θ=0 cancel inside m̂.
Related Concepts
Where this builds from: Momentum (04) provides the first moment m_t — a smoothed gradient. RMSProp (06) provides the second moment v_t — an adaptive denominator. Adam wraps both with bias correction. Understanding each piece separately makes Adam's equations fully derivable rather than memorizable.
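Where this leads: AdamW (Adam + decoupled weight decay) is the modern default for transformer training. With L2 regularization in Adam, λ·θ is added to the gradient before moment estimation, so the decay term is divided by √v̂ along with everything else and parameters with a large gradient history are decayed less than intended. AdamW applies weight decay directly to the parameters, outside the adaptive scaling, which keeps the adaptive lr clean. AdamW is widely used for training large transformer language models.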
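The difference between the two placements of the decay term can be sketched in a few lines. This follows the update rules as described in this post, not any particular library's implementation; the function names, λ = 0.01, and the zero-gradient test case are assumptions picked to isolate the decay effect.

```python
import numpy as np


def _adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moments and return (m_hat / (sqrt(v_hat) + eps), m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, g, m, v, t, eta=0.001, lam=0.01):
    """Adam + L2: lam * theta joins the gradient before the moments."""
    direction, m, v = _adam_direction(g + lam * theta, m, v, t)
    return theta - eta * direction, m, v


def adamw_step(theta, g, m, v, t, eta=0.001, lam=0.01):
    """AdamW: the decay term bypasses the adaptive scaling."""
    direction, m, v = _adam_direction(g, m, v, t)
    return theta - eta * direction - eta * lam * theta, m, v


# A weight of 1.0 whose loss gradient is exactly zero: only decay acts on it.
theta0 = np.array([1.0])
zero_grad = np.array([0.0])
theta_l2, _, _ = adam_l2_step(theta0, zero_grad, np.zeros(1), np.zeros(1), t=1)
theta_w, _, _ = adamw_step(theta0, zero_grad, np.zeros(1), np.zeros(1), t=1)
print("Adam + L2:", theta_l2)  # moves by about eta: the decay got normalized
print("AdamW    :", theta_w)   # shrinks by eta * lam * theta
```

In the L2 variant the decay signal is rescaled to unit magnitude like any other gradient, so its strength no longer depends on λ in the way one would expect. In AdamW the shrinkage is exactly η·λ·θ.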
Honest Limitations
Convergence to sharp minima. Adam's adaptive per-parameter lr allows it to fit the training set very efficiently — sometimes too efficiently. For tasks where generalization matters more than train loss (most real tasks), SGD + momentum with a tuned learning rate schedule often produces better final test performance. This has been documented on CIFAR-10/100 and ImageNet since 2017.
ε matters in practice. With the default ε=1e-8, a parameter whose gradients are tiny but still larger than ε has m̂/√v̂ ≈ ±1, so noise-level gradients on dormant parameters produce steps of nearly full size η. A larger ε, such as 1e-4, damps those updates and is sometimes used for models with many small gradients (e.g., transformers with gradient clipping). The trade-off is that a large ε also weakens the per-parameter adaptation for every parameter whose √v̂ is comparable to ε.
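The interaction between ε and gradient magnitude can be checked on Adam's very first update, where m̂₁ = g and v̂₁ = g². The sketch below is a scalar illustration with hypothetical gradient values; `first_adam_step` is a helper name introduced here.

```python
import math


def first_adam_step(g, eps, eta=0.001, beta1=0.9, beta2=0.999):
    """Size of Adam's first update (t=1, m0 = v0 = 0) for a scalar gradient g."""
    m_hat = (1 - beta1) * g / (1 - beta1)      # equals g
    v_hat = (1 - beta2) * g * g / (1 - beta2)  # equals g**2
    return eta * m_hat / (math.sqrt(v_hat) + eps)


for g in (1e-2, 1e-6):          # an ordinary gradient and a near-dormant one
    for eps in (1e-8, 1e-4):
        step = first_adam_step(g, eps)
        print(f"g={g:g}  eps={eps:g}  step={step:.3e}")
```

With ε=1e-8 both gradients produce a step close to η. With ε=1e-4 the ordinary gradient is barely affected, while the near-dormant one is damped by about two orders of magnitude.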
Adam is not theoretically proven to converge in general. The original Adam paper gave a convergence proof for convex objectives, but that proof was later shown to contain an error, and there are simple counterexamples (including convex ones) where Adam fails to converge. AMSGrad was proposed to fix this by using the running maximum of past second-moment estimates in the denominator. In practice, Adam converges reliably, but its theoretical guarantees are weaker than SGD's.
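For reference, the AMSGrad modification amounts to one extra line. The sketch below is a simplified variant that keeps a running maximum of the raw second moment and omits bias correction; library implementations may differ in those details, and the function name and the test gradient sequence are illustrative assumptions.

```python
import numpy as np


def amsgrad_step(theta, g, m, v, v_max, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified AMSGrad update (no bias correction in this sketch)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)  # the denominator can never shrink
    theta = theta - eta * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max


theta = np.zeros(1)
m = np.zeros(1)
v = np.zeros(1)
v_max = np.zeros(1)
history = []
for g in (1.0, 0.0, 0.0):  # one large gradient followed by zeros
    theta, m, v, v_max = amsgrad_step(theta, np.array([g]), m, v, v_max)
    history.append((float(v[0]), float(v_max[0])))

for t, (v_t, vmax_t) in enumerate(history, start=1):
    print(f"t={t}: v={v_t:.6f}  v_max={vmax_t:.6f}")
```

After the large gradient, v decays step by step while v_max stays put, so the effective per-parameter learning rate η/√v_max never increases. That monotonicity is the property the convergence fix relies on.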
Test Your Understanding
- At t=1 with g=0.5, bias-corrected m̂₁=0.5 and v̂₁=0.25. The step is η × m̂₁/√v̂₁ = 0.001 × 0.5/0.5 = 0.001. Now suppose the gradient is 10× larger: g=5.0. Compute m̂₁ and v̂₁ with this new gradient. What is the step size? Does Adam step 10× larger for a 10× larger gradient? Why or why not?
- The bias correction divisor at t=1 for β₁=0.9 is (1−0.9) = 0.1, giving m̂ = m/0.1 = 10m. For β₂=0.999, the divisor is (1−0.999) = 0.001, giving v̂ = v/0.001 = 1000v. This means at t=1, v is amplified 100× more than m (1000× versus 10×). Why is this important for the first step's stability?
- Adam's step equation is θ ← θ − η·m̂/√v̂. Suppose you set β₁=0 (no momentum) and β₂=0 (no second moment history). Show what m̂ and v̂ reduce to at each step t. What optimizer does Adam reduce to in this case?
- The "generalization gap" between Adam and SGD+momentum on ImageNet suggests Adam converges to sharper minima. Assuming this is caused by Adam's per-parameter lr (not its momentum), propose an experiment that would verify this: what specific change to Adam's hyperparameters would make it more likely to find wider minima?
- AdamW decouples weight decay from the gradient update: θ ← θ − η·(m̂/√v̂) − η·λ·θ. Standard Adam applies L2 regularization by adding λ·θ to the gradient g before computing moments. Explain why these two approaches are different. For a parameter with a large v̂ (large accumulated squared gradient), how does the weight decay contribution relative to the gradient-driven update differ between Adam-L2 and AdamW?