Adam Optimizer

Jul 1, 2026 • 9 min read • By Mohammed Vasim

deep-learning • neural-networks • machine-learning • representation-learning

Adam (Adaptive Moment Estimation) is the default optimizer for most deep learning tasks. It combines two ideas from the previous posts: momentum's exponentially weighted average of gradients (to smooth the direction), and RMSProp's exponentially weighted average of squared gradients (to adapt the learning rate per parameter). It adds one more piece that neither predecessor has: bias correction for the first few steps.

Anchor: same 2-parameter model. w₁=0, w₂=0. η=0.001, β₁=0.9, β₂=0.999, ε=1e-8.

What Adam Combines

Component	Source	What it does
m_t (first moment)	Momentum	Exponentially weighted avg of gradients
v_t (second moment)	RMSProp	Exponentially weighted avg of squared gradients
Bias correction	Adam-specific	Prevents underestimation at small t
η·m̂/√v̂	Combined	Per-parameter adaptive lr with momentum

SGD: θ ← θ − η·g. Adam: θ ← θ − η·(m̂/√v̂).

The 5 Update Equations

1. m_t = β₁·m_{t-1} + (1−β₁)·g_t (first moment — like momentum)

2. v_t = β₂·v_{t-1} + (1−β₂)·g_t² (second moment — like RMSProp)

3. m̂_t = m_t / (1 − β₁^t) (bias correction for first moment)

4. v̂_t = v_t / (1 − β₂^t) (bias correction for second moment)

5. θ_t ← θ_{t-1} − η · m̂_t / (√v̂_t + ε)
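The five equations map almost line for line onto code. The following is a minimal scalar sketch in plain Python; the function name `adam_step` and its argument order are illustrative choices, not taken from any library.

```python
import math

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * g          # eq. 1: first moment
    v = beta2 * v + (1 - beta2) * g * g      # eq. 2: second moment
    m_hat = m / (1 - beta1 ** t)             # eq. 3: bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # eq. 4: bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)  # eq. 5: parameter update
    return theta, m, v

# First step from the anchor setup: w=0, m=v=0, g=0.5, t=1
theta, m, v = adam_step(0.0, 0.5, 0.0, 0.0, t=1)
print(theta, m, v)
```

The caller carries `m`, `v`, and the step counter `t` between calls; `t` must start at 1 so that neither divisor is zero.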

Why Bias Correction?

At t=1, with m₀=v₀=0:

m₁ = β₁·0 + (1−β₁)·g₁ = (1−β₁)·g₁

With β₁=0.9 and g₁=0.5: m₁ = 0.1 × 0.5 = 0.05

If we used m₁ directly as the gradient estimate, we'd be using 0.05 as an estimate of g₁=0.5, a 10× underestimate. With everything else held fixed, the step would be 10× smaller than it should be.

Bias correction: m̂₁ = m₁ / (1 − β₁¹) = 0.05 / (1 − 0.9) = 0.05 / 0.1 = 0.5 — restored to the scale of g₁.

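At t=1: divisor = (1−0.9¹) = 0.1. At t=2: divisor = (1−0.81) = 0.19. At t=10: divisor = (1−0.9¹⁰) ≈ 0.65. At t=100: divisor ≈ 1.0. For the first moment, the correction is negligible after about 50 steps (1−0.9⁵⁰ ≈ 0.995). The second-moment divisor recovers much more slowly with β₂=0.999: 1−0.999¹⁰⁰⁰ ≈ 0.63, so v̂ is still noticeably corrected for a few thousand steps. After that, Adam behaves like momentum combined with RMSProp.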

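For v̂: at t=1 with β₂=0.999 the divisor is (1−0.999) = 0.001, so v̂₁ = v₁ / 0.001, an amplification of 1000×. The two corrections work together. With both applied, m̂₁/√v̂₁ = g₁/|g₁| = ±1, so the first step has size η even though m and v are initialized at 0. With neither applied, m₁/√v₁ = 0.1/√0.001 ≈ 3.16, and the first step would be about 3× too large.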
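To make the two divisors concrete, this short sketch prints them at a few step counts and compares the first-step ratio m/√v with and without correction. The helper name `correction_divisor` is made up for illustration.

```python
import math

def correction_divisor(beta, t):
    """Bias-correction divisor 1 - beta**t (illustrative helper)."""
    return 1.0 - beta ** t

beta1, beta2 = 0.9, 0.999
for t in (1, 2, 10, 50, 100, 1000):
    d1 = correction_divisor(beta1, t)
    d2 = correction_divisor(beta2, t)
    print(f"t={t:4d}  1-beta1^t={d1:.4f}  1-beta2^t={d2:.4f}")

# First step with g = 0.5: raw vs. bias-corrected ratio m / sqrt(v)
g = 0.5
m1 = (1 - beta1) * g
v1 = (1 - beta2) * g * g
raw_ratio = m1 / math.sqrt(v1)
corrected_ratio = (m1 / correction_divisor(beta1, 1)) / math.sqrt(v1 / correction_divisor(beta2, 1))
print(f"raw m/sqrt(v) = {raw_ratio:.3f}, corrected = {corrected_ratio:.3f}")
```

The β₂ column approaches 1 far more slowly than the β₁ column, which is the reason the second-moment correction stays relevant for much longer.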

Numerical Walkthrough (3 Steps for w₁)

Starting: m=0, v=0, w=0. Gradient sequence g₁=0.5, g₂=0.4, g₃=0.3.

t=1, g=0.5:

m₁ = 0.9×0 + 0.1×0.5 = 0.0500

v₁ = 0.999×0 + 0.001×0.5² = 0.001×0.25 = 0.000250

m̂₁ = 0.0500 / (1 − 0.9¹) = 0.0500 / 0.100 = 0.5000

v̂₁ = 0.000250 / (1 − 0.999¹) = 0.000250 / 0.001 = 0.2500

w₁ = 0 − 0.001 × 0.5000 / (√0.2500 + 1e-8) = 0 − 0.001 × 0.5000 / 0.5000 = −0.001000

t=2, g=0.4:

m₂ = 0.9×0.0500 + 0.1×0.4 = 0.0450 + 0.040 = 0.0850

v₂ = 0.999×0.000250 + 0.001×0.4² = 0.00024975 + 0.00016000 = 0.00040975 ≈ 0.000410

m̂₂ = 0.0850 / (1 − 0.9²) = 0.0850 / (1 − 0.81) = 0.0850 / 0.19 = 0.4474

v̂₂ = 0.00040975 / (1 − 0.999²) = 0.00040975 / 0.001999 = 0.2050

w₁ = −0.001000 − 0.001 × 0.4474 / (√0.2050 + 1e-8) = −0.001000 − 0.001 × 0.4474 / 0.4527 = −0.001000 − 0.000988 = −0.001988

t=3, g=0.3:

m₃ = 0.9×0.0850 + 0.1×0.3 = 0.0765 + 0.0300 = 0.1065

v₃ = 0.999×0.00040975 + 0.001×0.3² = 0.00040934 + 0.00009000 = 0.00049934 ≈ 0.000499

m̂₃ = 0.1065 / (1 − 0.9³) = 0.1065 / (1 − 0.729) = 0.1065 / 0.271 = 0.3930

v̂₃ = 0.00049934 / (1 − 0.999³) = 0.00049934 / 0.002997 = 0.1666

w₁ = −0.001988 − 0.001 × 0.3930 / (√0.1666 + 1e-8) = −0.001988 − 0.001 × 0.3930 / 0.4082 = −0.001988 − 0.000963 = −0.002951

Trace Table

t	g	m_t	v_t	m̂_t	v̂_t	w_t
1	0.5	0.0500	0.000250	0.5000	0.2500	−0.001000
2	0.4	0.0850	0.000410	0.4474	0.2050	−0.001988
3	0.3	0.1065	0.000499	0.3930	0.1666	−0.002951

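Each step moves by roughly η = 0.001: the three updates are 0.001000, 0.000988, and 0.000963. This is by design. When successive gradients agree in sign and size, the ratio m̂/√v̂ is close to ±1, so Adam moves by approximately η per step regardless of the gradient's scale. The ratio drops below 1 when gradients shrink or fluctuate, as the slightly smaller second and third steps show.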
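The walkthrough can be checked mechanically. The loop below is a small sketch that replays the same three gradients with the anchor hyperparameters and prints one row per step; the variable names and the `rows` list are illustrative.

```python
import math

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
rows = []  # (t, g, m, v, m_hat, v_hat, w) for each step

for t, g in enumerate([0.5, 0.4, 0.3], start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    rows.append((t, g, m, v, m_hat, v_hat, w))
    print(f"t={t} g={g} m={m:.4f} v={v:.6f} m_hat={m_hat:.4f} v_hat={v_hat:.4f} w={w:.6f}")
```

Because the loop keeps full floating-point precision between steps, the last digit of a printed value can differ from a hand calculation that rounds intermediate results.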

Why Adam is the Default

Adam often converges faster than plain SGD, momentum, or RMSProp because it combines their advantages:

Per-parameter lr (from RMSProp): parameters with large gradients get proportionally smaller steps; parameters with small or sparse gradients get proportionally larger steps
Momentum (from first moment): consistent directions accumulate; oscillating directions cancel
Bias correction: the first updates are correctly scaled (about η) instead of being distorted by the zero initialization

Adam is robust to noisy gradients, sparse features, and non-stationary objectives, and it tolerates a fairly wide range of learning rates (η=0.001 is a reasonable starting point on many tasks).
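The insensitivity to gradient scale is easy to check numerically. In this sketch, the same gradient sequence is multiplied by 1, 10, and 1000 and fed to a scalar Adam and to plain SGD with the same η; `run_adam` and the gradient values are illustrative assumptions, not part of any library.

```python
import math

def run_adam(grads, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply scalar Adam to a fixed gradient sequence; returns the final parameter."""
    w = m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        w -= eta * (m / (1 - beta1 ** t)) / (math.sqrt(v / (1 - beta2 ** t)) + eps)
    return w

base = [0.5, 0.4, 0.3]
for scale in (1.0, 10.0, 1000.0):
    w_adam = run_adam([scale * g for g in base])
    w_sgd = -0.001 * sum(scale * g for g in base)  # plain SGD with the same eta
    print(f"scale={scale:7.1f}  Adam w={w_adam:.6f}  SGD w={w_sgd:.6f}")
```

The Adam column should stay essentially unchanged across scales (only ε breaks exact invariance), while the SGD column grows in proportion to the scale.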

When Adam Might Not Be Best

Generalization gap on image classification. Models trained with SGD + momentum sometimes generalize better than Adam on tasks like CIFAR-10 and ImageNet. A commonly proposed explanation parallels the large-batch SGD problem: Adam's adaptive lr may steer it toward sharper minima that fit the training data closely, whereas SGD + momentum, with a single global lr and noisier gradient estimates, tends to find wider, flatter minima that generalize better.

Practical rule: Use Adam for NLP, RNNs, transformers, and when you don't want to tune η carefully. Use SGD + momentum for CNNs on image classification when you have time to tune the lr schedule (this can give roughly 1–2% better test accuracy than Adam).

Hyperparameter Guide

Parameter	Default	When to change	Notes
η	0.001	Most important to tune	Start here; try 3e-4, 1e-3, 3e-3
β₁	0.9	Rarely	β₁=0 gives bias-corrected RMSProp
β₂	0.999	Rarely	0.99 for noisy objectives (RL)
ε	1e-8	If NaN	1e-4 if numerical instability
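The η row is the one most worth experimenting with. As a rough sketch, the loop below runs a scalar Adam on the toy objective f(θ) = θ² for the three suggested learning rates; the helper `adam_minimize`, the 200-step budget, and the starting point θ=1 are arbitrary choices made for illustration.

```python
import math

def adam_minimize(grad_fn, theta, eta, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam for a fixed number of steps and return the final theta."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        theta -= eta * (m / (1 - beta1 ** t)) / (math.sqrt(v / (1 - beta2 ** t)) + eps)
    return theta

results = {}
for eta in (3e-4, 1e-3, 3e-3):
    # Gradient of f(theta) = theta**2 is 2*theta
    results[eta] = adam_minimize(lambda th: 2 * th, theta=1.0, eta=eta, steps=200)
    print(f"eta={eta:g}: theta after 200 steps = {results[eta]:.4f}")
```

On a real model the comparison would use validation loss rather than distance to a known minimum, but the structure of the sweep is the same.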

Code

```python
import numpy as np

def adam(grad_fn, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5):
    """Run `steps` Adam updates from theta0 and return the final parameters."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)  # first moment (EMA of gradients)
    v = np.zeros_like(theta)  # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):  # t starts at 1 so the divisors are nonzero
        g     = grad_fn(theta)
        m     = beta1 * m + (1 - beta1) * g
        v     = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
        print(f"t={t}: g={g.round(4)}, m={m.round(4)}, v={v.round(6)}, θ={theta.round(4)}")
    return theta

# Minimize f(θ) = θ², whose true minimum is at θ=0
def grad_fn(theta):
    return 2 * theta

print("Minimizing f(θ) = θ²:")
adam(grad_fn, [1.0])
```

```text
Minimizing f(θ) = θ²:
t=1: g=[2.], m=[0.2], v=[0.004], θ=[0.999]
t=2: g=[1.998], m=[0.3798], v=[0.007988], θ=[0.998]
t=3: g=[1.996], m=[0.5414], v=[0.011964], θ=[0.997]
t=4: g=[1.994], m=[0.6867], v=[0.015928], θ=[0.996]
t=5: g=[1.992], m=[0.8172], v=[0.01988], θ=[0.995]
```

Each step moves θ by about 0.001 (= η). The gradient is 2θ ≈ 2.0 throughout these first steps; Adam normalizes it to roughly unit magnitude and takes a step of size η. The steps shrink only once recent gradients become small relative to the history stored in v, or start changing sign so that m̂ cancels. With η=0.001 that takes on the order of a thousand steps from θ=1, and Adam may overshoot slightly before it settles near the minimum.

Related Concepts

Where this builds from: Momentum (04) provides the first moment m_t, a smoothed gradient. RMSProp (06) provides the second moment v_t, an adaptive denominator. Adam wraps both with bias correction. Understanding each piece separately makes Adam's equations fully derivable rather than memorizable.

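Where this leads: AdamW (Adam + decoupled weight decay) is the modern default for transformer training. With L2 regularization in Adam, the term λ·θ is added to the gradient before moment estimation, so the decay is divided by √v̂ along with everything else: parameters with large gradients are regularized less than parameters with small ones. AdamW applies weight decay directly to the parameters, outside the adaptive scaling, so every parameter decays at the same relative rate. AdamW is widely used for training GPT-style models.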
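The difference between the two decay schemes can be sketched with a simplified calculation. The code below assumes a constant loss gradient g, so that after bias correction m̂ = g + λθ and v̂ = (g + λθ)² for Adam with L2; that steady-state assumption, the value λ = 0.1, and the function names are illustrative simplifications rather than a full optimizer.

```python
import math

eta, lam, eps = 0.001, 0.1, 1e-8
theta = 1.0

def decay_share_adam_l2(g, theta):
    """Part of a steady-state Adam+L2 step that comes from the lam*theta term.

    Assumes a constant loss gradient g, so m_hat = g + lam*theta and
    v_hat = (g + lam*theta)**2 after bias correction.
    """
    g_total = g + lam * theta
    return eta * (lam * theta) / (math.sqrt(g_total ** 2) + eps)

def decay_share_adamw(theta):
    """AdamW applies decay directly to the parameter: eta * lam * theta."""
    return eta * lam * theta

for g in (0.01, 10.0):
    print(f"g={g:5.2f}  Adam+L2 decay part={decay_share_adam_l2(g, theta):.2e}  "
          f"AdamW decay part={decay_share_adamw(theta):.2e}")
```

Under these assumptions the Adam+L2 decay contribution depends strongly on the size of the loss gradient, while the AdamW contribution depends only on θ.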

Honest Limitations

Convergence to sharp minima. Adam's adaptive per-parameter lr allows it to fit the training set very efficiently — sometimes too efficiently. For tasks where generalization matters more than train loss (most real tasks), SGD + momentum with a tuned learning rate schedule often produces better final test performance. This has been documented on CIFAR-10/100 and ImageNet since 2017.

ε matters in practice. The default ε=1e-8 can cause problems when v̂ is very small (parameters with near-zero squared gradients): the denominator √v̂ + ε is then tiny, so even negligible or noisy gradients on dormant parameters are scaled up to steps of roughly size η, and in low-precision arithmetic 1e-8 may underflow. A larger value such as ε=1e-4 is a common remedy for models with many small gradients (e.g., transformers with gradient clipping).
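A quick way to see what ε does is to look at the size of the very first update, where m̂ = g and v̂ = g² exactly. The sketch below tabulates that step for a normal, a small, and a tiny gradient under two ε values; `first_step_size` is a hypothetical helper written for this illustration.

```python
import math

def first_step_size(g, eps, eta=0.001):
    """Magnitude of Adam's first update for gradient g (m_hat = g, v_hat = g**2)."""
    return eta * abs(g) / (math.sqrt(g * g) + eps)

for g in (1e-1, 1e-6, 1e-10):
    for eps in (1e-8, 1e-4):
        print(f"g={g:.0e}  eps={eps:.0e}  step={first_step_size(g, eps):.3e}")
```

With the default ε, a gradient of 1e-6 still produces a step close to η; with ε=1e-4 that step is damped by about two orders of magnitude, while the step for g=0.1 is barely affected.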

Adam is not theoretically proven to converge in general. The original Adam paper gave a convergence proof for convex objectives, but that proof was later shown to be flawed, and there are simple convex counterexamples on which Adam fails to converge (AMSGrad was proposed to fix this by using the running maximum of past second-moment estimates in the denominator). In practice, Adam converges reliably on neural networks, but its theoretical guarantees are weaker than SGD's.
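For reference, the AMSGrad idea fits in a few lines. This is a sketch of one common formulation without bias correction; library implementations differ in such details, and the name `amsgrad_step` and the toy gradient sequence are assumptions made for illustration.

```python
import math

def amsgrad_step(theta, g, m, v, v_max, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update for a scalar (sketch without bias correction)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_max = max(v_max, v)  # the denominator can only grow, never shrink
    theta = theta - eta * m / (math.sqrt(v_max) + eps)
    return theta, m, v, v_max

theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
history = []  # (v, v_max) after each step
for g in [1.0, 0.0, 0.0, 0.0]:  # one large gradient, then none
    theta, m, v, v_max = amsgrad_step(theta, g, m, v, v_max)
    history.append((v, v_max))
    print(f"g={g}  v={v:.6f}  v_max={v_max:.6f}  theta={theta:.6f}")
```

After the single large gradient, `v` decays step by step while `v_max` holds its peak, so the effective per-parameter learning rate cannot increase again.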

Test Your Understanding

At t=1 with g=0.5, bias-corrected m̂₁=0.5 and v̂₁=0.25. The step is η × m̂₁/√v̂₁ = 0.001 × 0.5/0.5 = 0.001. Now suppose the gradient is 10× larger: g=5.0. Compute m̂₁ and v̂₁ with this new gradient. What is the step size? Does Adam step 10× larger for a 10× larger gradient? Why or why not?
The bias correction divisor at t=1 for β₁=0.9 is (1−0.9) = 0.1, giving m̂ = m/0.1 = 10m. For β₂=0.999, the divisor is (1−0.999) = 0.001, giving v̂ = v/0.001 = 1000v. So at t=1, v is amplified 1000× while m is amplified only 10×. Why is this important for the first step's stability?
Adam's step equation is θ ← θ − η·m̂/√v̂. Suppose you set β₁=0 (no momentum) and β₂=0 (no second moment history). Show what m̂ and v̂ reduce to at each step t. What optimizer does Adam reduce to in this case?
The "generalization gap" between Adam and SGD+momentum on ImageNet suggests Adam converges to sharper minima. Assuming this is caused by Adam's per-parameter lr (not its momentum), propose an experiment that would verify this: what specific change to Adam's hyperparameters would make it more likely to find wider minima?
AdamW decouples weight decay from the gradient update: θ ← θ − η·(m̂/√v̂) − η·λ·θ. Standard Adam applies L2 regularization by adding λ·θ to the gradient g before computing moments. Explain why these two approaches are different. For a parameter with a large v̂ (large accumulated squared gradient), how does the weight decay contribution relative to the gradient-driven update differ between Adam-L2 and AdamW?
