~/blog

Adagrad

Jul 1, 20268 min readBy Mohammed Vasim
deep-learningneural-networksmachine-learningrepresentation-learning

All previous optimizers — batch GD, SGD, mini-batch SGD, momentum — use the same learning rate η for every parameter. This is a bad assumption when different parameters appear at very different frequencies. The word embedding for "the" updates on every training sentence. The embedding for "photosynthesis" updates on every few thousand sentences. With a fixed η, "the" will overshoot (it sees enough gradients that a large step causes oscillation), and "photosynthesis" will underlearn (it sees too few gradients to make meaningful progress before convergence).

Adagrad fixes this by tracking how much each parameter has been updated, then scaling the learning rate inversely.

Anchor: two-parameter model — w₁ (years experience, activated every step) and w₂ (has_certification, activated 1 in 5 steps, mostly 0).


The Problem Adagrad Solves

SGD update: w ← w − η · g — same η for every parameter.

For w₂ (sparse, rarely sees a gradient):

  • When g₂=0: no update, no problem
  • When g₂≠0: it needs to learn fast from this rare opportunity — a small η means the weight barely moves

For w₁ (dense, large gradient every step):

  • Accumulated many large gradients → tends to overshoot → needs a smaller lr

Adagrad tracks G_t = accumulated sum of squared gradients per parameter and scales the lr accordingly.


Update Rule

G_t ← G_t + g_t² (accumulate squared gradients, per parameter, over all time steps)

θ ← θ − (η / √(G_t + ε)) · g_t

where ε = 1e-8 (avoid division by zero when G=0).

The effective learning rate for parameter θ at step t is:

effective_lr = η / √G_t

As G_t grows (the parameter has received many large gradients), effective_lr shrinks. Parameters with small accumulated gradients retain a larger effective lr.

Adagrad Gradient Flow (Step 2) g₁=0.4, g₂=0.3 (current gradients) G update G₁ += 0.4² → 0.41 G₂ += 0.3² → 0.09 (accumulated sum) Effective lr + update lr₁ = 0.1/√0.41 = 0.156 lr₂ = 0.1/√0.09 = 0.333 w₂ gets 2× larger lr Sparse w₂ (small G₂) → larger effective lr. Dense w₁ (large G₁) → smaller effective lr.

Numerical Walkthrough (3 Steps, 2 Parameters)

Initial: w₁=0, w₂=0, G₁=0, G₂=0, η=0.1.

Step 1: g₁=0.5, g₂=0.0 (w₂ inactive — no certification data in this sample)

G₁ = 0 + 0.5² = 0.250, G₂ = 0 + 0.0² = 0.0

lr₁ = 0.1 / √(0.250 + 1e-8) = 0.1 / 0.5 = 0.200

lr₂ = 0.1 / √(0.0 + 1e-8) = 0.1 / 0.0001 ≈ 3162 (conceptually huge — but g₂=0 so update is 0)

w₁ = 0 − 0.200 × 0.5 = −0.100, w₂ = 0 − 3162 × 0.0 = 0.000

Step 2: g₁=0.4, g₂=0.3 (w₂ now activated — sample has certification)

G₁ = 0.250 + 0.4² = 0.250 + 0.160 = 0.410

G₂ = 0.000 + 0.3² = 0.090

lr₁ = 0.1 / √0.410 = 0.1 / 0.640 = 0.156

lr₂ = 0.1 / √0.090 = 0.1 / 0.300 = 0.333

w₁ = −0.100 − 0.156 × 0.4 = −0.100 − 0.062 = −0.162

w₂ = 0.000 − 0.333 × 0.3 = −0.100

w₂ received a larger effective lr (0.333 vs 0.156) even though it just appeared for the first time. This is the adaptive lr in action: w₂ has seen fewer total gradients (G₂ is smaller) so it gets a larger step.

Step 3: g₁=0.3, g₂=0.0 (w₂ inactive again)

G₁ = 0.410 + 0.3² = 0.410 + 0.090 = 0.500

lr₁ = 0.1 / √0.500 = 0.1 / 0.707 = 0.141

Compare Step 1 lr₁=0.200 → Step 2 lr₁=0.156 → Step 3 lr₁=0.141. The effective lr for w₁ is monotonically shrinking as G₁ accumulates. This is the learning rate decay problem.


Trace Table

Stepg₁g₂G₁G₂eff_lr₁eff_lr₂w₁w₂
10.50.00.2500.0000.200~3162−0.1000.000
20.40.30.4100.0900.1560.333−0.162−0.100
30.30.00.5000.0900.1410.333−0.204−0.100

At Step 100 with constant gradient g₁=0.5 per step: G₁ = 100×0.25 = 25.0, lr₁ = 0.1/√25 = 0.02. At Step 1000: lr₁ = 0.1/√(1000×0.25) = 0.1/15.8 ≈ 0.006. Learning rate decays toward zero.


The Learning Rate Decay Problem

G accumulates forever. It only grows; it never shrinks. After enough training steps:

η / √G_t → 0 for all parameters

The model stops learning. This is fine for convex optimization (you want to slow down as you approach the minimum), but disastrous for non-convex neural networks where the loss surface changes character throughout training. A model trained for 100,000 steps with Adagrad will have a near-zero effective lr and will have essentially frozen its weights long before convergence.

This weakness leads directly to RMSProp (next post), which replaces the running sum G with an exponentially weighted moving average — old gradients fade out, keeping the effective lr stable.


When Adagrad Works Well

NLP with sparse word embeddings: The vocabulary has 50,000 words. Most words appear in a small fraction of training sentences. Their embeddings are updated rarely and need aggressive learning rates when they do appear. Adagrad gives exactly this — a large effective lr for infrequently updated parameters.

Convex optimization: Adagrad's learning rate decay is not a bug in convex settings — it is a feature. As you approach the convex minimum, the gradient decreases and the accumulated G causes the lr to shrink appropriately. The Adagrad paper (Duchi et al. 2011) proved optimal convergence rates for convex problems.

Classical recommendation systems: Collaborative filtering with categorical features (user IDs, item IDs) has the same sparsity structure as word embeddings — most user-item pairs are never seen in training.


Code

python
import numpy as np

gradients = [(0.5, 0.0), (0.4, 0.3), (0.3, 0.0), (0.5, 0.2), (0.4, 0.0)]
w   = np.array([0.0, 0.0])
G   = np.array([0.0, 0.0])
eta, eps = 0.1, 1e-8

print(f"{'Step':>4} | {'g1':>5} {'g2':>5} | {'G1':>7} {'G2':>7} | {'lr1':>6} {'lr2':>6} | {'w1':>7} {'w2':>7}")
print("-" * 72)

for i, (g1, g2) in enumerate(gradients):
    g       = np.array([g1, g2])
    G      += g ** 2
    eff_lr  = eta / (np.sqrt(G) + eps)
    w      -= eff_lr * g
    print(f"{i+1:>4} | {g1:>5.2f} {g2:>5.2f} | {G[0]:>7.4f} {G[1]:>7.4f} | {eff_lr[0]:>6.4f} {eff_lr[1]:>6.4f} | {w[0]:>7.4f} {w[1]:>7.4f}")
text
Step |    g1    g2 |      G1      G2 |    lr1    lr2 |      w1      w2
------------------------------------------------------------------------
   1 |  0.50  0.00 |  0.2500  0.0000 | 0.2000 3162.3 | -0.1000  0.0000
   2 |  0.40  0.30 |  0.4100  0.0900 | 0.1562 0.3333 | -0.1625 -0.1000
   3 |  0.30  0.00 |  0.5000  0.0900 | 0.1414 0.3333 | -0.2049 -0.1000
   4 |  0.50  0.20 |  0.7500  0.1300 | 0.1155 0.2774 | -0.2626 -0.1555
   5 |  0.40  0.00 |  0.9100  0.1300 | 0.1048 0.2774 | -0.3045 -0.1555

Step 1's lr₂ value of 3162 is clipped by the 1e-8 ε term — the effective lr is very large, but g₂=0 so no update occurs. Step 2's lr₂=0.333 is much more reasonable. Step 4: w₂ finally updates again when g₂=0.2 appears.


Where this builds from: Gradient descent (01) uses a fixed lr for all parameters. Adagrad was the first optimizer to adapt the lr per parameter based on gradient history. The squared gradient accumulation G is the key innovation.

Where this leads: RMSProp (06) fixes Adagrad's G-grows-forever problem by replacing the running sum with an exponentially weighted moving average. Adam (07) combines RMSProp's adaptive lr (second moment v_t) with momentum's gradient smoothing (first moment m_t).


Honest Limitations

lr decays to zero for all parameters given enough training steps. With more than ~100,000 gradient updates, Adagrad's effective lr becomes negligible for frequently updated parameters. Long training runs require learning rate warmup and cooldown schedules on top of Adagrad to prevent this.

Initial effective lr is unbounded for parameters with G=0. At Step 1, G₂=0 (before any gradient for w₂). The ε=1e-8 term gives lr₂ ≈ η/1e-4 = 1000×η — enormous. This is by design (the parameter needs to be able to make a large first update), but it requires that g₂ actually be small when this happens. If a sparse feature has a large first gradient, the initial update can overshoot wildly.

Requires η to be set. Even though Adagrad adapts the lr per parameter, the global η still matters. Too large a global η (e.g., η=1.0) can cause divergence on the first step for dense features, before G has accumulated enough to shrink the effective lr.


Test Your Understanding

  1. At Step 1, w₁ updates from 0 to −0.100 using g₁=0.5 and effective lr₁=0.200. With vanilla SGD (η=0.1), the update would be 0 − 0.1×0.5 = −0.050. Why does Adagrad give a 2× larger first step? What is the effective lr at Step 1 for any parameter whose G was previously 0?

  2. The G accumulation formula G_t ← G_t + g_t² shows that G only increases. Propose a modification to G that would allow it to decrease (giving the optimizer a way to "forget" old gradients). What property of gradients would this modification need to preserve to still guarantee that sparse parameters get higher lr?

  3. At Step 100, suppose G₁ = 25.0 (constant gradient of 0.5 every step). Compute the effective lr for w₁. At Step 1000 (G₁ = 250.0), compute it again. Between Step 100 and 1000, the model has done 900 additional steps but barely moved w₁. Is this convergence or failure? How would you distinguish the two from the training loss curve?

  4. Adagrad is recommended for NLP sparse embeddings. In a vocabulary of 10,000 words with embedding dimension 128, the embedding matrix has 1.28M parameters. Most rows (word embeddings) see gradients on <1% of training steps. Describe how Adagrad allocates its G storage and why this is memory-efficient compared to maintaining a full second-moment matrix (as required by L-BFGS).

  5. You are training a recommendation system and notice that after 50,000 steps, the loss has plateaued but you know the model could still improve. The optimizer is Adagrad with η=0.01. Without changing the loss function or model architecture, propose two interventions that would allow the model to continue learning.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment