~/blog

Stochastic Gradient Descent (SGD)

Jul 1, 20269 min readBy Mohammed Vasim
deep-learningneural-networksmachine-learningrepresentation-learning

Batch gradient descent computes the exact gradient — but it requires processing all n training samples before taking a single weight update. With 1 million samples, that means 1 million forward and backward passes to move the weights once. SGD makes a completely different trade: use one sample, compute an approximation of the gradient, update immediately. Then move to the next sample.

The gradient estimate from one sample is noisy. That noise turns out to be useful, not just a problem to tolerate.

Same anchor: salary from experience, normalized.

python
X_n = [-1.414, -0.707, 0.000, 0.707, 1.414]  # normalized years experience
y_n = [-1.414, -0.707, 0.000, 0.707, 1.414]  # normalized salary

Start: w=0, b=0, η=0.1.


What Changes from Batch GD

Batch GDSGD
Samples per updateAll n1
Updates per epoch1n
GradientExact: ∂J/∂wApproximate: ∂L⁽ⁱ⁾/∂w
Computation per stepExpensiveCheap

For 5 samples, batch GD takes 1 update per epoch; SGD takes 5. For 1 million samples, the ratio is 1:1,000,000 per epoch.

The SGD update uses the single-sample loss gradient, not the full cost gradient:

∂L⁽ⁱ⁾/∂w = 2(ŷᵢ − yᵢ) · xᵢ (no 1/n averaging — just one sample)

w ← w − η · ∂L⁽ⁱ⁾/∂w


Epoch 1 — All 5 Update Steps

Starting with w=0, b=0. Each sample updates the weights before the next sample is processed.

Step 1 — Sample 1 (x₁=−1.414, y₁=−1.414): ŷ₁ = 0×(−1.414) + 0 = 0.000 L₁ = (0 − (−1.414))² = 2.000 ∂L/∂w = 2(0 − (−1.414))(−1.414) = 2 × 1.414 × (−1.414) = −4.000 ∂L/∂b = 2(0 − (−1.414)) = 2.828 w = 0 − 0.1 × (−4.000) = 0.400, b = 0 − 0.1 × 2.828 = −0.283

Step 2 — Sample 2 (x₂=−0.707, y₂=−0.707): ŷ₂ = 0.400×(−0.707) + (−0.283) = −0.283 − 0.283 = −0.566 L₂ = (−0.566 − (−0.707))² = (0.141)² = 0.020 ∂L/∂w = 2 × (−0.566 − (−0.707)) × (−0.707) = 2 × 0.141 × (−0.707) = −0.199 ∂L/∂b = 2 × 0.141 = 0.282 w = 0.400 − 0.1 × (−0.199) = 0.420, b = −0.283 − 0.1 × 0.282 = −0.311

Step 3 — Sample 3 (x₃=0.000, y₃=0.000): ŷ₃ = 0.420×0 + (−0.311) = −0.311 L₃ = (−0.311 − 0)² = 0.097 ∂L/∂w = 2 × (−0.311) × 0 = 0.000 ∂L/∂b = 2 × (−0.311) = −0.622 w = 0.420 (unchanged), b = −0.311 − 0.1 × (−0.622) = −0.249

Step 4 — Sample 4 (x₄=0.707, y₄=0.707): ŷ₄ = 0.420×0.707 + (−0.249) = 0.297 − 0.249 = 0.048 L₄ = (0.048 − 0.707)² = 0.434 ∂L/∂w = 2 × (0.048 − 0.707) × 0.707 = 2 × (−0.659) × 0.707 = −0.932 ∂L/∂b = 2 × (−0.659) = −1.318 w = 0.420 − 0.1 × (−0.932) = 0.513, b = −0.249 − 0.1 × (−1.318) = −0.117

Step 5 — Sample 5 (x₅=1.414, y₅=1.414): ŷ₅ = 0.513×1.414 + (−0.117) = 0.725 − 0.117 = 0.608 L₅ = (0.608 − 1.414)² = 0.650 ∂L/∂w = 2 × (0.608 − 1.414) × 1.414 = 2 × (−0.806) × 1.414 = −2.281 ∂L/∂b = 2 × (−0.806) = −1.612 w = 0.513 − 0.1 × (−2.281) = 0.741, b = −0.117 − 0.1 × (−1.612) = 0.044

After one epoch (5 updates): w = 0.741, b ≈ 0. Compare to batch GD after one epoch: w = 0.200. SGD is much further along the loss surface after the same number of samples because it takes 5 update steps instead of 1.

SGD Gradient Flow — Step 1 (Sample 1) L⁽¹⁾ = 2.000 (single sample) ∂L/∂w = 2(ŷ−y)·x = 2×(1.414)×(−1.414) = −4.0 w ← 0 − 0.1×(−4) = 0.4 b ← 0 − 0.1×2.828 = −0.283 No 1/n averaging — single-sample gradient is noisier but computed instantly. Next sample immediately uses the updated w=0.4, b=−0.283.

Trace Table

StepSamplex_normy_normŷLw_newb_new
11−1.414−1.4140.0002.0000.400−0.283
22−0.707−0.707−0.5660.0200.420−0.311
330.0000.000−0.3110.0970.420−0.249
440.7070.7070.0480.4340.513−0.117
551.4141.4140.6080.6500.7410.044

Why SGD is Noisy (and Why That's Useful)

The gradient from one sample is an estimator of the true gradient ∂J/∂w. It is unbiased (correct in expectation) but has high variance. This means the loss curve bounces rather than descending smoothly.

Batch GD (smooth) vs SGD (noisy) — Same Training Problem Epoch (or update step) Cost J Batch GD SGD (noisy but faster) SGD oscillates — each update uses an estimate, not the exact gradient. But it converges faster in wall-clock time.

The noise is not purely bad. It helps for two reasons:

1. Escapes saddle points. In deep neural networks, the main obstacle is not local minima but saddle points — points where some directions are uphill and others are downhill. The exact gradient at a saddle point is zero, so batch GD stops. A noisy gradient from one sample has a random component that kicks the optimizer off the saddle.

2. Finds wider minima. Wide minima (shallow, broad valleys) generalize better than sharp minima (narrow, deep peaks). Noisy gradient updates tend to overshoot sharp minima and settle into broader ones. Batch GD finds the nearest minimum — which may be sharp.


Online Learning Property

SGD can process samples as they arrive — no need to store or see all data before updating. This enables:

  • Fraud detection: each new transaction triggers a gradient update, keeping the model current
  • Recommendation systems: user interactions immediately adjust the embedding weights
  • Very large datasets: the model trains on a stream, not a stored collection

Learning Rate Sensitivity

SGD is more sensitive to η than batch GD because each gradient estimate has higher variance. A large η combined with a noisy gradient is more likely to overshoot.

  • η = 0.01: stable but slow — the variance of single-sample gradients is under control
  • η = 0.1: good convergence for this dataset
  • η = 0.5: often diverges within 2–3 epochs, especially when a single sample has an extreme gradient

Code

python
import numpy as np

X_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])
y_n = np.array([-1.414, -0.707, 0.000, 0.707, 1.414])

w, b, lr = 0.0, 0.0, 0.1

print(f"{'Step':>4} | {'x':>6} | {'y':>6} | {'ŷ':>6} | {'L':>6} | {'w':>6} | {'b':>7}")
print("-" * 60)

for i, (xi, yi) in enumerate(zip(X_n, y_n)):
    yhat = w * xi + b
    loss = (yhat - yi) ** 2
    dw   = 2 * (yhat - yi) * xi
    db   = 2 * (yhat - yi)
    w -= lr * dw
    b -= lr * db
    print(f"{i+1:>4} | {xi:>6.3f} | {yi:>6.3f} | {yhat:>6.3f} | {loss:>6.4f} | {w:>6.4f} | {b:>7.4f}")
text
Step |      x |      y |      ŷ |      L |      w |       b
------------------------------------------------------------
   1 | -1.414 | -1.414 |  0.000 | 2.0000 | 0.4000 | -0.2828
   2 | -0.707 | -0.707 | -0.566 | 0.0200 | 0.4200 | -0.3110
   3 |  0.000 |  0.000 | -0.311 | 0.0967 | 0.4200 | -0.2486
   4 |  0.707 |  0.707 |  0.048 | 0.4330 | 0.5132 | -0.1168
   5 |  1.414 |  1.414 |  0.608 | 0.6497 | 0.7414 |  0.0444

Where this builds from: Batch gradient descent (01) defined the update rule using the full cost gradient ∂J/∂w. SGD replaces that with the single-sample gradient ∂L⁽ⁱ⁾/∂w — same formula, different denominator.

Where this leads: Mini-batch SGD (next post) is the practical middle ground between batch GD and SGD. Momentum (04) smooths SGD's noisy gradient by accumulating a velocity over past gradients — it directly addresses the oscillation visible in the loss curve above.


Honest Limitations

SGD requires a decreasing learning rate schedule to converge exactly. The theoretical guarantee for SGD to converge to a stationary point requires η to decrease over time (Σηₜ = ∞ and Σηₜ² < ∞). In practice, constant η is used with early stopping — the model is stopped before it can diverge — but this means SGD doesn't strictly converge to the minimum.

The order of samples matters. If samples are sorted by label value (all class 0 first, then all class 1), the gradients in early steps are biased toward class 0 and the model will oscillate. Shuffling before each epoch removes this bias. This is why the PyTorch DataLoader shuffles by default.

Online learning prevents batched GPU computation. Processing one sample at a time does not utilize the parallelism of modern GPUs, which are optimized for matrix operations over batches. Pure SGD is slower than mini-batch SGD on GPU hardware even if it takes fewer epochs.


Test Your Understanding

  1. In Step 1, the gradient ∂L/∂w = −4.000 for a single sample is twice as large as the full-batch gradient ∂J/∂w = −2.000. Explain why the single-sample gradient is larger in this case. Would this always be true? Give a counterexample where a single-sample gradient is smaller than the batch gradient.

  2. After Step 3, w=0.420 is unchanged from Step 2. This is because sample 3 has x₃=0. Mathematically, show why a sample at x=0 can never update w (regardless of ŷ and y values). What does this imply about the informativeness of zero-feature samples for weight learning?

  3. SGD processes 5 samples per epoch, taking 5 weight updates. Batch GD processes 5 samples per epoch, taking 1 weight update. After 10 epochs, SGD has taken 50 weight updates; batch GD has taken 10. If each update costs ε time regardless of batch size, which algorithm reaches w=0.9 first, and why?

  4. The "escaping saddle points" argument for SGD noise assumes the noise is zero-mean. Verify this for our dataset: what is the mean of the 5 single-sample gradients (∂L⁽ⁱ⁾/∂w) computed at w=0, b=0? Is the mean equal to the batch gradient ∂J/∂w = −2.000?

  5. A fraud detection system uses SGD to update a model on each incoming transaction. The model starts with good weights from a batch training run. After 10,000 transactions (all legitimate), the model has never seen a fraud example. What happens to the fraud-detection weights, and why? How would you redesign the update rule to prevent this?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment