Mini-Batch SGD

Jul 1, 2026 · 8 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

Batch GD: one update per epoch, exact gradient, slow. SGD: n updates per epoch, noisy gradient, fast but volatile. Mini-batch SGD: k updates per epoch, approximate gradient, GPU-efficient, practical default. Almost every production model is trained with mini-batch SGD or a variant built on top of it.

The batch size k is the one hyperparameter this post is entirely about.

Anchor: 10 samples of salary data.

python
import numpy as np

X = np.random.randn(10)                   # normalized experience
y = 2 * X + 0.1 * np.random.randn(10)     # salary = 2 * experience + small noise
batch_size = 2                            # 10/2 = 5 updates per epoch

The Idea

Mini-batch SGD: split the training data into batches of size k. For each batch, compute the average gradient over those k samples, update the weights, move to the next batch.

  • k = 1 → SGD (single sample)
  • k = n → Batch GD (all samples)
  • 1 < k < n → Mini-batch SGD (k between 32 and 256 is the usual practical range)

One epoch = n/k weight updates (rounded up if k does not divide n and the short last batch is kept). With 10 samples and k=2: 5 updates per epoch.
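As a quick check on the update count, here is a small sketch. The helper name `updates_per_epoch` and its `drop_last` flag are illustrative assumptions, not part of any library; the flag only matters when k does not divide n.

```python
import math

def updates_per_epoch(n, k, drop_last=False):
    """Number of weight updates in one epoch (hypothetical helper)."""
    if drop_last:
        return n // k          # discard the incomplete final batch
    return math.ceil(n / k)    # keep it as a smaller final batch

print(updates_per_epoch(10, 2))    # the anchor example: 10 samples, k=2
print(updates_per_epoch(10, 1))    # k=1 is plain SGD
print(updates_per_epoch(10, 10))   # k=n is batch GD
```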

The gradient for a mini-batch is the average over the k samples in that batch:

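∂J_batch/∂w = (1/k) Σ_{i∈B} (ŷᵢ − yᵢ) · xᵢ

Here B is the set of k sample indices in the batch. This form assumes the loss carries a ½ factor, J_batch = (1/2k) Σ_{i∈B} (ŷᵢ − yᵢ)², so the ½ cancels the 2 that comes from differentiating the square.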

This is more accurate than the single-sample SGD gradient but still faster than waiting for all n samples.


First Two Mini-Batch Updates

With w=0, b=0, lr=0.1, batch size=2:

Batch 1 (samples 0 and 1):

Let x₀=−1.2, y₀=−2.4, x₁=0.5, y₁=1.0 (round illustrative values, not the exact draws the seeded code produces):

ŷ₀ = 0×(−1.2) + 0 = 0; ŷ₁ = 0×0.5 + 0 = 0

Errors: (0−(−2.4)) = 2.4; (0−1.0) = −1.0

∂J/∂w = (1/2)[2.4×(−1.2) + (−1.0)×0.5] = (1/2)[−2.88 − 0.5] = (1/2)(−3.38) = −1.690

∂J/∂b = (1/2)[2.4 + (−1.0)] = (1/2)(1.4) = 0.700

w = 0 − 0.1 × (−1.690) = 0.169, b = 0 − 0.1 × 0.700 = −0.070
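The arithmetic for Batch 1 can be replayed in a few lines of NumPy. This is only a sketch of the hand calculation: it uses the same illustrative sample values and the same gradient convention (average of error times input, with no factor of 2).

```python
import numpy as np

# Batch 1 from the worked example (illustrative values).
Xb = np.array([-1.2, 0.5])
yb = np.array([-2.4, 1.0])
w, b, lr = 0.0, 0.0, 0.1

err = (w * Xb + b) - yb        # y_hat - y for each sample
dw = np.mean(err * Xb)         # (1/k) * sum(err * x)
db = np.mean(err)              # (1/k) * sum(err)

w_new = w - lr * dw            # step against the gradient
b_new = b - lr * db
print(dw, db, w_new, b_new)
```

The printed values should agree with the hand-computed −1.690, 0.700, 0.169, and −0.070 up to floating-point rounding.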

Batch 2 (samples 2 and 3):

With the updated w=0.169, b=−0.070, predict for the next 2 samples and repeat.

Each batch update uses the weights from the previous batch — weights update 5 times within the epoch, not once.


Batch Size Tradeoffs

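| Batch size   | Gradient quality | Updates/epoch | Memory   | Generalization             |
|--------------|------------------|---------------|----------|----------------------------|
| 1 (SGD)      | Very noisy       | n             | Minimal  | Often better (noisy)       |
| 32–256       | Good estimate    | n/k           | Moderate | Standard                   |
| n (Batch GD) | Exact            | 1             | High     | Often worse (sharp minima) |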

The tradeoff is not just speed: gradient quality affects which minimum the optimizer finds. More updates with noisier gradients (small k) tend to find wider, flatter minima that generalize better.


Sharp Minima vs Wide Minima

Smaller batches introduce noise into the gradient estimate. That noise prevents the optimizer from settling into sharp, narrow minima on the loss surface.

[Figure: Sharp Minimum vs Wide Minimum: Generalization Gap. Left panel, sharp minimum (large-batch SGD): J_train is low, but a tiny perturbation makes the loss spike. Right panel, wide minimum (small-batch SGD): J_train is slightly higher, but a perturbation barely changes the loss. A wide minimum is robust to weight perturbations and generalizes better to test data.]

Keskar et al. (2017) reported numerical evidence that large-batch training tends to converge to sharp minimizers that generalize worse, while small-batch training tends to converge to wide minimizers. The proposed mechanism is the noise from small batches: the optimizer can't settle into sharp valleys because each noisy step pushes it out.
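To make "robust to perturbation" concrete, here is a toy one-dimensional sketch. The two quadratic losses and the perturbation size are made-up values chosen for illustration; real loss surfaces are high-dimensional and far less tidy.

```python
def sharp(w):
    """Toy loss with high curvature around its minimum at w = 0."""
    return 50.0 * w**2

def wide(w):
    """Toy loss with low curvature around its minimum at w = 0."""
    return 0.5 * w**2

delta = 0.1  # small weight perturbation (assumed value)

# How much the loss rises when the weight is nudged away from the minimum.
rise_sharp = sharp(delta) - sharp(0.0)
rise_wide = wide(delta) - wide(0.0)
print(f"sharp: +{rise_sharp:.3f}, wide: +{rise_wide:.3f}")
```

Both minima sit at the same loss, but the same nudge costs 100 times more in the sharp one. That gap is the intuition for why a shift between the training and test loss surfaces hurts more at a sharp minimum.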


GPU Efficiency Sweet Spot

GPUs execute operations on matrices. Processing a batch of 32 samples is nearly as fast as processing 1 sample because the GPU executes the same matrix multiply in parallel. The GPU's effective utilization (FLOPS) increases with batch size — up to the point where the batch no longer fits in GPU memory.

The time per epoch for mini-batch SGD follows a U-shape:

  • Very small batch (k=1, 2): many slow sequential updates, GPU underutilized
  • Optimal range (k=32–256): GPU fully utilized, fast updates
  • Very large batch (k=1024+): memory pressure, cache misses, slower per-step

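A common rule of thumb for throughput is the largest power of 2 that fits in GPU VRAM, typically 32, 64, 128, or 256. That choice maximizes hardware utilization, which is a separate question from the generalization effects discussed above.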
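The parallelism argument rests on the fact that a batch turns many small products into one matrix multiply. A minimal NumPy sketch (CPU only, toy shapes chosen here for illustration, no timing claims) showing that the two computations give the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))       # toy weight matrix: 4 features -> 3 outputs
batch = rng.standard_normal((32, 4))  # 32 samples, 4 features each

# One sample at a time: 32 separate vector-matrix products.
looped = np.stack([x @ W for x in batch])

# Whole batch at once: a single (32x4) @ (4x3) matrix multiply.
vectorized = batch @ W

print(np.allclose(looped, vectorized))
```

On a GPU the single multiply is the version that spreads across the parallel hardware; the loop leaves most of it idle.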

[Figure: Mini-Batch Gradient Flow (k=2). Sample 1: ∂L⁽¹⁾/∂w = −2.88. Sample 2: ∂L⁽²⁾/∂w = −0.50. Average gradient: (−2.88 − 0.50)/2 = −1.69. Update: w ← 0 − 0.1×(−1.69) = 0.169. The average is less noisy than a single sample but faster than waiting for all n samples.]

Shuffling

Shuffle the training data before each epoch. Without shuffling, every epoch visits the same batches in the same order. If the data is sorted by label (all low-salary first, then high-salary), the first batches have gradients that systematically point one way and the last batches have gradients that point the other way, so the model oscillates early and only stabilizes late. Shuffling makes each batch a random draw from the full dataset.
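A small sketch of per-epoch shuffling, assuming NumPy's `default_rng` generator and the 10-sample, k=2 setup from the anchor example. It only prints the batch composition; no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 2

for epoch in range(2):
    idx = rng.permutation(n)                          # fresh order every epoch
    batches = [idx[s:s + k] for s in range(0, n, k)]  # consecutive slices of size k
    # Every sample still appears exactly once per epoch.
    assert sorted(np.concatenate(batches).tolist()) == list(range(n))
    print(f"epoch {epoch}: {[bt.tolist() for bt in batches]}")
```

The batch contents change from one epoch to the next, while each epoch still covers the whole dataset once.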


Code

python
import numpy as np

np.random.seed(42)
X = np.random.randn(10)
y = 2 * X + np.random.randn(10) * 0.1

w, b, lr, batch_size = 0.0, 0.0, 0.1, 2

for epoch in range(3):
    idx   = np.random.permutation(len(X))
    X_sh  = X[idx]
    y_sh  = y[idx]
    for start in range(0, len(X), batch_size):
        Xb = X_sh[start : start + batch_size]
        yb = y_sh[start : start + batch_size]
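        y_hat = w * Xb + b
        # Gradient of plain MSE, mean((y_hat - y)**2): the factor 2 comes from the square.
        # The hand-worked example uses the (1/2)-scaled loss, so its gradients are half of these.
        dw    = 2 * np.mean((y_hat - yb) * Xb)
        db    = 2 * np.mean(y_hat - yb)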
        w    -= lr * dw
        b    -= lr * db
    loss = np.mean((w * X + b - y) ** 2)
    print(f"Epoch {epoch+1}: w={w:.4f}, b={b:.4f}, loss={loss:.4f}")
text
Epoch 1: w=1.7284, b=0.0213, loss=0.1083
Epoch 2: w=1.8812, b=0.0226, loss=0.0343
Epoch 3: w=1.9450, b=0.0196, loss=0.0130

w converges toward the true value of 2.0. With batch size 2 and 10 samples, each epoch performs 5 updates (vs 1 for batch GD) — the model converges much faster per epoch.
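To see the update count side by side for the three regimes, the loop can be wrapped in a function and run with k=1, k=2, and k=n. This is a sketch: the function name `train`, the 20-epoch budget, and the separate shuffling seed are assumptions made here, so the numbers will not match the run above exactly.

```python
import numpy as np

def train(X, y, batch_size, epochs=20, lr=0.1, seed=0):
    """Mini-batch SGD on y ≈ w*x + b; returns (w, b, number of updates)."""
    rng = np.random.default_rng(seed)   # controls only the shuffling order
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            sel = idx[start:start + batch_size]
            err = w * X[sel] + b - y[sel]
            w -= lr * 2 * np.mean(err * X[sel])   # plain-MSE gradient, as in the code above
            b -= lr * 2 * np.mean(err)
            updates += 1
    return w, b, updates

np.random.seed(42)
X = np.random.randn(10)
y = 2 * X + np.random.randn(10) * 0.1

for k in (1, 2, 10):   # SGD, mini-batch, batch GD
    w, b, updates = train(X, y, batch_size=k)
    print(f"k={k:2d}: {updates:3d} updates, w={w:.3f}, b={b:.3f}")
```

For the same number of epochs, smaller k means proportionally more weight updates, which is where the faster per-epoch progress comes from.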


Where this builds from: SGD (02) is the k=1 special case of mini-batch SGD. Batch GD (01) is the k=n special case. Mini-batch SGD is the practical compromise between them, and nearly every production training run uses it.

Where this leads: Momentum (04) addresses the remaining problem: mini-batch gradients are still noisy and can oscillate in narrow loss valleys. Momentum accumulates a velocity over past gradients to smooth the trajectory. Adagrad, RMSProp, and Adam adapt the learning rate per parameter — they are all applied on top of the mini-batch gradient, not in place of it.


Honest Limitations

Batch size is not infinitely tunable. GPU memory limits the maximum batch size. The minimum batch size is 1 (SGD). But the relationship between batch size and generalization is not monotone — very small batches with too-large learning rates also generalize poorly because the optimizer diverges. The Goldilocks zone is task-specific.

The generalization benefit of small batches interacts with learning rate. The paper showing sharp vs wide minima (Keskar et al.) assumed a fixed learning rate. If you scale the learning rate linearly with batch size (as Facebook's large-batch training paper, Goyal et al., 2017, suggests, together with a learning-rate warmup), large-batch training can match small-batch generalization, at least up to the batch sizes they tested. The interaction between batch size, learning rate, and generalization is still an active research area.
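The linear scaling rule itself is one line of arithmetic. A minimal sketch, assuming a learning rate of 0.1 that was tuned at batch size 256; the helper name `scaled_lr` is made up here, and the warmup phase is left out.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch / base_batch

# Assumed reference point: lr = 0.1 tuned at batch size 256.
for k in (256, 1024, 8192):
    print(k, scaled_lr(0.1, 256, k))
```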

Mini-batch gradients are unbiased but have batch-to-batch variance. The variance decreases as k increases (standard error = σ/√k). At k=1, variance is σ². At k=32, it is σ²/32. This variance reduction explains the smoother loss curves for larger batches.
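The σ²/k scaling can be checked with a short simulation. This sketch makes a simplifying assumption: per-sample gradients are drawn i.i.d. from a normal distribution with a made-up σ = 2, rather than computed from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0        # assumed std of per-sample gradients
trials = 20_000    # number of simulated mini-batches per batch size

for k in (1, 8, 32):
    # Each row is one mini-batch of k i.i.d. per-sample gradients.
    grads = rng.normal(0.0, sigma, size=(trials, k))
    batch_means = grads.mean(axis=1)   # the mini-batch gradient estimate
    print(f"k={k:2d}: empirical var={batch_means.var():.4f}, "
          f"sigma^2/k={sigma**2 / k:.4f}")
```

The empirical variance should track σ²/k closely, with small differences due to the finite number of trials.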


Test Your Understanding

  1. With n=1000 training samples and batch size k=32, one epoch has 31 full batches (1000/32 = 31.25, dropping the incomplete last batch). With k=64, it has 15 full batches (1000/64 ≈ 15.6). If each batch takes 0.01 seconds to process, how long does one epoch take for k=32 vs k=64? What is the GPU efficiency trade-off between these two choices?

  2. The claim "small batches find wider minima" is empirically supported by Keskar et al. Propose a thought experiment: if you train two models with batch size 1 (SGD) and batch size n (Batch GD) on an overparameterized network (more parameters than training samples), which would you expect to generalize better and why?

  3. You train a model with batch size 256 and achieve validation accuracy 92%. A colleague suggests switching to batch size 32. List three effects this change would have: on per-epoch training time, on loss curve smoothness, and on likely final test accuracy. Justify each.

  4. Shuffling data before each epoch prevents the model from seeing correlated batches. What would happen if the data were shuffled once (before training starts) but not between epochs? Would this be better than no shuffling? Worse than per-epoch shuffling? Explain.

  5. Mini-batch gradient is ∂J_batch/∂w = (1/k)Σ(ŷᵢ−yᵢ)·xᵢ over the k samples in the batch. If all k samples in a batch happen to have the same label (all class 1 in a binary problem), in what direction will the gradient point? What happens to the bias update? And why is per-epoch shuffling essential to prevent this from happening systematically?
