Early Stopping

Jul 3, 2026 • 7 min read • By Mohammed Vasim

Tags: deep-learning, neural-networks, machine-learning, representation-learning

Training longer is not always better. A model's validation loss typically decreases for some number of epochs, reaches a minimum, then climbs as the model starts memorizing training-specific patterns. L1 and L2 regularization fight overfitting by penalizing weight magnitudes. Early stopping fights it by halting training once validation performance stops improving. There is no change to the loss function and no penalty on the weights: you just stop before the damage compounds.

The technique requires a held-out validation set (not the test set), a patience hyperparameter that controls how long to wait for improvement, and a checkpoint system that saves the best model so you can restore it at the end.

Anchor: 20-epoch training run with explicit per-epoch loss values. Patience = 3.

The Training Trajectory

epoch	train_loss	val_loss	note
1	1.20	1.18
2	1.00	0.98
3	0.80	0.79
4	0.68	0.70
5	0.55	0.60
6	0.47	0.56
7	0.38	0.52
8	0.30	0.51
9	0.25	0.50
10	0.22	0.49
11	0.18	0.48
12	0.14	0.47	← best val
13	0.11	0.48	patience 1/3
14	0.09	0.50	patience 2/3
15	0.08	0.51	patience 3/3 → stop
16	0.06	0.53	(would have continued)
17	0.05	0.55
18	0.04	0.58
19	0.03	0.61
20	0.02	0.63
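
As a quick check on the table, the short sketch below locates the validation minimum and measures how far the two curves have drifted apart. The variable names (`gaps`, `best_epoch`) are illustrative choices, not part of any library.

```python
# Per-epoch losses copied from the anchor table.
train_losses = [1.20, 1.00, 0.80, 0.68, 0.55, 0.47, 0.38, 0.30, 0.25, 0.22,
                0.18, 0.14, 0.11, 0.09, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02]
val_losses   = [1.18, 0.98, 0.79, 0.70, 0.60, 0.56, 0.52, 0.51, 0.50, 0.49,
                0.48, 0.47, 0.48, 0.50, 0.51, 0.53, 0.55, 0.58, 0.61, 0.63]

# Epoch (1-indexed) with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Generalization gap per epoch: validation loss minus training loss.
gaps = [round(v - t, 2) for t, v in zip(train_losses, val_losses)]

print(f"best epoch: {best_epoch}, val_loss: {val_losses[best_epoch - 1]:.2f}")
print(f"gap at best epoch: {gaps[best_epoch - 1]:.2f}, gap at epoch 20: {gaps[-1]:.2f}")
```

Training loss keeps falling after epoch 12, so the gap nearly doubles by epoch 20 (0.33 to 0.61) while validation loss gets worse. That widening gap is the overfitting signature early stopping reacts to.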

The Algorithm

text

best_val = ∞
patience_counter = 0
best_epoch = 0

for each epoch:
    train one epoch
    compute val_loss
    
    if val_loss < best_val - min_delta:
        best_val = val_loss
        best_epoch = epoch
        save checkpoint
        patience_counter = 0
    else:
        patience_counter += 1
    
    if patience_counter >= patience:
        stop training
        restore checkpoint from best_epoch

Tracing through the anchor run with patience=3 and min_delta=0, starting from the best epoch:

epoch	val_loss	improved?	patience_counter	action
12	0.47	Yes (0.47 < 0.48)	0	✓ save checkpoint
13	0.48	No	1	patience 1/3
14	0.50	No	2	patience 2/3
15	0.51	No	3	patience 3/3 → stop, restore epoch 12
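
The pseudocode can be packaged as a small reusable helper. What follows is a minimal sketch, not a reference implementation: the `EarlyStopping` class, its `step` method, and the `weights` dictionary standing in for model parameters are all hypothetical names chosen for illustration, and the "checkpoint" is just an in-memory deep copy.

```python
import copy


class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_val = float("inf")
        self.best_epoch = 0
        self.best_state = None  # in-memory stand-in for a checkpoint file
        self.counter = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_val - self.min_delta:
            self.best_val = val_loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)  # "save checkpoint"
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


val_losses = [1.18, 0.98, 0.79, 0.70, 0.60, 0.56, 0.52, 0.51, 0.50, 0.49,
              0.48, 0.47, 0.48, 0.50, 0.51, 0.53, 0.55, 0.58, 0.61, 0.63]

stopper = EarlyStopping(patience=3)
weights = {"epoch_trained": 0}  # hypothetical stand-in for model parameters
stopped_at = None

for epoch, vl in enumerate(val_losses, 1):
    weights["epoch_trained"] = epoch  # pretend one epoch of training happened
    if stopper.step(epoch, vl, weights):
        stopped_at = epoch
        weights = stopper.best_state  # "restore checkpoint from best_epoch"
        break

print(stopped_at, stopper.best_epoch, weights)
```

The deep copy matters: storing a reference to `weights` would let later epochs silently overwrite the saved state, and the restore would hand back the overfit model.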

Patience Parameter

patience	stops at epoch	risk
1	13	Stops too early — first non-improvement triggers stop, misses genuine plateau
3	15	Reasonable — waits long enough to distinguish noise from trend
10	20 (end of run)	Allows significant overfit before stopping

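On the anchor data, patience=1 stops at epoch 13 (val 0.48 after the best 0.47) and patience=3 stops at epoch 15. Patience=10 would not trigger until epoch 22, which is beyond the 20-epoch run, so early stopping never fires and training runs to completion. In all three cases the best checkpoint is epoch 12 (val 0.47): on a curve this smooth, patience mainly changes how many epochs are spent after the best one. The risks in the table show up on noisier curves, where a small patience can stop before a later, lower minimum.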
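
The stop epochs for different patience values can be reproduced with a few lines. This is a sketch under the same rule as the algorithm section; `stop_epoch` is a hypothetical helper name, and it returns `None` when patience never runs out within the run.

```python
def stop_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_val, best_epoch, counter = float("inf"), 0, 0
    for epoch, vl in enumerate(val_losses, 1):
        if vl < best_val - min_delta:
            best_val, best_epoch, counter = vl, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


val_losses = [1.18, 0.98, 0.79, 0.70, 0.60, 0.56, 0.52, 0.51, 0.50, 0.49,
              0.48, 0.47, 0.48, 0.50, 0.51, 0.53, 0.55, 0.58, 0.61, 0.63]

results = {p: stop_epoch(val_losses, p) for p in (1, 3, 10)}
for p, (stopped, best) in results.items():
    print(f"patience={p:2d} -> stop epoch: {stopped}, best epoch: {best}")
```

On this data every setting reports best epoch 12. Only the stop epoch differs, and patience=10 reports `None` because the counter reaches just 8 by epoch 20.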

min_delta

Without min_delta, val_loss 0.470 → 0.469 counts as improvement — a noise-level improvement resets the patience counter. This makes early stopping too sensitive to small fluctuations.

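min_delta=0.01: an improvement only counts if val_loss drops below best_val by more than 0.01, because the algorithm above uses a strict < comparison.

With min_delta=0.01 on the anchor: the moves from 0.49 (epoch 10) to 0.48 (epoch 11) and from 0.48 to 0.47 (epoch 12) are each exactly 0.01, which puts them right on the threshold. Under the strict comparison they do not count in exact arithmetic, and in floating point the outcome depends on rounding, so min_delta=0.01 is a fragile choice for a curve that improves in 0.01 steps. A smaller value such as min_delta=0.001 counts both steps and leaves the anchor behavior unchanged: best epoch 12, stop at epoch 15. The move from 0.47 to 0.48 (epoch 13) is no improvement under any setting. On noisier data with small oscillations, min_delta prevents false resets.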
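
To see the false-reset effect, here is a sketch on a made-up noisy plateau. The `noisy_val` series is invented for illustration and is not from the anchor run. It drifts down by about 0.001 every other epoch, which is the kind of noise-level change min_delta is meant to ignore.

```python
def stop_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_val, best_epoch, counter = float("inf"), 0, 0
    for epoch, vl in enumerate(val_losses, 1):
        if vl < best_val - min_delta:
            best_val, best_epoch, counter = vl, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical plateau: tiny oscillating gains after a real drop at epoch 2.
noisy_val = [0.500, 0.470, 0.472, 0.469, 0.471, 0.468, 0.470, 0.467, 0.469, 0.466]

without_delta = stop_epoch(noisy_val, patience=3, min_delta=0.0)
with_delta = stop_epoch(noisy_val, patience=3, min_delta=0.01)

print("min_delta=0.00:", without_delta)
print("min_delta=0.01:", with_delta)
```

With min_delta=0 each 0.001 dip resets the counter, so training never stops within these 10 epochs. With min_delta=0.01 the dips are ignored and training stops at epoch 5, keeping the epoch-2 checkpoint.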

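Checkpoint Timeline

On the anchor run the checkpoint is overwritten at every epoch from 1 through 12, since each one improves on the last. Epochs 13 to 15 leave it untouched while the patience counter climbs to 3, and the epoch-12 checkpoint is the one restored when training stops.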

Code

python

# Anchor run: per-epoch losses from the training trajectory table.
train_losses = [1.20, 1.00, 0.80, 0.68, 0.55, 0.47, 0.38, 0.30, 0.25, 0.22,
                0.18, 0.14, 0.11, 0.09, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02]
val_losses   = [1.18, 0.98, 0.79, 0.70, 0.60, 0.56, 0.52, 0.51, 0.50, 0.49,
                0.48, 0.47, 0.48, 0.50, 0.51, 0.53, 0.55, 0.58, 0.61, 0.63]

patience = 3       # epochs to wait without improvement before stopping
min_delta = 0.0    # minimum drop in val_loss that counts as an improvement
best_val = float('inf')
best_epoch = 0
patience_counter = 0

for epoch, (tl, vl) in enumerate(zip(train_losses, val_losses), 1):
    if vl < best_val - min_delta:
        # Improvement: record it and reset the counter.
        # A real training loop would save the model weights here.
        best_val = vl
        best_epoch = epoch
        patience_counter = 0
        status = "✓ saved"
    else:
        # No improvement: use up one unit of patience.
        patience_counter += 1
        status = f"patience {patience_counter}/{patience}"
    print(f"Epoch {epoch:2d} | train={tl:.2f} | val={vl:.2f} | {status}")
    if patience_counter >= patience:
        # A real training loop would restore the best_epoch weights here.
        print(f"\nEarly stop at epoch {epoch}. Best epoch: {best_epoch}, val_loss: {best_val:.2f}")
        break

text

Epoch  1 | train=1.20 | val=1.18 | ✓ saved
Epoch  2 | train=1.00 | val=0.98 | ✓ saved
Epoch  3 | train=0.80 | val=0.79 | ✓ saved
Epoch  4 | train=0.68 | val=0.70 | ✓ saved
Epoch  5 | train=0.55 | val=0.60 | ✓ saved
Epoch  6 | train=0.47 | val=0.56 | ✓ saved
Epoch  7 | train=0.38 | val=0.52 | ✓ saved
Epoch  8 | train=0.30 | val=0.51 | ✓ saved
Epoch  9 | train=0.25 | val=0.50 | ✓ saved
Epoch 10 | train=0.22 | val=0.49 | ✓ saved
Epoch 11 | train=0.18 | val=0.48 | ✓ saved
Epoch 12 | train=0.14 | val=0.47 | ✓ saved
Epoch 13 | train=0.11 | val=0.48 | patience 1/3
Epoch 14 | train=0.09 | val=0.50 | patience 2/3
Epoch 15 | train=0.08 | val=0.51 | patience 3/3

Early stop at epoch 15. Best epoch: 12, val_loss: 0.47

Related Concepts

Early stopping is a regularization technique alongside dropout (03-dropout.md) and L1/L2 regularization (08-l1-l2-regularization.md), but it works through a different mechanism: it limits training duration rather than penalizing the loss. In practice, early stopping is often combined with learning rate scheduling: when the validation loss stops improving, you first reduce the learning rate (ReduceLROnPlateau) and stop only if the loss still doesn't improve. For this to work, the scheduler's patience has to be shorter than the early-stopping patience, so the reduced learning rate gets a chance to help. This two-stage approach can recover additional performance before giving up.
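
The two-stage control flow can be sketched without any framework. This is only a simulation of the bookkeeping: `two_stage`, `lr_patience`, `stop_patience`, and `factor` are hypothetical names, the logic only loosely mirrors what a plateau scheduler does, and the replayed losses cannot respond to the lower learning rate the way a real run would.

```python
def two_stage(val_losses, lr=1e-3, lr_patience=2, stop_patience=4, factor=0.1):
    """Reduce the learning rate on a short plateau, stop on a long one.

    Returns (events, final_lr), where events is a list of (epoch, action).
    """
    best_val = float("inf")
    lr_bad = stop_bad = 0  # epochs without improvement, tracked per stage
    events = []
    for epoch, vl in enumerate(val_losses, 1):
        if vl < best_val:
            best_val = vl
            lr_bad = stop_bad = 0
            continue
        lr_bad += 1
        stop_bad += 1
        if stop_bad >= stop_patience:
            events.append((epoch, "stop"))
            break
        if lr_bad >= lr_patience:
            lr *= factor
            lr_bad = 0  # give the new learning rate its own waiting period
            events.append((epoch, "reduce_lr"))
    return events, lr


val_losses = [1.18, 0.98, 0.79, 0.70, 0.60, 0.56, 0.52, 0.51, 0.50, 0.49,
              0.48, 0.47, 0.48, 0.50, 0.51, 0.53, 0.55, 0.58, 0.61, 0.63]

events, final_lr = two_stage(val_losses)
print(events, final_lr)
```

Replayed on the anchor losses, the learning rate is cut at epoch 14 and training stops at epoch 16. In a real run the cut at epoch 14 might bring validation loss back below 0.47 and reset both counters, which is the whole point of the first stage.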

Honest Limitations

Early stopping requires a held-out validation set. For small datasets, setting aside 10–20% of data for validation reduces the training set non-trivially. If the model is data-limited, you may prefer k-fold cross-validation combined with a fixed training budget instead.

Patience is a hyperparameter without a principled default. Patience=5 is common in practice, but on noisy data or tasks with long learning plateaus (NLP fine-tuning, for example), patience=20 or higher is often needed to avoid stopping during what turns out to be a temporary plateau before further improvement.

Non-smooth validation loss curves — common when the validation set is small or the task has high variance (translation, generation) — make patience counters unreliable. A spike in val loss at one epoch doesn't mean overfitting has started. Smoothing val loss with an exponential moving average before comparing against best_val is a practical fix not often discussed in tutorials.
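
A minimal sketch of the smoothing idea follows. The `raw` series is invented for illustration, `ema` and `stop_epoch` are hypothetical helper names, and `alpha` here means the weight given to the newest value (some libraries define it the other way around).

```python
def ema(values, alpha=0.3):
    """Exponential moving average; alpha is the weight on the newest value."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def stop_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_val, best_epoch, counter = float("inf"), 0, 0
    for epoch, vl in enumerate(val_losses, 1):
        if vl < best_val - min_delta:
            best_val, best_epoch, counter = vl, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical spiky curve whose underlying trend is still downward.
raw = [0.60, 0.55, 0.58, 0.50, 0.54, 0.47, 0.52, 0.45, 0.49, 0.43]

raw_result = stop_epoch(raw, patience=1)
smooth_result = stop_epoch(ema(raw, alpha=0.3), patience=1)

print("raw:     ", raw_result)
print("smoothed:", smooth_result)
```

On the raw values, patience=1 stops at the first spike (epoch 3). On the smoothed values the curve is monotonically decreasing and training continues. The cost is lag: the smoothed minimum trails the raw one, so the checkpoint chosen on the smoothed curve may not be the epoch with the lowest raw validation loss.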

Test Your Understanding

1. In the anchor run, the validation loss at epoch 11 is 0.48 and at epoch 12 is 0.47. With patience=3 and min_delta=0.005, does epoch 12 count as an improvement? Show your reasoning.
2. You run early stopping with patience=1 on the anchor data. At which epoch does training stop, and what val_loss is restored? Is this a better or worse model than patience=3?
3. A model is trained on 1000 samples. You allocate 200 (20%) to validation for early stopping. Training stops at epoch 8. If you retrained on all 1000 samples for exactly 8 epochs (no early stopping), would you expect better or worse performance? Explain the trade-off.
4. Val loss on a noisy task follows: 0.50, 0.49, 0.52, 0.48, 0.53, 0.47, 0.51, 0.50... With patience=2, at which step does training stop? Is this the right decision? What would you change?
5. Early stopping is sometimes called "free regularization" because it requires no change to the model or loss. In what theoretical sense does stopping early act as regularization? What property of the solution space does it implicitly constrain?
