Label Smoothing Cross-Entropy
Standard cross-entropy with one-hot labels has a gradient problem. To produce a hard prediction that puts 0.998 on the true class of a 4-class softmax, the pre-softmax logit for the true class must be much larger than all others. How much larger? If the remaining 0.002 is split evenly, each other class gets about 0.00067, so the true-class logit must exceed the others by about log(0.998/0.00067) ≈ 7.3. There is no finite logit value that produces exactly 1.0, so the model keeps growing its logits throughout training, chasing an asymptote. The gradient at the true class is p_k − 1, which never reaches zero as long as p_k < 1.0.
The consequence: models become overconfident. The logits for incorrect classes shrink to very negative values, leaving the model no way to say "these two classes look similar." Label smoothing replaces hard targets with soft ones: instead of pushing the model to output exactly 1.0 for the true class, it trains the model toward a slightly lower target, which is 0.925 for the ε = 0.1, four-class anchor used below. This is achievable at finite logit values, and the true-class gradient becomes zero once the model hits 0.925, which prevents unbounded logit growth.
Anchor: 4-class classification (weather). True class = "rainy" (index 2). ε = 0.1.
The Overconfidence Setup
To put p₂ = 0.998 on the true class with softmax (the other three classes share the remaining 0.002), the logits must satisfy: e^{z₂} / (e^{z₀} + e^{z₁} + e^{z₂} + e^{z₃}) = 0.998
If z₀=z₁=z₃=0, then z₂ ≈ log(0.998 × 3 / 0.002) ≈ log(1497) ≈ 7.3
For the smoothed target [0.025, 0.025, 0.925, 0.025], the model only needs e^{z₂}/(3+e^{z₂}) = 0.925. Solving gives e^{z₂} = 0.925 × 3 / 0.075 = 37, so z₂ = log(37) ≈ 3.6
The logit only needs to reach 3.6 instead of 7.3. Gradients push logits toward the target and stop when the output matches — at 0.925, the logit is done growing.
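The two logit values above can be checked numerically. The sketch below assumes, as in the text, that the three other logits are fixed at 0. `required_logit` is an illustrative helper name, not a library function, and only the standard library is used.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def required_logit(p_true, K):
    """True-class logit that yields probability p_true when the
    other K-1 logits are all 0 (the assumption made in the text)."""
    return math.log(p_true * (K - 1) / (1.0 - p_true))

K = 4
for p_target in (0.998, 0.925):
    z2 = required_logit(p_target, K)
    probs = softmax([0.0, 0.0, z2, 0.0])  # feed the logit back through softmax
    print(f"target p2={p_target}: z2={z2:.2f}, softmax check p2={probs[2]:.3f}")
```

Feeding the solved logit back through softmax is a quick way to confirm the algebra: the recovered probability should match the target.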
Label Smoothing Formula
y_smooth_k = (1 − ε) · y_hard_k + ε / K
- For the true class (y_hard = 1): y_smooth = (1 − ε) + ε/K = (1 − 0.1) + 0.1/4 = 0.9 + 0.025 = 0.925
- For all other classes (y_hard = 0): y_smooth = 0 + ε/K = 0.1/4 = 0.025
Anchor smoothed labels:
| class | y_hard | y_smooth | arithmetic |
|---|---|---|---|
| sunny | 0.0 | 0.025 | 0 + 0.1/4 |
| cloudy | 0.0 | 0.025 | 0 + 0.1/4 |
| rainy | 1.0 | 0.925 | (1−0.1) + 0.1/4 |
| stormy | 0.0 | 0.025 | 0 + 0.1/4 |
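In practice labels usually arrive as integer class indices rather than one-hot vectors. As a rough sketch (the function name `smooth_labels` and the two-example batch are made up for illustration), the same formula can be applied per row, and each row should still sum to 1:

```python
def smooth_labels(class_indices, K, epsilon):
    """Return one smoothed target row per integer class index."""
    off = epsilon / K              # mass every class receives
    on = (1.0 - epsilon) + off     # true class keeps the remainder
    return [[on if k == idx else off for k in range(K)]
            for idx in class_indices]

# Hypothetical mini-batch: "rainy" (index 2) and "sunny" (index 0).
targets = smooth_labels([2, 0], K=4, epsilon=0.1)
for row in targets:
    print([round(v, 3) for v in row], "sum =", round(sum(row), 6))
```

The row sum stays at 1 because (1 − ε) + K · (ε / K) = 1, so the smoothed target is still a valid probability distribution.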
Cross-Entropy with Smoothed Labels
Model softmax output: p = [0.1, 0.2, 0.6, 0.1]
L = −Σ_k y_smooth_k · log(p_k)
Trace Table:
| class | y_hard | y_smooth | log(p_k) | y_smooth·log(p_k) |
|---|---|---|---|---|
| sunny | 0.000 | 0.025 | log(0.1)=−2.3026 | 0.025×(−2.3026)=−0.0576 |
| cloudy | 0.000 | 0.025 | log(0.2)=−1.6094 | 0.025×(−1.6094)=−0.0402 |
| rainy | 1.000 | 0.925 | log(0.6)=−0.5108 | 0.925×(−0.5108)=−0.4725 |
| stormy | 0.000 | 0.025 | log(0.1)=−2.3026 | 0.025×(−2.3026)=−0.0576 |
| CE (smooth) | | | | −(sum of column) = 0.6279 |
Hard-label CE on same prediction: −(0×log0.1 + 0×log0.2 + 1×log0.6 + 0×log0.1) = −log(0.6) = 0.5108
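Smoothed CE (0.6279) > hard CE (0.5108) on this prediction. The three off-class terms now contribute 0.0576 + 0.0402 + 0.0576 = 0.1554 to the loss, while the true-class term shrinks from 0.5108 to 0.4725, for a net increase of 0.1171. Because the off-class targets are 0.025 rather than 0, the loss penalizes driving those probabilities toward zero. This forces the model to spread some probability mass across classes rather than collapsing everything onto the true class.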
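One way to see this effect is to evaluate the smoothed loss at a few candidate predictions. The sketch below uses the anchor's smoothed target; the "overconfident" vector is a made-up example, not something from the trace table.

```python
import math

def cross_entropy(y, p):
    """-sum_k y_k * log(p_k); assumes every p_k > 0."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

y_smooth = [0.025, 0.025, 0.925, 0.025]

candidates = {
    "anchor prediction": [0.1, 0.2, 0.6, 0.1],
    "matches target":    [0.025, 0.025, 0.925, 0.025],
    "overconfident":     [0.001, 0.001, 0.997, 0.001],  # hypothetical
}
losses = {name: cross_entropy(y_smooth, p) for name, p in candidates.items()}
for name, loss in losses.items():
    print(f"{name:>18}: smoothed CE = {loss:.4f}")
```

The smallest loss belongs to the prediction that matches the smoothed target, and the overconfident prediction scores worse than it. Under hard labels the overconfident prediction would have the lowest loss of the three.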
Effect on Gradient
For softmax cross-entropy, the gradient of the loss with respect to the logit at class k is:
∂L/∂z_k = p_k − y_smooth_k
For the true class (rainy, k=2) with p_k=0.6:
- Hard label: ∂L/∂z₂ = 0.6 − 1.0 = −0.40
- Smoothed label: ∂L/∂z₂ = 0.6 − 0.925 = −0.325
The smoothed gradient is 18.75% smaller. Training stops pushing z₂ upward the moment p₂ reaches 0.925 — not 1.0. At that point, the model has finite logits, and the probability mass on other classes is still 0.025 each.
For an incorrect class (sunny, k=0) with p_k=0.1:
- Hard label: ∂L/∂z₀ = 0.1 − 0.0 = +0.10 (push down)
- Smoothed label: ∂L/∂z₀ = 0.1 − 0.025 = +0.075 (push down, less aggressively)
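The formula ∂L/∂z_k = p_k − y_smooth_k can be sanity-checked against a finite-difference estimate. This is a minimal sketch: it picks logits z = log(p) so that softmax(z) reproduces the anchor prediction, and the step size `h` is an arbitrary small value.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy between target y and softmax(z)."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, softmax(z)))

p = [0.1, 0.2, 0.6, 0.1]
z = [math.log(pk) for pk in p]   # logits whose softmax equals p
y_smooth = [0.025, 0.025, 0.925, 0.025]

h = 1e-6                         # finite-difference step (assumed)
numeric = []
for k in range(len(z)):
    z_plus = list(z)
    z_plus[k] += h
    z_minus = list(z)
    z_minus[k] -= h
    # central difference approximates dL/dz_k
    numeric.append((loss(z_plus, y_smooth) - loss(z_minus, y_smooth)) / (2 * h))

analytic = [pk - yk for pk, yk in zip(softmax(z), y_smooth)]
for k, (n, a) in enumerate(zip(numeric, analytic)):
    print(f"k={k}: numeric={n:+.4f}  analytic={a:+.4f}")
```

The closed form relies on the target summing to 1, which holds for smoothed labels just as it does for one-hot labels.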
ε Sensitivity
| ε | true class label (y_smooth) | others |
|---|---|---|
| 0.0 | 1.000 | 0.000 |
| 0.1 | 0.925 | 0.025 |
| 0.2 | 0.850 | 0.050 |
| 0.3 | 0.775 | 0.075 |
At ε=0: standard cross-entropy, hard targets. At ε=0.1: a common default (Inception-v3, which introduced label smoothing, and the original Transformer both used it). At ε=0.3: heavy smoothing, with a true class label of 0.775. The model may underfit because correctly classified samples still receive significant gradient pushback.
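To make the "gradient pushback" concrete, the sketch below evaluates the true-class gradient p − y_smooth for a confident, correct prediction at each ε in the table. The value p = 0.90 is a hypothetical choice for illustration.

```python
K = 4
p_true = 0.90   # hypothetical confident, correct prediction

grads = {}
for epsilon in (0.0, 0.1, 0.2, 0.3):
    y_true = (1.0 - epsilon) + epsilon / K   # smoothed target, true class
    grads[epsilon] = p_true - y_true         # dL/dz at the true class
    direction = "raise logit" if grads[epsilon] < 0 else "lower logit"
    print(f"eps={epsilon}: target={y_true:.3f}, "
          f"grad={grads[epsilon]:+.3f} ({direction})")
```

Once the target drops below the current prediction, the sign of the gradient flips and training actively lowers the true-class logit, even though the example is classified correctly.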
Code
```python
import numpy as np


def label_smooth(y_hard, epsilon, K):
    """Mix the one-hot target with a uniform distribution over K classes."""
    return (1 - epsilon) * y_hard + epsilon / K


def cross_entropy(y, p, eps=1e-10):
    """Cross-entropy -sum(y * log p); eps guards against log(0)."""
    return -np.sum(y * np.log(p + eps))


K = 4
true_class = 2
epsilon = 0.1
classes = ["sunny", "cloudy", "rainy", "stormy"]

y_hard = np.zeros(K)
y_hard[true_class] = 1.0
y_smooth = label_smooth(y_hard, epsilon, K)

p = np.array([0.1, 0.2, 0.6, 0.1])  # model softmax output
ce_hard = cross_entropy(y_hard, p)
ce_smooth = cross_entropy(y_smooth, p)

print(f"{'Class':>8} | {'y_hard':>7} | {'y_smooth':>9} | {'log(p)':>8} | {'y_s*log(p)':>11}")
for c, yh, ys, pk in zip(classes, y_hard, y_smooth, p):
    print(f"{c:>8} | {yh:>7.3f} | {ys:>9.3f} | {np.log(pk):>8.4f} | {ys*np.log(pk):>11.4f}")

print(f"\nCE (hard labels): {ce_hard:.4f}")
print(f"CE (label smoothing): {ce_smooth:.4f}")

# Gradient of the loss w.r.t. the true-class logit: p_k - y_k
grad_hard = p[true_class] - 1
grad_smooth = p[true_class] - (1 - epsilon + epsilon / K)
print(f"\nGradient at true class — hard: {grad_hard:.4f}, smooth: {grad_smooth:.4f}")

print("\nε sensitivity (smoothed label for true class):")
for eps in [0.0, 0.1, 0.2, 0.3]:
    ys = label_smooth(y_hard, eps, K)
    print(f" ε={eps}: true class label = {ys[true_class]:.3f}, others = {ys[0]:.3f}")
```

Output:

```
   Class |  y_hard |  y_smooth |   log(p) |  y_s*log(p)
   sunny |   0.000 |     0.025 |  -2.3026 |     -0.0576
  cloudy |   0.000 |     0.025 |  -1.6094 |     -0.0402
   rainy |   1.000 |     0.925 |  -0.5108 |     -0.4725
  stormy |   0.000 |     0.025 |  -2.3026 |     -0.0576

CE (hard labels): 0.5108
CE (label smoothing): 0.6279

Gradient at true class — hard: -0.4000, smooth: -0.3250

ε sensitivity (smoothed label for true class):
 ε=0.0: true class label = 1.000, others = 0.000
 ε=0.1: true class label = 0.925, others = 0.025
 ε=0.2: true class label = 0.850, others = 0.050
 ε=0.3: true class label = 0.775, others = 0.075
```

Related Concepts
Label smoothing modifies the target distribution in cross-entropy (03-classification-losses.md), which is applied after softmax (03-activations/07-softmax.md) converts logits to probabilities. Knowledge distillation uses soft targets from a teacher model — this is label smoothing taken to its logical conclusion, where the "smoothed" targets are actual teacher probabilities rather than uniform noise. Temperature scaling for calibration is the dual approach: instead of modifying targets, it scales logits at inference time to reduce overconfidence.
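For contrast, temperature scaling can be sketched in a few lines. This is only an illustration under stated assumptions: the logits are hypothetical, and T = 2.0 is chosen arbitrarily, whereas in practice T is fit on held-out data.

```python
import math

def softmax(z, T=1.0):
    """Softmax of logits z divided by temperature T (T > 0)."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.0, 7.3, 0.0]   # hypothetical overconfident logits
for T in (1.0, 2.0):            # T = 2.0 is an arbitrary illustration
    probs = softmax(logits, T)
    top = max(probs)
    print(f"T={T}: max prob = {top:.3f}, argmax = {probs.index(top)}")
```

Dividing by T > 1 lowers the top probability without changing which class wins, so accuracy is untouched. Label smoothing, by contrast, changes what the model learns during training.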
Honest Limitations
The right ε is task-dependent and must be tuned. With ε=0.2 on a task where near-perfect training accuracy is genuinely attainable (e.g., fine-grained image recognition with 10,000 labeled examples per class), you prevent the model from ever being highly confident, which can hurt both training speed and final accuracy. The default ε=0.1 may not transfer across domains.
Label smoothing does not apply to regression, and it is not a calibration method in its own right. Smoothing is specifically about preventing overconfident categorical distributions; it pulls every training target toward the same ceiling (0.925 in the anchor) regardless of how ambiguous the example actually is. For regression tasks, confidence calibration is handled by uncertainty estimation methods (MC dropout, deep ensembles, conformal prediction), not label smoothing.
Post-hoc error analysis becomes harder. When the model makes errors, you want to understand which classes it confuses. With soft labels, the training signal says "be a little uncertain about everything" — so the model's confusion matrix is harder to read. You can't tell whether the model learned that rainy and cloudy are visually similar (useful signal) or whether it's just uniformly uncertain about everything (label smoothing side effect).
Test Your Understanding
- With ε=0.1 and K=10 classes, compute the smoothed label for the true class and for each other class. How does the smoothed label for the true class change relative to K=4?
- A model outputs p = [0.02, 0.01, 0.95, 0.02] for the anchor. Compute the gradient at the true class (rainy, k=2) under hard CE and under label smoothing with ε=0.1. Which gradient is larger in magnitude?
- Why does label smoothing have the same effect as adding a KL term between the model's output distribution and a uniform distribution to the cross-entropy loss? Write out the modified loss in terms of both formulations.
- You train an image classifier with label smoothing ε=0.1. At test time, you check the model's top-1 softmax probability on correctly classified examples and find it averages 0.71. Is this expected? What maximum probability do you expect for this ε and K?
- A model trained with hard labels gives accuracy 94% on validation. Retraining with ε=0.1 gives 95% but the maximum softmax output drops from 0.97 to 0.88. An engineer says "the model is less confident, so it's worse — we should use the original." What is the flaw in this reasoning?