Label Smoothing Cross-Entropy

Jul 3, 2026 • 7 min read • By Mohammed Vasim

Tags: deep-learning, neural-networks, machine-learning, representation-learning

Standard cross-entropy with one-hot labels has a gradient problem. To produce a hard prediction of roughly [0.001, 0.001, 0.998, 0.001] from softmax (entries rounded), the pre-softmax logit for the true class must be much larger than all the others. How much larger? To reach 0.998 in a 4-class softmax, with the remaining 0.002 split evenly across the other three classes, the true-class logit must exceed the others by log(0.998 × 3 / 0.002) = log(1497) ≈ 7.3. No finite logit produces exactly 1.0, so the model keeps growing its logits throughout training, chasing an asymptote. The gradient at the true class k is p_k − 1, which never reaches zero as long as p_k < 1.0.

The consequence: models become overconfident. The logits for incorrect classes shrink to very negative values, leaving the model no way to say "these two classes look similar." Label smoothing replaces hard targets with soft ones: instead of pushing the model to output exactly 1.0 for the true class, it trains the model toward a slightly lower target, 0.925 in the running example below (ε = 0.1, 4 classes). That target is reachable at finite logit values, and the true-class gradient becomes zero once the model hits it, which prevents unbounded logit growth.

Anchor: 4-class classification (weather). True class = "rainy" (index 2). ε = 0.1.

The Overconfidence Setup

To achieve p₂ = 0.998 with softmax (p ≈ [0.001, 0.001, 0.998, 0.001] after rounding), the logits must satisfy: e^{z₂} / (e^{z₀} + e^{z₁} + e^{z₂} + e^{z₃}) = 0.998

If z₀=z₁=z₃=0, then z₂ ≈ log(0.998 × 3 / 0.002) ≈ log(1497) ≈ 7.3

For the smoothed target [0.025, 0.025, 0.925, 0.025], again with z₀=z₁=z₃=0: the model only needs e^{z₂}/(3+e^{z₂}) = 0.925 → e^{z₂} = 0.925 × 3 / 0.075 = 37 → z₂ = log(37) ≈ 3.6

The logit only needs to reach 3.6 instead of 7.3. Gradients push logits toward the target and stop when the output matches — at 0.925, the logit is done growing.
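The two logit values above can be checked numerically. This is a minimal sketch, assuming (as in the text) that the three non-true logits are held at 0; `softmax` and `required_logit` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def required_logit(p_target, K):
    # Solve e^z / ((K - 1) + e^z) = p_target for z, with the other K - 1 logits at 0.
    return np.log(p_target * (K - 1) / (1.0 - p_target))

K = 4
z_hard = required_logit(0.998, K)    # near-one-hot prediction
z_smooth = required_logit(0.925, K)  # smoothed target for epsilon = 0.1

for name, z2 in [("hard", z_hard), ("smoothed", z_smooth)]:
    p = softmax(np.array([0.0, 0.0, z2, 0.0]))
    print(f"{name}: z2 = {z2:.2f}, softmax = {np.round(p, 4)}")
```

Feeding each logit back through softmax recovers the intended true-class probability, which confirms that the smoothed target sits at roughly half the logit magnitude of the near-one-hot one.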

Label Smoothing Formula

y_smooth_k = (1 − ε) · y_hard_k + ε / K

For the true class (y_hard = 1): y_smooth = (1 − ε) + ε/K = (1 − 0.1) + 0.1/4 = 0.9 + 0.025 = 0.925
For all other classes (y_hard = 0): y_smooth = 0 + ε/K = 0.1/4 = 0.025

Anchor smoothed labels:

class	y_hard	y_smooth	arithmetic
sunny	0.0	0.025	0 + 0.1/4
cloudy	0.0	0.025	0 + 0.1/4
rainy	1.0	0.925	(1−0.1) + 0.1/4
stormy	0.0	0.025	0 + 0.1/4
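In practice labels usually arrive as integer class indices for a whole batch rather than as a single one-hot vector. The sketch below applies the same formula row by row; `smooth_labels` and the example batch are hypothetical, chosen only to mirror the weather anchor. Each row still sums to 1, because (1 − ε) + K · ε/K = 1.

```python
import numpy as np

def smooth_labels(class_indices, K, epsilon):
    # One-hot encode the integer labels, then mix with the uniform distribution.
    one_hot = np.eye(K)[class_indices]
    return (1.0 - epsilon) * one_hot + epsilon / K

labels = np.array([2, 0, 3])  # hypothetical batch: rainy, sunny, stormy
y_smooth_batch = smooth_labels(labels, K=4, epsilon=0.1)

print(y_smooth_batch)
print(y_smooth_batch.sum(axis=1))  # every row remains a valid distribution
```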

Cross-Entropy with Smoothed Labels

Model softmax output: p = [0.1, 0.2, 0.6, 0.1]

L = −Σ_k y_smooth_k · log(p_k)

Trace Table:

class	y_hard	y_smooth	log(p_k)	y_smooth·log(p_k)
sunny	0.000	0.025	log(0.1)=−2.3026	0.025×(−2.3026)=−0.0576
cloudy	0.000	0.025	log(0.2)=−1.6094	0.025×(−1.6094)=−0.0402
rainy	1.000	0.925	log(0.6)=−0.5108	0.925×(−0.5108)=−0.4725
stormy	0.000	0.025	log(0.1)=−2.3026	0.025×(−2.3026)=−0.0576
CE (smooth)				= 0.6279

Hard-label CE on same prediction: −(0×log0.1 + 0×log0.2 + 1×log0.6 + 0×log0.1) = −log(0.6) = 0.5108

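Smoothed CE (0.6279) > hard CE (0.5108) on this prediction. The three off-class terms now contribute 0.0576 + 0.0402 + 0.0576 = 0.1554 to the loss, while the true-class term drops from 0.5108 to 0.4725, for a net increase of 0.1171. Because the smoothed loss is minimized when p equals the smoothed target, driving a non-true class toward zero probability is now penalized. This forces the model to spread some probability mass across classes rather than collapsing everything onto the true class.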
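Another way to read the smoothed loss: because the smoothed target is a mixture of the one-hot vector and the uniform distribution, the loss is the same mixture of two cross-entropies, (1 − ε) · CE(one-hot, p) + ε · CE(uniform, p). The sketch below checks this on the anchor prediction; the variable names are illustrative.

```python
import numpy as np

epsilon, K = 0.1, 4
p = np.array([0.1, 0.2, 0.6, 0.1])
true_class = 2

ce_hard = -np.log(p[true_class])   # cross-entropy against the one-hot target
ce_uniform = -np.mean(np.log(p))   # cross-entropy against the uniform target 1/K
ce_mixed = (1 - epsilon) * ce_hard + epsilon * ce_uniform

# Direct computation from the smoothed label vector, for comparison.
y_smooth = np.full(K, epsilon / K)
y_smooth[true_class] += 1 - epsilon
ce_direct = -np.sum(y_smooth * np.log(p))

print(f"direct: {ce_direct:.4f}, mixed: {ce_mixed:.4f}")
# direct: 0.6279, mixed: 0.6279
```

The uniform term is what penalizes very small probabilities on the non-true classes.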

Effect on Gradient

For softmax cross-entropy, the gradient of the loss with respect to the logit at class k is:

∂L/∂z_k = p_k − y_smooth_k

For the true class (rainy, k=2) with p_k=0.6:

Hard label: ∂L/∂z₂ = 0.6 − 1.0 = −0.40
Smoothed label: ∂L/∂z₂ = 0.6 − 0.925 = −0.325

The smoothed gradient is 18.75% smaller. Training stops pushing z₂ upward the moment p₂ reaches 0.925 — not 1.0. At that point, the model has finite logits, and the probability mass on other classes is still 0.025 each.

For an incorrect class (sunny, k=0) with p_k=0.1:

Hard label: ∂L/∂z₀ = 0.1 − 0.0 = +0.10 (push down)
Smoothed label: ∂L/∂z₀ = 0.1 − 0.025 = +0.075 (push down, less aggressively)
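The formula ∂L/∂z_k = p_k − y_smooth_k can be sanity-checked with finite differences. This is a sketch, not a general-purpose gradient checker: it assumes logits z = log(p), which is one of many logit vectors whose softmax equals the anchor prediction, and the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # Cross-entropy between target distribution y and softmax(z).
    return -np.sum(y * np.log(softmax(z)))

p = np.array([0.1, 0.2, 0.6, 0.1])
z = np.log(p)  # assumed logits; softmax(z) reproduces p because p sums to 1
y_smooth = np.array([0.025, 0.025, 0.925, 0.025])

analytic = softmax(z) - y_smooth

# Central finite differences, perturbing one logit at a time.
h = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = h
    numeric[k] = (loss(z + step, y_smooth) - loss(z - step, y_smooth)) / (2 * h)

print("analytic:", np.round(analytic, 4))
print("numeric: ", np.round(numeric, 4))
```

Both vectors should agree, with −0.325 at the true class and +0.075 at sunny, matching the values worked out above.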

ε Sensitivity

ε	true class label (y_smooth)	others
0.0	1.000	0.000
0.1	0.925	0.025
0.2	0.850	0.050
0.3	0.775	0.075

At ε=0: standard cross-entropy, hard targets. At ε=0.1: standard default (used in Vision Transformer, T5, Inception-v3). At ε=0.3: heavy smoothing — the true class label is 0.775. The model may underfit because correctly classified samples still receive significant gradient pushback.
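To see how ε controls where the logits settle, the following toy simulation runs gradient descent directly on the logits of a single example. It is a sketch under strong assumptions (one sample, no network, a hand-picked learning rate and step count), and `final_logit_gap` is a hypothetical helper, so it illustrates the fixed point rather than real training dynamics.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def final_logit_gap(epsilon, K=4, true_class=2, lr=0.5, steps=5000):
    # Build the smoothed target, then descend on the logits themselves.
    y = np.full(K, epsilon / K)
    y[true_class] += 1.0 - epsilon
    z = np.zeros(K)
    for _ in range(steps):
        z -= lr * (softmax(z) - y)  # gradient of CE w.r.t. logits is p - y
    other = (true_class + 1) % K
    return z[true_class] - z[other]

gaps = {eps: final_logit_gap(eps) for eps in [0.0, 0.1, 0.2, 0.3]}
for eps, gap in gaps.items():
    print(f"eps={eps}: true-vs-other logit gap = {gap:.2f}")
```

For ε > 0 the gap converges to log(y_true / y_other), for example log(0.925 / 0.025) = log(37) ≈ 3.61 at ε = 0.1. At ε = 0 there is no fixed point, so the reported gap depends on how many steps are run and keeps growing as `steps` increases.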

Code

python

import numpy as np

def label_smooth(y_hard, epsilon, K):
    return (1 - epsilon) * y_hard + epsilon / K

def cross_entropy(y, p, eps=1e-10):
    return -np.sum(y * np.log(p + eps))

K = 4
true_class = 2
epsilon = 0.1
classes = ["sunny", "cloudy", "rainy", "stormy"]

y_hard = np.zeros(K); y_hard[true_class] = 1.0
y_smooth = label_smooth(y_hard, epsilon, K)
p = np.array([0.1, 0.2, 0.6, 0.1])

ce_hard = cross_entropy(y_hard, p)
ce_smooth = cross_entropy(y_smooth, p)

print(f"{'Class':>8} | {'y_hard':>7} | {'y_smooth':>9} | {'log(p)':>8} | {'y_s*log(p)':>11}")
for c, yh, ys, pk in zip(classes, y_hard, y_smooth, p):
    print(f"{c:>8} | {yh:>7.3f} | {ys:>9.3f} | {np.log(pk):>8.4f} | {ys*np.log(pk):>11.4f}")
print(f"\nCE (hard labels):     {ce_hard:.4f}")
print(f"CE (label smoothing): {ce_smooth:.4f}")

grad_hard = p[true_class] - 1
grad_smooth = p[true_class] - (1 - epsilon + epsilon / K)
print(f"\nGradient at true class — hard: {grad_hard:.4f}, smooth: {grad_smooth:.4f}")

print("\nε sensitivity (smoothed label for true class):")
for eps in [0.0, 0.1, 0.2, 0.3]:
    ys = label_smooth(y_hard, eps, K)
    print(f"  ε={eps}: true class label = {ys[true_class]:.3f}, others = {ys[0]:.3f}")

text

   Class |  y_hard |  y_smooth |   log(p) |  y_s*log(p)
   sunny |   0.000 |     0.025 |  -2.3026 |     -0.0576
  cloudy |   0.000 |     0.025 |  -1.6094 |     -0.0402
   rainy |   1.000 |     0.925 |  -0.5108 |     -0.4725
  stormy |   0.000 |     0.025 |  -2.3026 |     -0.0576

CE (hard labels):     0.5108
CE (label smoothing): 0.6279

Gradient at true class — hard: -0.4000, smooth: -0.3250

ε sensitivity (smoothed label for true class):
  ε=0.0: true class label = 1.000, others = 0.000
  ε=0.1: true class label = 0.925, others = 0.025
  ε=0.2: true class label = 0.850, others = 0.050
  ε=0.3: true class label = 0.775, others = 0.075

Related Concepts

Label smoothing modifies the target distribution in cross-entropy (03-classification-losses.md), which is applied after softmax (03-activations/07-softmax.md) converts logits to probabilities. Knowledge distillation uses soft targets from a teacher model. It can be read as label smoothing taken to its logical conclusion, where the "smoothed" targets are actual teacher probabilities rather than a uniform distribution. Temperature scaling for calibration is the complementary approach: instead of modifying targets during training, it rescales logits at inference time to reduce overconfidence.
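For contrast with label smoothing, here is a minimal sketch of temperature scaling applied to the overconfident logits from the setup section. The temperatures are illustrative values picked by hand; in practice T is typically fit on held-out data, which this sketch does not do.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([0.0, 0.0, 7.3, 0.0])  # overconfident logits, true class at index 2
temps = [1.0, 2.0, 4.0]                  # hypothetical temperatures, not fitted

# Dividing logits by T > 1 flattens the distribution without changing the argmax.
probs = {T: softmax(logits / T) for T in temps}
for T, p in probs.items():
    print(f"T={T}: max prob = {p.max():.3f}, argmax = {p.argmax()}")
```

The predicted class stays the same at every temperature; only the confidence attached to it drops, which is why temperature scaling leaves accuracy untouched.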

Honest Limitations

The right ε is task-dependent and must be tuned. With ε=0.2 on a task with genuine near-perfect training accuracy (e.g., fine-grained image recognition with 10,000 labeled examples per class), you prevent the model from ever being highly confident, which hurts both training speed and final accuracy. The default ε=0.1 may not transfer across domains.

Label smoothing is not useful for regression or when calibrated probabilities are the output. Smoothing is specifically about preventing overconfident categorical distributions. For regression tasks, confidence calibration is handled by uncertainty estimation methods (MC dropout, deep ensembles, conformal prediction), not label smoothing.

Post-hoc error analysis becomes harder. When the model makes errors, you want to understand which classes it confuses. With soft labels, the training signal says "be a little uncertain about everything" — so the model's confusion matrix is harder to read. You can't tell whether the model learned that rainy and cloudy are visually similar (useful signal) or whether it's just uniformly uncertain about everything (label smoothing side effect).

Test Your Understanding

With ε=0.1 and K=10 classes, compute the smoothed label for the true class and for each other class. How does the smoothed label for the true class change relative to K=4?
A model outputs p = [0.02, 0.01, 0.95, 0.02] for the anchor. Compute the gradient at the true class (rainy, k=2) under hard CE and under label smoothing with ε=0.1. Which gradient is larger in magnitude?
Why does label smoothing have the same effect as adding a KL term between the model's output distribution and a uniform distribution to the cross-entropy loss? Write out the modified loss in terms of both formulations.
You train an image classifier with label smoothing ε=0.1. At test time, you check the model's top-1 softmax probability on correctly classified examples and find it averages 0.71. Is this expected? What maximum probability do you expect for this ε and K?
A model trained with hard labels gives accuracy 94% on validation. Retraining with ε=0.1 gives 95% but the maximum softmax output drops from 0.97 to 0.88. An engineer says "the model is less confident, so it's worse — we should use the original." What is the flaw in this reasoning?
