
Softmax for Multiclass Classification

Jul 1, 2026 · 8 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

Binary classification maps one neuron's output to a probability using sigmoid. Multiclass classification — three or more mutually exclusive categories — needs something different. You have one output neuron per class, and you need all their outputs to sum to 1 so the result is a valid probability distribution. Softmax does exactly that.

The anchor throughout: a 3-class iris classifier. The final hidden layer produces a logit vector z = [2.1, 0.8, −0.5], one value per species (setosa, versicolor, virginica).


The Formula

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Each output is the exponentiated logit for that class divided by the sum of all exponentiated logits. This forces all outputs to be positive and to sum to 1.

Step-by-Step for z = [2.1, 0.8, −0.5]

Step 1 — Exponentiate each logit:

e²·¹ = 8.1662, e⁰·⁸ = 2.2255, e⁻⁰·⁵ = 0.6065

Step 2 — Sum:

8.1662 + 2.2255 + 0.6065 = 10.9982

Step 3 — Divide:

p₁ = 8.1662 / 10.9982 = 0.7425 (setosa)

p₂ = 2.2255 / 10.9982 = 0.2024 (versicolor)

p₃ = 0.6065 / 10.9982 = 0.0551 (virginica)

Verify: 0.7425 + 0.2024 + 0.0551 = 1.0000 ✓

The network says: 74.25% setosa, 20.24% versicolor, 5.51% virginica. The highest logit (2.1) produces the highest probability (74.25%), and the gap between classes is wider than the raw logits suggest: softmax amplifies logit differences, as the next section shows.
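The three steps can be reproduced in a few lines. This is a minimal sketch in plain Python (standard library only, so it runs without NumPy); the variable names are illustrative rather than taken from any library.

```python
import math

# Logits for (setosa, versicolor, virginica) from the running example.
z = [2.1, 0.8, -0.5]

# Step 1: exponentiate each logit.
exps = [math.exp(v) for v in z]

# Step 2: sum the exponentials to get the normalizing constant.
total = sum(exps)

# Step 3: divide each exponential by the sum.
probs = [e / total for e in exps]

print("exp(z):", [round(e, 4) for e in exps])
print("sum:   ", round(total, 4))
print("probs: ", [round(p, 4) for p in probs])
print("check: ", sum(probs))  # 1.0 up to floating-point rounding
```

Rounding happens only at print time, so the probabilities are computed from full-precision intermediates rather than from the four-digit values shown in the hand calculation.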

[Figure: Softmax: Raw Logits → Probabilities. Raw logits z are exponentiated, then each eᶻ is divided by the sum to give probabilities p (sum = 1, all values in (0, 1)).]

Why Exponentiation?

Three reasons exponentiation is the right operation here:

1. Guarantees positive outputs. Even if z₃ = −0.5, e⁻⁰·⁵ = 0.607 > 0. Any real-valued logit becomes positive after exponentiation. Division then places each value in (0, 1).

2. Amplifies differences. The logit gap between class 1 and class 2 is 2.1 − 0.8 = 1.3. The exponentiated ratio is e¹·³ = 3.67. Softmax is "more decisive" than raw z — it sharpens the contrast between high and low logits. A small logit advantage becomes a large probability advantage.

3. Preserves ordering. If z₁ > z₂, then e^z₁ > e^z₂, so p₁ > p₂. The argmax of z equals the argmax of softmax(z). The predicted class never changes — only the probability values change.
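Each of the three properties can be checked directly on the anchor logits. A small sketch, assuming a hand-rolled `softmax` helper defined here for illustration (plain Python, no NumPy):

```python
import math

def softmax(z):
    """Plain softmax over a list of logits (no stability shift; fine for small values)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.1, 0.8, -0.5]
p = softmax(z)

# 1. Positivity: every probability lies strictly between 0 and 1.
assert all(0.0 < pi < 1.0 for pi in p)

# 2. Amplification: the probability ratio equals exp(logit gap).
ratio = p[0] / p[1]
assert math.isclose(ratio, math.exp(z[0] - z[1]))

# 3. Ordering: the largest logit and the largest probability share an index.
assert z.index(max(z)) == p.index(max(p))

print(f"p1/p2 = {ratio:.2f}")  # prints p1/p2 = 3.67
```

The second check is why a logit gap of 1.3 becomes a 3.67× probability ratio: the normalizing sum cancels when two softmax outputs are divided.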


Temperature

Softmax with temperature τ: softmax(z/τ)

| τ | Behavior | Probabilities for z = [2.1, 0.8, −0.5] |
|---|---|---|
| 0.1 | Near one-hot: winner takes almost all | [1.0000, 0.0000, 0.0000] |
| 1.0 | Standard | [0.7425, 0.2024, 0.0551] |
| 5.0 | Flatter, closer to uniform: higher uncertainty | [0.4227, 0.3259, 0.2513] |

As τ → 0: the highest logit dominates completely — essentially argmax. As τ → ∞: all probabilities converge to 1/K (uniform distribution). Temperature is used in knowledge distillation, where τ > 1 "softens" the teacher model's predictions to provide richer supervision signals to the student.
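Both limits are easy to observe numerically. The sketch below assumes a helper named `softmax_t` (a name introduced here, not a library function) that divides the logits by τ and then applies a max-shifted softmax:

```python
import math

def softmax_t(z, tau=1.0):
    """Softmax of z / tau, shifted by the max for numerical stability."""
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.1, 0.8, -0.5]

# Sweep from very sharp (tau -> 0) to very flat (tau -> infinity).
for tau in (0.01, 0.1, 1.0, 5.0, 100.0):
    probs = softmax_t(z, tau)
    print(f"tau={tau:>6}: " + ", ".join(f"{p:.4f}" for p in probs))
```

At the small end the first entry approaches 1; at the large end all three entries approach 1/3, matching the 1/K limit for K = 3.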

[Figure: Softmax Temperature τ: Sharp → Balanced. Three panels: τ = 0.1 (sharp, winner-takes-all), τ = 1.0 (standard, default setting), τ = 5.0 (soft, flatter distribution).]

Softmax + Cross-Entropy: The Combined Gradient

The cross-entropy loss for a one-hot label y = [1, 0, 0] (setosa is correct):

L = −Σᵢ yᵢ log(pᵢ) = −[1·log(0.7425) + 0·log(0.2024) + 0·log(0.0551)]

L = −log(0.7425) = 0.2977 (natural log)
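Because y is one-hot, only the correct class's term survives the sum. The sketch below, in plain Python, shows three equivalent ways to get the same loss. The log-sum-exp form is a standard identity rather than something stated in this post, and the variable names are illustrative.

```python
import math

z = [2.1, 0.8, -0.5]
y = [1, 0, 0]  # one-hot label: setosa is correct

exps = [math.exp(v) for v in z]
total = sum(exps)
p = [e / total for e in exps]

# Full definition: sum over all classes (zero terms contribute nothing).
loss_full = -sum(yi * math.log(pi) for yi, pi in zip(y, p))

# Shortcut: just the negative log-probability of the correct class.
correct = y.index(1)
loss_short = -math.log(p[correct])

# Log-sum-exp form: log(sum_j exp(z_j)) - z_correct, no division needed.
loss_lse = math.log(total) - z[correct]

print(f"{loss_full:.4f} {loss_short:.4f} {loss_lse:.4f}")
```

All three values agree to floating-point precision. The last form is the one frameworks tend to build on, since it works from logits directly.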

The gradient ∂L/∂zᵢ is derived by applying the chain rule through the softmax to the cross-entropy. The full derivation involves two cases:

For i = c, where c is the index of the correct class: ∂L/∂zᵢ = pᵢ − 1

For i ≠ c (all other classes): ∂L/∂zᵢ = pᵢ

Combined: ∂L/∂zᵢ = pᵢ − yᵢ

This is the "subtract one from the correct class" rule. Computing the gradients for our anchor:

∂L/∂z₁ = 0.7425 − 1 = −0.2575 (setosa: negative, so gradient descent raises this logit)

∂L/∂z₂ = 0.2024 − 0 = +0.2024 (versicolor: push this logit down)

∂L/∂z₃ = 0.0551 − 0 = +0.0551 (virginica: push this logit down)

The negative gradient for the correct class means the loss decreases when z₁ increases, so the optimizer will increase z₁ and decrease z₂ and z₃. Sum of all gradients: −0.2575 + 0.2024 + 0.0551 = 0.0000 ✓. The gradients always sum to zero because Σpᵢ = Σyᵢ = 1.
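The p − y rule can be verified without trusting the derivation. The sketch below compares it against central finite differences, then applies one gradient-descent step to confirm the loss drops. Updating the logits directly (in a real network the update goes to the weights) and the learning rate of 0.5 are simplifying assumptions, and the helper names are introduced here for illustration.

```python
import math

def softmax(z):
    """Max-shifted softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_fn(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(z)))

z = [2.1, 0.8, -0.5]
y = [1, 0, 0]

# Analytic gradient from the combined rule: p - y.
analytic = [pi - yi for pi, yi in zip(softmax(z), y)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = z.copy()
    z_minus = z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric.append((loss_fn(z_plus, y) - loss_fn(z_minus, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"largest analytic/numeric gap: {max_gap:.2e}")

# One gradient-descent step applied directly to the logits.
lr = 0.5  # hypothetical learning rate
z_new = [zi - lr * gi for zi, gi in zip(z, analytic)]
print(f"loss before: {loss_fn(z, y):.4f}")
print(f"loss after:  {loss_fn(z_new, y):.4f}")
```

After the step, the setosa logit has gone up and the other two have gone down, which is the direction the signs of the gradient imply.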


Numerical Stability

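The problem: For large logits like z = [100, 200, 300], e³⁰⁰ ≈ 1.9 × 10¹³⁰. That overflows IEEE 754 float32 (max ≈ 3.4 × 10³⁸) to inf; float64 (max ≈ 1.8 × 10³⁰⁸) holds out longer but overflows once a logit exceeds about 709. The division then computes inf/inf, which is NaN.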

The fix: Subtract max(z) before exponentiating. This is mathematically equivalent because:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ) = exp(zᵢ − c) / Σⱼ exp(zⱼ − c) (for any constant c, since the common factor exp(−c) cancels between numerator and denominator)

Setting c = max(z): z' = z − max(z) = [2.1−2.1, 0.8−2.1, −0.5−2.1] = [0, −1.3, −2.6]

e⁰ = 1, e⁻¹·³ = 0.27253, e⁻²·⁶ = 0.07427

Sum = 1 + 0.27253 + 0.07427 = 1.34680

p = [1/1.34680, 0.27253/1.34680, 0.07427/1.34680] = [0.7425, 0.2024, 0.0551] ✓

Same result, no overflow. The largest value is always e⁰ = 1 after subtracting the max.
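The failure and the fix can be seen side by side. This sketch uses plain Python floats (float64), so the logits are chosen large enough (1000 and up) to exceed that range. Note that Python's `math.exp` raises `OverflowError` where NumPy's `np.exp` would return inf, but the cause is the same. The function names are illustrative.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]        # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]    # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 2000.0, 3000.0]

try:
    softmax_naive(big)
    naive_ok = True
except OverflowError:
    naive_ok = False

print("naive succeeded:", naive_ok)             # prints False
print("stable result:  ", softmax_stable(big))  # prints [0.0, 0.0, 1.0]

# On small logits the two versions agree.
small = [2.1, 0.8, -0.5]
gap = max(abs(a - b) for a, b in zip(softmax_naive(small), softmax_stable(small)))
print(f"max difference on small logits: {gap:.1e}")
```

The shifted exponentials can underflow to 0.0 for classes far below the maximum, but that is harmless: those classes would have negligible probability anyway, and the denominator is always at least 1.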


When to Use Softmax

Use softmax when:

  • Multiclass classification output layer with mutually exclusive classes (exactly one class is correct)
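  • You need a normalized probability distribution over all K classes (whether those probabilities are well calibrated is a separate question, covered under the limitations)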

Do not use softmax when:

  • Multi-label classification (a sample can belong to multiple classes simultaneously) — use sigmoid per output neuron independently
  • Hidden layers: softmax is not a general-purpose hidden activation; in a classifier it belongs at the output layer
  • Binary classification — sigmoid is simpler and equivalent for K=2
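The multi-label case is easy to see on a sample where two labels are both present. A minimal sketch in plain Python; the logits are made up for illustration and the helper names are not from any library:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical logits: the first two labels are both strongly indicated.
z = [3.0, 3.0, -2.0]

soft = softmax(z)
sig = [sigmoid(v) for v in z]

print("softmax:", [round(p, 3) for p in soft])
print("sigmoid:", [round(p, 3) for p in sig])
```

Softmax makes the two strong labels compete for the same unit of probability mass, so neither can exceed 0.5. Independent sigmoids score each label on its own, and both come out above 0.95.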

Code

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax of z / T for a 1-D logit vector z and temperature T > 0."""
    z_shifted = z - np.max(z)   # numerical stability: largest entry becomes 0
    e = np.exp(z_shifted / T)
    return e / e.sum()

def cross_entropy(p, y):
    """Cross-entropy between probabilities p and a one-hot label y."""
    return -np.sum(y * np.log(p + 1e-8))   # 1e-8 guards against log(0)

z = np.array([2.1, 0.8, -0.5])
y = np.array([1, 0, 0])

p = softmax(z)
loss = cross_entropy(p, y)
grad = p - y   # combined softmax + cross-entropy gradient

print(f"Logits z:        {z}")
print(f"Probabilities p: {np.round(p, 4)}")
print(f"CE Loss:         {loss:.4f}")
print(f"Gradient ∂L/∂z:  {np.round(grad, 4)}")
print(f"|Gradient sum|:  {abs(grad.sum()):.6f}  (should be ≈0)")
print()
print("Temperature sweep:")
for T in [0.1, 1.0, 5.0]:
    pt = softmax(z, T)
    print(f"  τ={T}: {np.round(pt, 4)}")
```

```text
Logits z:        [ 2.1  0.8 -0.5]
Probabilities p: [0.7425 0.2024 0.0551]
CE Loss:         0.2977
Gradient ∂L/∂z:  [-0.2575  0.2024  0.0551]
|Gradient sum|:  0.000000  (should be ≈0)

Temperature sweep:
  τ=0.1: [1. 0. 0.]
  τ=1.0: [0.7425 0.2024 0.0551]
  τ=5.0: [0.4227 0.3259 0.2513]
```

Honest Limitations

Softmax forces a class. The outputs always sum to 1, so the probability mass has to land on the known classes even when the input is nothing like any training class. A photo of a dog fed to an iris classifier would still come back as some mix of setosa, versicolor, and virginica, often with one species scored highly: softmax cannot say "none of the above." For open-set recognition, separate thresholding or calibration is needed.
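One simple form of thresholding is to reject a prediction when the top probability falls below a cutoff. The sketch below is illustrative only: the function name `predict_or_reject` and the 0.6 cutoff are hypothetical, and a confident softmax output is no guarantee that the input is in-distribution, so this is a partial remedy at best.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(z, labels, threshold=0.6):
    """Return the top label, or None when the top probability is below threshold."""
    p = softmax(z)
    best = max(range(len(p)), key=p.__getitem__)
    return labels[best] if p[best] >= threshold else None

labels = ["setosa", "versicolor", "virginica"]

print(predict_or_reject([2.1, 0.8, -0.5], labels))  # top probability is about 0.74
print(predict_or_reject([0.3, 0.2, 0.1], labels))   # near-uniform logits, rejected
```

The cutoff would normally be tuned on held-out data; the value here is arbitrary.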

Overconfident on out-of-distribution inputs. Softmax probabilities are not well-calibrated by default. A network can output 99.9% confidence on an input it has never seen because the calibration depends on the training distribution. Platt scaling or temperature scaling are standard post-hoc fixes.

K-class requires K output neurons. For 1,000 classes (e.g., ImageNet-1k), the output layer has 1,000 neurons, and the weight matrix between the last hidden layer and the output grows linearly with K. Hierarchical softmax is used in extreme multiclass settings (vocabularies of 100,000+ words in language models) to make this tractable.


Where this builds from: Categorical cross-entropy is the loss paired with softmax. The gradient simplification (∂L/∂zᵢ = pᵢ − yᵢ) comes from the combined derivative of softmax and cross-entropy — the chain rule through both simultaneously collapses to this clean form. One-hot encoding is required for the y vector.

Where this leads: Multiclass output layers in nearly every classification network use softmax. Temperature scaling of softmax is the basis of knowledge distillation, where a large teacher model's softened outputs train a smaller student model. The loss function post covers cross-entropy in more depth.


Test Your Understanding

  1. Compute softmax([1.5, 1.5, 1.5]) by hand. What does the result tell you about how softmax behaves when all logits are equal? What is the gradient ∂L/∂zᵢ for the correct class (say class 1) in this case?

  2. The logit vector before softmax is z = [3.0, 0.1, 0.1]. Compute the softmax probabilities. Now add 5 to every logit: z' = [8.0, 5.1, 5.1]. Compute softmax(z'). Are the probabilities the same? Why does subtracting max(z) for numerical stability not change the output?

  3. A student says "I can use softmax for multi-label classification by just treating each output independently and thresholding at 0.5." Explain specifically why softmax is wrong for multi-label tasks, even if the threshold trick happens to give reasonable results on balanced datasets.

  4. During training with temperature τ = 1.0, the network produces z = [10.0, 0.2, 0.3] for a sample that is truly class 1 (setosa). The loss is very small because p₁ ≈ 1.0. During knowledge distillation with τ = 4.0, the teacher uses these same logits. Compute the soft labels (probabilities) under τ = 4.0 and explain why they provide richer information to the student than one-hot labels.

  5. Softmax outputs sum to 1, so the output of K neurons defines a probability over K classes. For binary classification (K=2), softmax and sigmoid produce equivalent results. Prove this algebraically: show that softmax([z, 0])₁ = σ(z).
