Classification Loss Functions

Jul 1, 2026 · 10 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Regression loss measures distance — how far is the prediction from the true number. Classification loss measures information — how surprised are you by the true label given the predicted probability distribution? The answer to that question, formalized through information theory, is cross-entropy.

Two anchors:

  • Binary: churn prediction — y=[0,1,0,1,0], ŷ=[0.541,0.823,0.312,0.791,0.458]
  • Multiclass: iris 3-class — y=[1,0,0] (setosa), ŷ=[0.742,0.202,0.055]

Binary Cross-Entropy (BCE)

L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Where It Comes From

A binary label y follows a Bernoulli distribution with parameter ŷ:

P(y | ŷ) = ŷʸ × (1−ŷ)¹⁻ʸ

Step 1 — take the negative log-likelihood:

−log P(y | ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Step 2 — that's BCE. Minimizing BCE is equivalent to maximum likelihood estimation of the Bernoulli parameter.
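The maximum-likelihood claim can be checked numerically. The sketch below is an illustration only: it assumes a model that outputs the same probability for every sample and scans a hypothetical grid of candidate values. If minimizing BCE really is Bernoulli MLE, the best constant should land on the fraction of positive labels in the binary anchor (2 of 5).

```python
import numpy as np

def mean_bce(y, p):
    """Mean binary cross-entropy when every sample gets the same probability p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 1, 0, 1, 0])  # churn labels from the binary anchor

# Hypothetical grid of candidate constant predictions (0 and 1 excluded to avoid log(0))
candidates = np.linspace(0.01, 0.99, 99)
losses = np.array([mean_bce(y, p) for p in candidates])

best_p = candidates[np.argmin(losses)]
print(f"BCE-minimizing constant: {best_p:.2f}")
print(f"Fraction of positives:   {y.mean():.2f}")
```

Both lines should report the same value, which is the maximum-likelihood estimate of the Bernoulli parameter for these five labels.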

When y = 1: the formula reduces to −log(ŷ). When y = 0: it reduces to −log(1−ŷ). Both terms penalize overconfidence in the wrong direction: predicting ŷ → 0 when y = 1 drives −log(ŷ) → ∞.

Computing BCE for the Binary Anchor

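| Sample | y | ŷ | Formula | L |
|---|---|---|---|---|
| 1 | 0 | 0.541 | −log(1−0.541) = −log(0.459) | 0.779 |
| 2 | 1 | 0.823 | −log(0.823) | 0.195 |
| 3 | 0 | 0.312 | −log(1−0.312) = −log(0.688) | 0.374 |
| 4 | 1 | 0.791 | −log(0.791) | 0.234 |
| 5 | 0 | 0.458 | −log(1−0.458) = −log(0.542) | 0.612 |

Here log is the natural logarithm.

Cost = (0.779 + 0.195 + 0.374 + 0.234 + 0.612) / 5 = 0.439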

Sample 1 has the highest loss: the network predicts 54.1% churn (more likely churned than not) but the true label is 0. Sample 2 has the lowest: 82.3% confidence for y=1 is rewarded with a low loss of 0.195.

[Figure: BCE loss versus predicted probability ŷ (0 to 1). Two curves: −log(ŷ) for y=1 and −log(1−ŷ) for y=0. Both equal 0.693 at ŷ=0.5, and confidently wrong predictions are penalized heavily.]

BCE Gradient

∂L/∂ŷ = (ŷ − y) / (ŷ(1−ŷ))

This is the gradient with respect to the prediction ŷ. When BCE is paired with a sigmoid output ŷ = σ(z), the chain rule gives ∂L/∂z = ∂L/∂ŷ · σ'(z). Because σ'(z) = ŷ(1−ŷ), the ŷ(1−ŷ) denominator cancels and leaves ∂L/∂z = ŷ − y. This cancellation is why sigmoid + BCE is the standard pairing: the gradient with respect to the logit is simply the prediction error, and it does not vanish when the sigmoid saturates on a wrong prediction.
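The ŷ − y result is easy to verify with a finite-difference check. The following is a minimal sketch, assuming a handful of made-up logits and labels (they are not from the anchors); it compares the analytic gradient with respect to the logit z against a central-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Per-sample BCE as a function of the logit z, with y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-2.0, -0.5, 0.3, 1.5])  # hypothetical logits
y = np.array([0.0, 1.0, 0.0, 1.0])    # hypothetical labels

analytic = sigmoid(z) - y             # the simplified gradient dL/dz

# Central finite difference of the loss with respect to z
eps = 1e-6
numeric = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)

max_gap = np.max(np.abs(analytic - numeric))
print("analytic:", np.round(analytic, 4))
print("numeric: ", np.round(numeric, 4))
print("max gap: ", max_gap)
```

The two gradients should agree to within floating-point noise, which supports the cancellation argument without any symbolic algebra.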


Categorical Cross-Entropy (CCE)

For K-class classification with one-hot label y = [y₁, y₂, ..., yₖ]:

L = −Σₖ yₖ · log(ŷₖ)

Because y is one-hot (exactly one yₖ = 1, all others = 0), only one term survives:

L = −log(ŷ_correct)

For the iris anchor: y = [1, 0, 0] (setosa), ŷ = [0.742, 0.202, 0.055]:

L = −[1·log(0.742) + 0·log(0.202) + 0·log(0.055)] = −log(0.742) = 0.298

CCE for Three Different Prediction Distributions

| Prediction ŷ | Correct class confidence | CCE = −log(ŷ_correct) |
|---|---|---|
| [0.742, 0.202, 0.055] | 74.2% (setosa) | −log(0.742) = 0.298 |
| [0.300, 0.500, 0.200] | 30.0% (setosa) | −log(0.300) = 1.204 |
| [0.950, 0.030, 0.020] | 95.0% (setosa) | −log(0.950) = 0.051 |

Going from 74.2% to 95% confidence reduces the loss from 0.298 to 0.051. Going from 74.2% to 30% inflates the loss to 1.204. CCE strongly rewards confident correct predictions.

[Figure: CCE for three prediction distributions with true class setosa. ŷ=[0.742,0.202,0.055] gives CCE=0.298, ŷ=[0.3,0.5,0.2] gives CCE=1.204, and ŷ=[0.95,0.03,0.02] gives CCE=0.051. Blue marks the correct class (setosa); a higher correct-class probability means a lower CCE.]

Sparse Categorical Cross-Entropy

Identical math to CCE — only the label format differs:

  • CCE: y = [1, 0, 0] (one-hot vector, length K)
  • Sparse CCE: y = 0 (class index as integer)

Both compute L = −log(ŷ[y_true_class]). For the iris anchor: L = −log(ŷ[0]) = −log(0.742) = 0.298 — exactly the same result.

Sparse CCE is preferred when K is large (ImageNet: 1000 classes). A one-hot label of length 1000 stores 1000 values per sample where a single integer index would do, so at the same element size the labels take about 1000× more memory.
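The equivalence of the two label formats can be shown in a few lines of NumPy. This is a sketch under assumptions: the mini-batch simply stacks the three setosa predictions from the CCE section, and the variable names are illustrative rather than taken from any framework.

```python
import numpy as np

# Hypothetical mini-batch built from the three setosa predictions in the CCE section
y_hat = np.array([
    [0.742, 0.202, 0.055],
    [0.300, 0.500, 0.200],
    [0.950, 0.030, 0.020],
])
labels_sparse = np.array([0, 0, 0])       # class index per sample (0 = setosa)
labels_onehot = np.eye(3)[labels_sparse]  # the same labels as one-hot rows

# CCE: mask with the one-hot rows, then sum across classes
cce = -np.sum(labels_onehot * np.log(y_hat), axis=1)

# Sparse CCE: index the true-class probability directly
rows = np.arange(len(labels_sparse))
sparse_cce = -np.log(y_hat[rows, labels_sparse])

print(np.round(cce, 3))
print(np.round(sparse_cce, 3))
print("identical:", np.allclose(cce, sparse_cce))
```

The sparse version never materializes the K-length label vector, which is where the memory saving comes from when K is large.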


Hinge Loss

L = max(0, 1 − y · ŷ) where y ∈ {−1, +1}

Hinge loss is the SVM loss function. In deep learning it is used for binary classification with a raw linear score or a tanh output (ŷ ∈ (−1, +1)).

Four cases with y = +1:

| Case | ŷ | y·ŷ | 1 − y·ŷ | L = max(0, 1 − y·ŷ) |
|---|---|---|---|---|
| Correct + confident | 0.9 | 0.9 | 0.1 | max(0, 0.1) = 0.1 |
| Correct + uncertain | 0.5 | 0.5 | 0.5 | max(0, 0.5) = 0.5 |
| Wrong | −0.5 | −0.5 | 1.5 | max(0, 1.5) = 1.5 |
| Very wrong | −0.9 | −0.9 | 1.9 | max(0, 1.9) = 1.9 |

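Hinge loss is zero only when the prediction is correct with margin y·ŷ ≥ 1, and it gives no further reward beyond that margin. With a tanh output the margin stays strictly below 1, which is why even the confident row above keeps a loss of 0.1; an exact zero requires an unbounded (linear) score. BCE, by contrast, is always strictly positive (−log(ŷ) > 0 for any ŷ < 1) and keeps rewarding predictions that move toward the correct extreme of 0 or 1.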
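The margin behaviour is easiest to see by comparing an unbounded score with its tanh-squashed version. The sketch below assumes a few hypothetical raw scores for a y = +1 sample; since tanh output lies in (−1, +1), the product y·ŷ cannot reach 1, so only the linear scores can hit a loss of exactly zero.

```python
import numpy as np

def hinge(y, score):
    """Per-sample hinge loss for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

y = 1.0
raw_scores = np.array([0.5, 1.0, 2.0, 5.0])  # hypothetical unbounded (linear) outputs

linear_loss = hinge(y, raw_scores)           # exactly zero once the margin reaches 1
tanh_loss = hinge(y, np.tanh(raw_scores))    # squashed scores stay below 1

print("linear score:", linear_loss)
print("tanh output: ", np.round(tanh_loss, 4))
```

With linear scores the loss is flat at zero for every margin of 1 or more, so those samples stop contributing gradient; with tanh the loss only approaches zero as the score grows.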

[Figure: Hinge vs BCE loss for y=+1. x-axis: prediction ŷ (−1 to +1 for hinge; 0 to 1 for BCE). Hinge is piecewise linear and zero once the margin is at least 1; BCE is a smooth log curve.]

Focal Loss (for Imbalanced Datasets)

FL(pₜ) = −(1 − pₜ)^γ · log(pₜ)

where pₜ is the model's predicted probability for the true class.

When γ = 0: Focal Loss = standard cross-entropy. When γ > 0, easy examples (high pₜ) are down-weighted so the model focuses on hard, misclassified examples.

For an easy sample (pₜ = 0.9) with γ ∈ {0, 1, 2}:

| γ | (1 − pₜ)^γ | −log(0.9) | Focal Loss |
|---|---|---|---|
| 0 | 1.0 | 0.1054 | 0.1054 |
| 1 | (1−0.9)¹ = 0.1 | 0.1054 | 0.1 × 0.1054 = 0.0105 |
| 2 | (1−0.9)² = 0.01 | 0.1054 | 0.01 × 0.1054 = 0.0011 |

With γ = 2, the easy sample contributes 100× less to the loss than it would with standard CE, so most of the gradient goes to hard examples (typically the rare foreground class in imbalanced detection, the setting RetinaNet was built for). Focal loss is not a general-purpose classification loss; it is designed specifically for class imbalance where easy negatives overwhelm the training signal.
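To see how the modulating factor behaves across difficulty levels, the sketch below evaluates focal loss at γ = 2 for a few true-class probabilities. The probabilities are hypothetical, and the function omits the optional class-balancing weight α, matching the formula given in this section.

```python
import numpy as np

def focal_loss(p_t, gamma):
    """Focal loss for true-class probability p_t (no alpha weighting)."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p_t = np.array([0.1, 0.5, 0.9, 0.99])  # hypothetical probabilities, hard to easy
ce = focal_loss(p_t, gamma=0)          # gamma = 0 is plain cross-entropy
fl = focal_loss(p_t, gamma=2)

for p, c, f in zip(p_t, ce, fl):
    print(f"p_t={p:.2f}  CE={c:.4f}  FL={f:.4f}  fraction kept={f / c:.4f}")
```

The fraction of the cross-entropy that survives is exactly (1 − pₜ)², so a hard sample keeps most of its loss while a very easy one is almost erased.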


Comparison Table

| Loss | Use case | Output activation | Label format |
|---|---|---|---|
| BCE | Binary classification | Sigmoid | 0 or 1 |
| CCE | Multiclass (mutually exclusive) | Softmax | One-hot |
| Sparse CCE | Multiclass (large K) | Softmax | Integer index |
| Hinge | SVM-style binary | Tanh or linear | −1 or +1 |
| Focal | Imbalanced classification | Sigmoid | 0 or 1 |

Code

```python
import numpy as np

# The 1e-8 terms below guard against log(0) when a prediction is exactly 0 or 1

# Binary CE: mean of the per-sample losses for the churn anchor
y_b    = np.array([0, 1, 0, 1, 0])
yhat_b = np.array([0.541, 0.823, 0.312, 0.791, 0.458])
bce = -np.mean(y_b * np.log(yhat_b + 1e-8) + (1 - y_b) * np.log(1 - yhat_b + 1e-8))

# Categorical CE: single iris sample with a one-hot label
y_c    = np.array([1, 0, 0])
yhat_c = np.array([0.742, 0.202, 0.055])
cce = -np.sum(y_c * np.log(yhat_c + 1e-8))

# Hinge loss (y in {-1, +1}); the third sample is on the wrong side of the boundary
y_h    = np.array([ 1,   1,  -1])
yhat_h = np.array([0.9, 0.5, 0.5])
hinge  = np.mean(np.maximum(0, 1 - y_h * yhat_h))

# Focal loss for an easy sample at increasing gamma
pt = 0.9
print("Focal loss comparison (pₜ=0.9):")
for gamma in [0, 1, 2]:
    fl = -(1 - pt)**gamma * np.log(pt + 1e-8)
    print(f"  γ={gamma}: FL = {fl:.4f}")

print(f"\nBCE:   {bce:.4f}")
print(f"CCE:   {cce:.4f}")
print(f"Hinge: {hinge:.4f}")
```

```text
Focal loss comparison (pₜ=0.9):
  γ=0: FL = 0.1054
  γ=1: FL = 0.0105
  γ=2: FL = 0.0011

BCE:   0.4389
CCE:   0.2984
Hinge: 0.7000
```

Where this builds from: Loss vs cost (01) established that these formulas compute per-sample losses, which are averaged into a cost. Sigmoid and softmax activations (section 3) produce the probability outputs these losses consume. The BCE gradient simplification (ŷ − y when paired with sigmoid) follows directly from the sigmoid derivative covered in the activation posts.

Where this leads: The loss guide (04) synthesizes when to use each loss. Focal loss is the standard loss for one-stage object detectors (RetinaNet). CCE gradients flow back through softmax into the encoder layers of transformer models.


Honest Limitations

BCE assumes independent binary outputs. When you apply BCE per-output in a multi-label problem, you are assuming each class is independently predicted. This is an approximation — in practice, class co-occurrence patterns (a cat is more likely to also be "animal") are not modeled. Multi-label models sometimes use structured prediction losses that capture this, though at significant complexity cost.

CCE with one-hot labels encourages overconfidence. One-hot labels assert 100% confidence in a single class, yet annotators make mistakes and some samples are genuinely ambiguous. Raw CCE with one-hot labels keeps pushing the correct-class probability toward 1, which harms calibration. Label smoothing replaces [1, 0, 0] with a softer target such as [0.9, 0.05, 0.05], which reduces this overconfidence.
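The effect of label smoothing can be sketched with two candidate predictions. The helper below is hypothetical and uses one particular convention, chosen because it reproduces the [0.9, 0.05, 0.05] target above: it moves ε of the mass from the true class and splits it evenly over the other K−1 classes. Some libraries instead spread ε over all K classes, so their smoothed targets differ slightly.

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """Move eps of the mass from the true class to the other K-1 classes (one convention)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)

def cross_entropy(target, y_hat):
    """Cross-entropy between a target distribution and a prediction."""
    return -np.sum(target * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0])
y_smooth = smooth_labels(y, eps=0.1)       # [0.9, 0.05, 0.05]

# Two hypothetical predictions for the same sample
moderate = np.array([0.90, 0.05, 0.05])
extreme = np.array([0.998, 0.001, 0.001])

for name, p in [("moderate", moderate), ("extreme", extreme)]:
    print(f"{name:9s} one-hot CE={cross_entropy(y, p):.4f}  "
          f"smoothed CE={cross_entropy(y_smooth, p):.4f}")
```

Under the one-hot target the extreme prediction has the lower loss, so training keeps pushing toward it. Under the smoothed target the moderate prediction wins, because the loss is minimized when the prediction matches the softened target.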

Hinge loss does not produce calibrated probabilities. SVM with hinge loss finds the maximum-margin hyperplane, but the output ŷ is a score, not a probability. Platt scaling or isotonic regression must be applied post-training to convert hinge scores to calibrated probabilities.


Test Your Understanding

  1. For a sample with y = 0 and ŷ = 0.9, compute the BCE loss. Then compute the BCE loss for y = 0, ŷ = 0.1. The network is far more wrong in the first case. How much larger is the first loss? What property of the log function causes this penalty to be so severe?

  2. CCE with one-hot labels depends only on the probability assigned to the correct class. If the true class is setosa (class 0) and the predictions are [0.4, 0.55, 0.05], the loss is −log(0.4) ≈ 0.916. If instead they are [0.4, 0.3, 0.3], the loss is identical. Does the distribution among incorrect classes matter for the loss? Does it matter for the gradient? Explain.

  3. Hinge loss is zero when the margin is at least 1. For a correct prediction (y = +1) with ŷ = 0.95, the loss is max(0, 1 − 0.95) = 0.05. With BCE, treating 0.95 as the logit, this same sample would have loss −log(sigmoid(0.95)), which is also small but nonzero. In what training scenario would you prefer the hard zero of hinge over BCE's soft log penalty?

  4. Focal loss with γ = 2 reduces the contribution of easy samples by (1−pₜ)². For a hard sample (pₜ = 0.2), compute how much focal loss changes compared to standard CE. Then compare this with an easy sample (pₜ = 0.9). What is the ratio of hard-sample to easy-sample focal loss at γ = 2?

  5. A multi-label image classifier must predict whether each of 80 COCO categories is present in an image. You apply sigmoid to each output and use BCE per output. An image is fully background (all 80 labels = 0), but the model predicts class 5 at 0.6 and all others below 0.1. Compute the contribution of the class 5 output to the total BCE loss. Does the 1/80 averaging affect whether the gradient for class 5 is meaningful?
