Classification Loss Functions

Jul 1, 2026 • 10 min read • By Mohammed Vasim

deep-learning, neural-networks, machine-learning, representation-learning

Regression loss measures distance: how far is the prediction from the true number? Classification loss measures information: how surprised are you by the true label, given the predicted probability distribution? Formalized through information theory, the answer to that second question is cross-entropy.

Two anchors:

Binary: churn prediction — y=[0,1,0,1,0], ŷ=[0.541,0.823,0.312,0.791,0.458]
Multiclass: iris 3-class — y=[1,0,0] (setosa), ŷ=[0.742,0.202,0.055]

Binary Cross-Entropy (BCE)

L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]   (log is the natural logarithm throughout)

Where It Comes From

A binary label y follows a Bernoulli distribution with parameter ŷ:

P(y | ŷ) = ŷʸ × (1−ŷ)¹⁻ʸ

Step 1 — take the negative log-likelihood:

−log P(y | ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Step 2 — that's BCE. Minimizing BCE is equivalent to maximum likelihood estimation of the Bernoulli parameter.

When y = 1: the formula reduces to −log(ŷ). When y = 0: it reduces to −log(1−ŷ). Both terms penalize overconfidence in the wrong direction: predicting ŷ → 0 when y = 1 drives −log(ŷ) → ∞.
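
The derivation above can be checked numerically. This is a minimal sketch, assuming NumPy and the natural logarithm; the helper names `bernoulli_likelihood` and `bce` are illustrative, not from any library. It evaluates the Bernoulli likelihood on the binary anchor and confirms that its negative log matches the BCE formula term by term.

```python
import numpy as np

def bernoulli_likelihood(y, y_hat):
    """P(y | y_hat) = y_hat^y * (1 - y_hat)^(1 - y) for y in {0, 1}."""
    return y_hat**y * (1.0 - y_hat)**(1 - y)

def bce(y, y_hat):
    """Per-sample binary cross-entropy (natural log)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

# Binary anchor from the top of the post
y = np.array([0, 1, 0, 1, 0])
y_hat = np.array([0.541, 0.823, 0.312, 0.791, 0.458])

# Step 1 of the derivation: negative log-likelihood
nll = -np.log(bernoulli_likelihood(y, y_hat))

# Step 2: it coincides with BCE for every sample
print(np.allclose(nll, bce(y, y_hat)))  # True
```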

Computing BCE for the Binary Anchor

Sample	y	ŷ	Formula	L
1	0	0.541	−log(1−0.541) = −log(0.459)	0.779
2	1	0.823	−log(0.823)	0.195
3	0	0.312	−log(1−0.312) = −log(0.688)	0.374
4	1	0.791	−log(0.791)	0.234
5	0	0.458	−log(1−0.458) = −log(0.542)	0.612

Cost = (0.779 + 0.195 + 0.374 + 0.234 + 0.612) / 5 = 0.439

Sample 1 has the highest loss: the network predicts 54.1% churn (more likely churned than not) but the true label is 0. Sample 2 has the lowest: 82.3% confidence for y=1 is rewarded with a low loss of 0.195.
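
The per-sample column can be reproduced with a few lines of NumPy. This sketch assumes the natural log and uses `np.where` to select the surviving term for each label; the variable names are illustrative.

```python
import numpy as np

y = np.array([0, 1, 0, 1, 0])
y_hat = np.array([0.541, 0.823, 0.312, 0.791, 0.458])

# Pick -log(y_hat) where y = 1 and -log(1 - y_hat) where y = 0.
per_sample = -np.where(y == 1, np.log(y_hat), np.log(1.0 - y_hat))
cost = per_sample.mean()  # cost = mean of the per-sample losses

for i, loss in enumerate(per_sample, start=1):
    print(f"sample {i}: L = {loss:.3f}")
print(f"cost = {cost:.3f}")
print(f"highest loss: sample {per_sample.argmax() + 1}")
print(f"lowest loss:  sample {per_sample.argmin() + 1}")
```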

BCE Gradient

∂L/∂ŷ = (ŷ − y) / (ŷ(1−ŷ))

When BCE is paired with a sigmoid output ŷ = σ(z), the gradient with respect to the pre-activation z simplifies to ∂L/∂z = ŷ − y. By the chain rule, ∂L/∂z = ∂L/∂ŷ · σ'(z), and the ŷ(1−ŷ) denominator of ∂L/∂ŷ cancels against σ'(z) = ŷ(1−ŷ). This cancellation is why sigmoid + BCE is the standard choice: the gradient is simply the prediction error, and it does not vanish when the sigmoid saturates on a wrong prediction.
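
A quick numerical check of the cancellation, as a sketch: the logits `z` and labels `y` below are arbitrary test values, not data from the post, and `bce_from_logit` is a hypothetical helper. The chain-rule product and a central finite difference should both agree with ŷ − y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """BCE as a function of the pre-activation z, with y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

z = np.array([-2.0, -0.5, 0.3, 1.5, 4.0])   # arbitrary test logits
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])     # arbitrary test labels
y_hat = sigmoid(z)

# Chain rule, factor by factor: dL/dz = dL/dy_hat * dy_hat/dz
dL_dyhat = (y_hat - y) / (y_hat * (1.0 - y_hat))
dyhat_dz = y_hat * (1.0 - y_hat)
chain = dL_dyhat * dyhat_dz

# Central finite difference as an independent check
h = 1e-6
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

print(np.allclose(chain, y_hat - y))
print(np.allclose(numeric, y_hat - y, atol=1e-6))
```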

Categorical Cross-Entropy (CCE)

For K-class classification with one-hot label y = [y₁, y₂, ..., yₖ]:

L = −Σₖ yₖ · log(ŷₖ)

Because y is one-hot (exactly one yₖ = 1, all others = 0), only one term survives:

L = −log(ŷ_correct)

For the iris anchor: y = [1, 0, 0] (setosa), ŷ = [0.742, 0.202, 0.055]:

L = −[1·log(0.742) + 0·log(0.202) + 0·log(0.055)] = −log(0.742) = 0.298

CCE for Three Different Prediction Distributions

Prediction ŷ	Correct class confidence	CCE = −log(ŷ_correct)
[0.742, 0.202, 0.055]	74.2% (setosa)	−log(0.742) = 0.298
[0.300, 0.500, 0.200]	30.0% (setosa)	−log(0.300) = 1.204
[0.950, 0.030, 0.020]	95.0% (setosa)	−log(0.950) = 0.051

Going from 74.2% to 95% confidence reduces the loss from 0.298 to 0.051. Going from 74.2% to 30% inflates the loss to 1.204. CCE strongly rewards confident correct predictions.
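
The three rows can be recomputed with the full sum, which also shows the one-hot collapse to a single term. A minimal sketch, assuming NumPy and the natural log; `cce` is an illustrative helper for a single sample.

```python
import numpy as np

def cce(y_onehot, y_hat):
    """Categorical cross-entropy for a single sample (natural log)."""
    return -np.sum(y_onehot * np.log(y_hat))

y = np.array([1, 0, 0])  # setosa
predictions = np.array([
    [0.742, 0.202, 0.055],
    [0.300, 0.500, 0.200],
    [0.950, 0.030, 0.020],
])

losses = [cce(y, p) for p in predictions]
for p, loss in zip(predictions, losses):
    # With a one-hot label the sum collapses to -log of the correct-class entry.
    assert np.isclose(loss, -np.log(p[0]))
    print(f"{p} -> CCE = {loss:.3f}")
```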

Sparse Categorical Cross-Entropy

Identical math to CCE — only the label format differs:

CCE: y = [1, 0, 0] (one-hot vector, length K)
Sparse CCE: y = 0 (class index as integer)

Both compute L = −log(ŷ[y_true_class]). For the iris anchor: L = −log(ŷ[0]) = −log(0.742) = 0.298 — exactly the same result.

Sparse CCE is preferred when K is large (ImageNet: 1000 classes). A one-hot vector of length 1000 per sample takes 1000× the memory of a single integer label stored at the same element width.
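
A small sketch of the equivalence and the memory gap. The batch size `N` and the random softmax outputs are made up for illustration, and both label arrays are deliberately stored as 64-bit integers so the byte counts are comparable.

```python
import numpy as np

K = 1000   # number of classes (ImageNet-sized, as in the text)
N = 4      # hypothetical batch size
rng = np.random.default_rng(0)

# Random softmax outputs, for illustration only
logits = rng.normal(size=(N, K))
y_hat = np.exp(logits - logits.max(axis=1, keepdims=True))
y_hat /= y_hat.sum(axis=1, keepdims=True)

labels = rng.integers(0, K, size=N)            # sparse: one integer per sample
one_hot = np.eye(K, dtype=np.int64)[labels]    # dense: K entries per sample

# Sparse form indexes the correct-class probability directly
sparse_cce = -np.log(y_hat[np.arange(N), labels])
# Dense form multiplies by the one-hot vector and sums
dense_cce = -np.sum(one_hot * np.log(y_hat), axis=1)

print(np.allclose(sparse_cce, dense_cce))
print(one_hot.nbytes // labels.nbytes)  # memory ratio at equal element width
```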

Hinge Loss

L = max(0, 1 − y · ŷ) where y ∈ {−1, +1}

Hinge loss is the SVM loss function. In deep learning it can be used for binary classification with a tanh output (ŷ ∈ (−1, +1)) or with a linear output (an unbounded score).

Four cases with y = +1:

Case	ŷ	y·ŷ	1−y·ŷ	L = max(0, ...)
Correct + confident	0.9	0.9	0.1	max(0, 0.1) = 0.1
Correct + uncertain	0.5	0.5	0.5	max(0, 0.5) = 0.5
Wrong	−0.5	−0.5	1.5	max(0, 1.5) = 1.5
Very wrong	−0.9	−0.9	1.9	max(0, 1.9) = 1.9

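Hinge loss is zero once the prediction is correct with margin y·ŷ ≥ 1, and it gives no further reward for pushing past the margin. Note that a tanh output keeps |ŷ| < 1, so y·ŷ never reaches 1 and the loss only approaches zero (in the table above, even ŷ = 0.9 leaves L = 0.1); an exact zero requires an unbounded (linear) output. BCE, by contrast, is always strictly positive (−log(ŷ) > 0 for any ŷ < 1) and keeps rewarding predictions that move closer to the true label of 0 or 1.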
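
The table and the margin behaviour can be sketched as follows. The scores 1.0 and 2.5 are hypothetical linear outputs chosen to sit at and beyond the margin; treating them as logits for the BCE comparison is an assumption made only for this illustration.

```python
import numpy as np

def hinge(y, score):
    """Per-sample hinge loss for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

y = 1.0
scores = np.array([0.9, 0.5, -0.5, -0.9])   # the four cases from the table
losses = hinge(y, scores)
for s, l in zip(scores, losses):
    print(f"y_hat = {s:+.1f}  ->  L = {l:.1f}")

# Unbounded (linear) scores at or beyond the margin: hinge is exactly zero.
beyond = np.array([1.0, 2.5])
hinge_beyond = hinge(y, beyond)

# BCE on the same scores read as logits: -log(sigmoid(z)) = log(1 + exp(-z)) > 0.
bce_beyond = np.log1p(np.exp(-beyond))

print(hinge_beyond, bce_beyond)
```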

Focal Loss (for Imbalanced Datasets)

FL(pₜ) = −(1 − pₜ)^γ · log(pₜ)

where pₜ is the model's predicted probability for the true class.

When γ = 0: Focal Loss = standard cross-entropy. When γ > 0, easy examples (high pₜ) are down-weighted so the model focuses on hard, misclassified examples.

For an easy sample (pₜ = 0.9) with γ ∈ {0, 1, 2}:

γ	(1 − pₜ)^γ	−log(0.9)	Focal Loss
0	1.0	0.1054	0.1054
1	(1−0.9)¹ = 0.1	0.1054	0.1 × 0.1054 = 0.0105
2	(1−0.9)² = 0.01	0.1054	0.01 × 0.1054 = 0.0011

With γ = 2, the easy sample contributes 100× less to the loss than it would under standard CE, so most of the gradient signal comes from hard examples (for instance the rare foreground class in the imbalanced detection setting RetinaNet was designed for). Focal loss is not a general-purpose classification loss; it targets class imbalance where easy negatives overwhelm the training signal.
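
The down-weighting can be seen across a range of pₜ values. A minimal sketch of the formula exactly as written above; `focal_loss` is an illustrative helper, and the pₜ values 0.5 and 0.1 are added here only for contrast with the easy sample.

```python
import numpy as np

def focal_loss(p_t, gamma):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), as in the formula above."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p_t = np.array([0.9, 0.5, 0.1])   # easy, borderline, hard
for gamma in (0, 1, 2):
    print(f"gamma={gamma}:", np.round(focal_loss(p_t, gamma), 4))

ce = focal_loss(p_t, 0)     # gamma = 0 recovers standard cross-entropy
fl2 = focal_loss(p_t, 2)
scale = fl2 / ce            # equals the modulating factor (1 - p_t)^2
print("scale at gamma=2:", np.round(scale, 4))
```

The scale factor shrinks the easy sample far more than the hard one, which is the whole point of the modulating term.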

Comparison Table

Loss	Use case	Output activation	Label format
BCE	Binary classification	Sigmoid	0 or 1
CCE	Multiclass (mutually exclusive)	Softmax	One-hot
Sparse CCE	Multiclass (large K)	Softmax	Integer index
Hinge	SVM-style binary	Tanh or linear	−1 or +1
Focal	Imbalanced classification	Sigmoid	0 or 1

Code

python

import numpy as np

# Binary CE
y_b    = np.array([0, 1, 0, 1, 0])
yhat_b = np.array([0.541, 0.823, 0.312, 0.791, 0.458])
bce = -np.mean(y_b * np.log(yhat_b + 1e-8) + (1 - y_b) * np.log(1 - yhat_b + 1e-8))

# Categorical CE
y_c    = np.array([1, 0, 0])
yhat_c = np.array([0.742, 0.202, 0.055])
cce = -np.sum(y_c * np.log(yhat_c + 1e-8))

# Hinge loss (y in {-1, +1})
y_h    = np.array([ 1,   1,  -1])
yhat_h = np.array([0.9, 0.5, 0.5])
hinge  = np.mean(np.maximum(0, 1 - y_h * yhat_h))

# Focal loss for easy sample
pt = 0.9
print("Focal loss comparison (pₜ=0.9):")
for gamma in [0, 1, 2]:
    fl = -(1 - pt)**gamma * np.log(pt + 1e-8)
    print(f"  γ={gamma}: FL = {fl:.4f}")

print(f"\nBCE:   {bce:.4f}")
print(f"CCE:   {cce:.4f}")
print(f"Hinge: {hinge:.4f}")

text

Focal loss comparison (pₜ=0.9):
  γ=0: FL = 0.1054
  γ=1: FL = 0.0105
  γ=2: FL = 0.0011

BCE:   0.4389
CCE:   0.2984
Hinge: 0.7000

Where this builds from: Loss vs cost (01) established that these formulas compute per-sample losses. Sigmoid and softmax activations (section 3) produce the probability outputs these losses consume. The BCE gradient simplification (ŷ − y when paired with sigmoid) follows directly from the sigmoid derivative covered in the softmax/sigmoid activation posts.

Where this leads: The loss guide (04) synthesizes when to use each loss. Focal loss is the standard loss for one-stage object detectors (RetinaNet). CCE gradients flow back through softmax into the encoder layers of transformer models.

Honest Limitations

BCE assumes independent binary outputs. When you apply BCE per-output in a multi-label problem, you are assuming each class is independently predicted. This is an approximation — in practice, class co-occurrence patterns (a cat is more likely to also be "animal") are not modeled. Multi-label models sometimes use structured prediction losses that capture this, though at significant complexity cost.
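
The independence assumption is visible in the arithmetic: each label's term depends only on its own output. In this sketch the three-label set and all probabilities are hypothetical, chosen to echo the cat/animal example.

```python
import numpy as np

def multilabel_bce(y, y_hat):
    """Per-label BCE terms for one sample; each term sees only its own output."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

labels = ["cat", "animal", "car"]   # hypothetical label set
y = np.array([1, 1, 0])

a = multilabel_bce(y, np.array([0.9, 0.8, 0.1]))
b = multilabel_bce(y, np.array([0.9, 0.2, 0.1]))  # only the "animal" output changes

for name, la, lb in zip(labels, a, b):
    print(f"{name:6s}: {la:.3f} -> {lb:.3f}")
```

The "cat" term is identical in both runs even though the model now contradicts itself by predicting cat without animal; nothing in the loss couples the two outputs.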

CCE with one-hot labels encourages overconfidence. One-hot labels assert 100% confidence in a single class. In practice, annotators make mistakes and some samples are genuinely ambiguous. Raw CCE with one-hot labels keeps pushing the correct-class probability toward 1, which can harm calibration. Label smoothing replaces [1, 0, 0] with a softer target such as [0.9, 0.05, 0.05], which reduces that overconfidence.
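
A sketch of label smoothing that reproduces the [0.9, 0.05, 0.05] target. It assumes the convention that keeps 1 − ε on the true class and spreads ε over the other K − 1 classes; another common convention spreads ε over all K classes and would give slightly different numbers. `smooth_labels` and ε = 0.1 are illustrative choices.

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """Keep 1 - eps on the true class, spread eps evenly over the other K - 1."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + (1.0 - y_onehot) * eps / (K - 1)

y = np.array([1.0, 0.0, 0.0])             # iris anchor, setosa
y_hat = np.array([0.742, 0.202, 0.055])

y_smooth = smooth_labels(y, eps=0.1)
hard = -np.sum(y * np.log(y_hat))         # CCE against the one-hot target
soft = -np.sum(y_smooth * np.log(y_hat))  # CCE against the smoothed target

print(y_smooth)
print(f"one-hot CCE:  {hard:.3f}")
print(f"smoothed CCE: {soft:.3f}")
```

With a smoothed target the loss can no longer be driven to zero by pushing ŷ toward [1, 0, 0], so the optimum sits at a less extreme distribution.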

Hinge loss does not produce calibrated probabilities. An SVM trained with hinge loss finds the maximum-margin hyperplane, but its output ŷ is a score, not a probability. If probabilities are needed, a post-hoc calibration step such as Platt scaling or isotonic regression is applied after training.

Test Your Understanding

For a sample with y = 0 and ŷ = 0.9, compute the BCE loss. Then compute the BCE loss for y = 0, ŷ = 0.1. The network is far more wrong in the first case. How much larger is the first loss? What property of the log function causes this penalty to be so severe?
CCE depends only on the probability assigned to the correct class. If the true class is setosa (class 0) and the predictions are [0.4, 0.55, 0.05], the loss is −log(0.4) ≈ 0.916. If instead they are [0.4, 0.3, 0.3], the loss is identical. Does the distribution among incorrect classes matter for the loss? Does it matter for the gradient? Explain.
Hinge loss is zero once the margin reaches 1. For a correct prediction (y = +1) with ŷ = 0.95, the loss is max(0, 1 − 0.95) = 0.05. With BCE, this same sample would have loss −log(sigmoid(0.95)), which is also small but nonzero. In what training scenario would you prefer the hard zero of hinge over BCE's soft log penalty?
Focal loss with γ = 2 reduces the contribution of easy samples by (1−pₜ)². For a hard sample (pₜ = 0.2), compute how much focal loss changes compared to standard CE. Then compare this with an easy sample (pₜ = 0.9). What is the ratio of hard-sample to easy-sample focal loss at γ = 2?
A multi-label image classifier must predict whether each of 80 COCO categories is present in an image. You apply sigmoid to each output and use BCE per output. An image is fully background (all 80 labels = 0), but the model predicts class 5 at 0.6 and all others below 0.1. Compute the contribution of the class 5 output to the total BCE loss. Does the 1/80 averaging affect whether the gradient for class 5 is meaningful?
