Which Loss Function to Use When
You have now seen eight loss functions: MSE, MAE, Huber, BCE, CCE, Sparse CCE, Hinge, and Focal. The first question when starting a new supervised learning task is not which optimizer to use; it is what kind of output your task requires. That one decision largely determines the loss function, the output activation, and the label format together.
This post is a synthesis. No new derivations.
Decision Flowchart
Task-Loss-Activation Reference Table
| Task | Loss | Output activation | Label format |
|---|---|---|---|
| Regression | MSE | None (linear) | Float |
| Regression (with outliers) | Huber or MAE | None (linear) | Float |
| Binary classification | BCE | Sigmoid | 0 or 1 |
| Multiclass classification | CCE | Softmax | One-hot vector |
| Multiclass (large K) | Sparse CCE | Softmax | Integer index |
| Multi-label classification | BCE per output | Sigmoid per output | Binary vector |
| Object detection | Focal + L1/L2 | Sigmoid + linear | Class + bounding box |
| Imbalanced binary | Focal | Sigmoid | 0 or 1 |
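The CCE and Sparse CCE rows describe the same loss with two label formats. A minimal NumPy sketch, using made-up logits and hypothetical helper names (`cce_one_hot`, `cce_sparse`), suggests why the values should agree: the one-hot vector only selects the log-probability of the true class, which the integer index does directly.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cce_one_hot(y_one_hot, probs, eps=1e-12):
    # Categorical cross-entropy with one-hot labels, averaged over samples.
    return -np.mean(np.sum(y_one_hot * np.log(np.clip(probs, eps, 1.0)), axis=1))

def cce_sparse(y_index, probs, eps=1e-12):
    # Same loss, but each label is an integer class index.
    picked = probs[np.arange(len(y_index)), y_index]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # illustrative values
y_index = np.array([0, 2])
y_one_hot = np.eye(3)[y_index]  # one-hot encoding of the same labels

probs = softmax(logits)
loss_dense = cce_one_hot(y_one_hot, probs)
loss_sparse = cce_sparse(y_index, probs)
print(loss_dense, loss_sparse)
```

The integer format avoids materializing an N × K one-hot matrix, which is the practical reason to prefer it when K is large.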
Common Mistakes
Mistake 1: Using MSE for classification.
MSE treats predicted probabilities as ordinary real numbers and penalizes their squared distance from the true label. For y = 1 and ŷ = 0.3, the MSE gradient with respect to ŷ is 2(ŷ − y) = −1.4, but with a sigmoid output the gradient that reaches the logit is 2(ŷ − y)·ŷ(1 − ŷ) = −0.294. The BCE gradient with respect to the same logit is ŷ − y = −0.7, because the sigmoid derivative cancels. The deeper problem is curvature: MSE does not heavily penalize confident wrong predictions the way −log(ŷ) does, and its logit gradient shrinks toward zero as ŷ saturates, even when the prediction is wrong. A model trained with MSE on classification often produces poorly calibrated probabilities and converges more slowly. Fix: always use BCE for binary classification and CCE for multiclass.
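To make the gradient comparison concrete, here is a small sketch that evaluates both gradients with respect to the logit z, assuming a sigmoid output. The function names are hypothetical. The second point (ŷ = 0.01 with y = 1) is an added illustration of a confidently wrong prediction.

```python
import numpy as np

def grad_mse_wrt_logit(y, y_hat):
    # L = (y_hat - y)^2 with y_hat = sigmoid(z); the chain rule gives
    # dL/dz = 2 (y_hat - y) * y_hat * (1 - y_hat).
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_bce_wrt_logit(y, y_hat):
    # L = -[y log(y_hat) + (1 - y) log(1 - y_hat)] with y_hat = sigmoid(z);
    # the sigmoid derivative cancels, leaving dL/dz = y_hat - y.
    return y_hat - y

y = 1.0
for y_hat in (0.3, 0.01):
    print(y_hat, grad_mse_wrt_logit(y, y_hat), grad_bce_wrt_logit(y, y_hat))
```

At ŷ = 0.01 the BCE gradient stays close to −1 while the MSE gradient is close to zero, so under MSE the worst predictions receive the weakest correction.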
Mistake 2: Using CCE without softmax (sigmoid for multiclass outputs).
Softmax ensures the K outputs sum to 1, which is the correct constraint when exactly one class is true. If you apply sigmoid to each logit independently, each output is in (0,1) but they do not sum to 1. A model can output [0.9, 0.8, 0.7] simultaneously — this is mathematically inconsistent for mutually exclusive classes. Fix: always use softmax + CCE for mutually exclusive multiclass.
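A quick sketch of the difference, using logits chosen (as an assumption for illustration) so that the sigmoid outputs land near [0.9, 0.8, 0.7]:

```python
import numpy as np

logits = np.array([2.2, 1.4, 0.85])  # illustrative logits for three classes

# Independent sigmoids: each output is in (0, 1), but nothing ties them together.
sigmoid_out = 1.0 / (1.0 + np.exp(-logits))

# Softmax: outputs compete, so they always sum to 1.
shifted = logits - logits.max()
softmax_out = np.exp(shifted) / np.exp(shifted).sum()

print(sigmoid_out.round(2), sigmoid_out.sum())
print(softmax_out.round(2), softmax_out.sum())
```

The sigmoid outputs sum to well above 1, while the softmax outputs form a valid distribution over mutually exclusive classes.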
Mistake 3: Using BCE for multi-label as a single output.
If a sample can belong to multiple classes simultaneously (an image contains both a dog and a cat), a single BCE output cannot represent multiple classes at once. You need one sigmoid + BCE per class, each outputting a binary probability independently. Fix: K output neurons, K sigmoid activations, K BCE losses averaged.
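The fix can be sketched in NumPy as follows. The helper name `multilabel_bce`, the class names, and the numbers are assumptions for illustration only.

```python
import numpy as np

def multilabel_bce(y_true, logits, eps=1e-7):
    # One independent sigmoid per class, then BCE per class,
    # averaged over all samples and all K outputs.
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)
    per_class = -(y_true * np.log(probs) + (1.0 - y_true) * np.log(1.0 - probs))
    return per_class.mean()

# Two samples, K = 3 classes (say dog, cat, bird); values are illustrative.
y_true = np.array([[1.0, 1.0, 0.0],   # image contains a dog and a cat
                   [0.0, 0.0, 1.0]])  # image contains only a bird
logits = np.array([[3.0, 2.0, -4.0],
                   [-2.0, -3.0, 1.5]])

loss = multilabel_bce(y_true, logits)
print(loss)
```

Because each class has its own sigmoid, the first row can assign high probability to both dog and cat without the two competing.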
Mistake 4: Missing numerical stability clip in log.
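np.log(ŷ) returns −∞ when ŷ = 0. During training, a softmax output can be exactly 0.0 for a class if the logit is very negative. The loss then becomes inf, or NaN once a 0 × −∞ term appears, and the run either fails or silently continues with NaN weights. Fix: keep ŷ away from 0 before the log, either by clipping, e.g. np.log(np.clip(ŷ, 1e-8, 1.0)), or by adding a small epsilon, np.log(ŷ + 1e-8), or use the framework's built-in cross-entropy, which handles this internally.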
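The failure is easy to reproduce. In the sketch below (the helper name `safe_bce` and the epsilon value are assumptions), a perfectly confident and correct prediction still yields NaN in the naive formula, because 0 × log(0) evaluates to 0 × −∞.

```python
import numpy as np

def safe_bce(y_true, y_hat, eps=1e-7):
    # Clip predictions away from exactly 0 and 1 before taking logs.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

y_true = np.array([1.0, 0.0])
y_hat = np.array([1.0, 0.0])  # saturated outputs: exactly 1.0 and 0.0

# Naive BCE: log(0) = -inf, and 0 * -inf = nan. Warnings are silenced here
# only so the sketch prints cleanly.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

safe = safe_bce(y_true, y_hat)
print(naive, safe)
```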
Framework Defaults
Keras / TensorFlow:
- BinaryCrossentropy(from_logits=False): input is already sigmoid output, applies −[y log(ŷ) + ...]
- BinaryCrossentropy(from_logits=True): input is raw logit, applies sigmoid internally then BCE. This is numerically more stable because it avoids computing sigmoid then log(sigmoid), which can cause precision loss.
- CategoricalCrossentropy: CCE with one-hot labels
- SparseCategoricalCrossentropy: CCE with integer labels
As long as the flag matches what the model's last layer actually outputs, from_logits=True and from_logits=False give the same mathematical result; the difference is numerical stability. from_logits=True is preferred because the loss can then work directly on the logit, using the log-sum-exp trick internally to avoid computing exp then log in sequence.
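The stability argument can be sketched in plain NumPy. This is an illustration of the idea, not the Keras source: `bce_from_logits` uses the rearranged form max(z, 0) − z·y + log(1 + exp(−|z|)), and both function names are hypothetical.

```python
import numpy as np

def bce_from_probs(y_true, z):
    # Two-step route: sigmoid first, then log of the result.
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

def bce_from_logits(y_true, z):
    # Algebraically the same loss, rearranged so that exp() only ever sees
    # a non-positive argument and log() never sees 0.
    return np.mean(np.maximum(z, 0.0) - z * y_true + np.log1p(np.exp(-np.abs(z))))

y = np.array([1.0, 0.0, 1.0])
z_moderate = np.array([2.0, -1.0, 0.5])
z_extreme = np.array([-800.0, -800.0, -800.0])

print(bce_from_probs(y, z_moderate), bce_from_logits(y, z_moderate))

# Warnings are silenced only so the overflow case prints cleanly.
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    unstable = bce_from_probs(y, z_extreme)
stable = bce_from_logits(y, z_extreme)
print(unstable, stable)
```

For moderate logits the two routes agree. For very negative logits the sigmoid underflows to 0 and the two-step route returns inf, while the logit form stays finite.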
PyTorch:
- nn.BCELoss(): takes sigmoid output
- nn.BCEWithLogitsLoss(): takes raw logit (numerically stable, preferred)
- nn.CrossEntropyLoss(): takes raw logits, applies log-softmax internally (sparse, i.e. integer, labels)
- nn.NLLLoss(): takes log-softmax output explicitly
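The relationship between the last two entries can be sketched without PyTorch. The NumPy functions below are hypothetical stand-ins that mirror the idea (cross-entropy on logits is log-softmax followed by negative log-likelihood on integer labels); they are not the library implementation.

```python
import numpy as np

def log_softmax(logits):
    # log-sum-exp trick: shift by the row max so exp() cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(log_probs, targets):
    # Negative log-likelihood: pick the log-probability of the true class.
    return -np.mean(log_probs[np.arange(len(targets)), targets])

def cross_entropy_from_logits(logits, targets):
    # Cross-entropy on raw logits = log-softmax followed by NLL.
    return nll(log_softmax(logits), targets)

logits = np.array([[1000.0, 0.0, -1000.0],  # extreme logits, still safe
                   [0.5, 2.0, 0.1]])
targets = np.array([0, 1])                  # integer class indices

loss = cross_entropy_from_logits(logits, targets)
print(loss)
```

The practical consequence is to pass raw logits to a loss that applies log-softmax itself; applying softmax in the model and again inside the loss is a common source of quietly degraded training.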
Code
# Lookup table: task type -> recommended loss and output activation.
GUIDE = {
    'regression': 'MSE (or Huber if outliers present)',
    'binary_clf': 'BinaryCrossEntropy + Sigmoid',
    'multiclass': 'CategoricalCrossEntropy + Softmax',
    'multilabel': 'BinaryCrossEntropy per output + Sigmoid per output',
    'imbalanced_binary': 'FocalLoss + Sigmoid',
    'regression_outlier': 'Huber or MAE + Linear (no activation)',
}

def select_loss(task):
    """Return the recommended loss/activation pairing for a task key."""
    return GUIDE.get(task, 'Unknown task: check task type first')

# Print the full guide, one task per line.
for task in GUIDE:
    print(f"{task:<20}: {select_loss(task)}")

Output:
regression          : MSE (or Huber if outliers present)
binary_clf          : BinaryCrossEntropy + Sigmoid
multiclass          : CategoricalCrossEntropy + Softmax
multilabel          : BinaryCrossEntropy per output + Sigmoid per output
imbalanced_binary   : FocalLoss + Sigmoid
regression_outlier  : Huber or MAE + Linear (no activation)
Related Concepts
Where this builds from: Regression losses (02) and classification losses (03) defined the individual loss formulas. This post is the decision layer on top of those formulas.
Where this leads: The choice of loss function affects optimizer behavior — specifically, the gradient signal the optimizer receives. Section 5 (optimizers) will use BCE + the churn network as the primary example; the loss choice established here sets up that example. Section 6 (regularization) adds terms to the cost function: the regularization post modifies J, not L — that distinction comes from 01-loss-vs-cost.
Honest Limitations
The table assumes clean task definitions. Real problems often blur categories: self-supervised learning uses reconstruction loss (MSE) but the downstream task is classification. Generative models (VAEs, GANs) use custom loss formulations that don't fit the table. The table covers the 80% case.
Loss function choice affects convergence speed, not just correctness. Even if you choose a theoretically valid loss (e.g., MSE for regression), a poorly scaled MSE (with unnormalized y values in the millions) will produce huge gradients that destabilize training. Loss and data normalization are coupled: normalize y before applying MSE in regression.
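A rough sketch of the scale effect, using synthetic targets in the millions (the data, the seed, and the helper names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2_000_000.0, scale=500_000.0, size=1000)  # synthetic targets
y_hat = np.zeros_like(y)  # an untrained model predicting roughly 0

def mse_grad(y_true, y_pred):
    # Per-sample gradient of (y_pred - y_true)^2 with respect to y_pred.
    return 2.0 * (y_pred - y_true)

# Standardize the targets; keep mu and sigma to undo the transform later.
mu, sigma = y.mean(), y.std()
y_scaled = (y - mu) / sigma

raw_scale = np.abs(mse_grad(y, y_hat)).mean()
scaled_scale = np.abs(mse_grad(y_scaled, y_hat)).mean()
print(raw_scale, scaled_scale)

def to_original_units(pred_scaled):
    # Map predictions made in standardized space back to the original units.
    return pred_scaled * sigma + mu
```

The raw gradients are several orders of magnitude larger than the standardized ones for the same model state, which is what forces a tiny learning rate or destabilizes training. Predictions made in the standardized space need the inverse transform before they are reported.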
Framework from_logits confusion causes silent bugs. If you use BinaryCrossentropy(from_logits=False) but pass raw logits (not sigmoid output), the loss will be computed on wrong values. The model will still train but will converge to wrong weights. Always double-check whether your model's last layer applies sigmoid/softmax before the loss.
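The mismatch can be imitated in NumPy. In this sketch, `bce_on_probs` is a hypothetical loss that expects probabilities and clips out-of-range inputs; the clipping is an assumption of the example, not a statement about any framework's internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_on_probs(y_true, y_hat, eps=1e-7):
    # BCE that expects probabilities in (0, 1). Out-of-range inputs are
    # clipped rather than rejected (an assumption of this sketch).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

y_true = np.array([1.0, 0.0, 1.0])
logits = np.array([4.0, -3.0, -2.0])  # raw last-layer outputs, illustrative

loss_correct = bce_on_probs(y_true, sigmoid(logits))
loss_mismatch = bce_on_probs(y_true, logits)  # logits wrongly treated as probabilities
print(loss_correct, loss_mismatch)
```

Both calls return a finite number and neither raises an error, which is why this bug tends to go unnoticed until the results look wrong.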
Test Your Understanding
- A model is trained to predict movie ratings (1–5 stars). One developer proposes MSE. Another proposes CCE with 5 output neurons (one per star). Both are technically valid. Compare the gradients each would produce for a sample where y=4 and the model outputs y_hat=2 (MSE) vs one-hot y=[0,0,0,1,0] and softmax ŷ=[0.1, 0.2, 0.5, 0.15, 0.05] (CCE). Which loss function better captures the ordinal relationship between star ratings?
- You train a sentiment classifier (positive/negative) with BCE and achieve 92% accuracy. You then switch to MSE (treating labels as 0 and 1). The accuracy drops to 87% with all other hyperparameters unchanged. Identify two specific reasons why MSE performs worse for this binary classification task, with reference to the gradient computation.
- Object detection loss = L_cls + λ × L_reg where L_cls is Focal loss for class prediction and L_reg is smooth-L1 (Huber) for bounding box regression. Why must the bounding box regression use a regression loss rather than a classification loss? What would go wrong if you discretized bounding box coordinates into classes and used CCE?
- Keras from_logits=True and from_logits=False produce the same mathematical result but from_logits=True is numerically more stable. Explain the specific numerical issue that arises when computing sigmoid followed by log: why does log(sigmoid(z)) lose precision for large negative z, and how does the log-sum-exp trick avoid it?
- A team builds a multi-label classifier for 1000 medical codes (each patient can have multiple diagnoses). They use BCE per output with sigmoid. After training, they observe that the model almost always predicts near-zero probability for all 1000 codes (the "predicting nothing" solution). What loss function phenomenon causes this? What class-weighting or loss modification would fix it?