
Which Loss Function to Use When

Jul 1, 2026 · 7 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

You have now seen eight loss functions: MSE, MAE, Huber, BCE, CCE, Sparse CCE, Hinge, and Focal. The first question when starting a new supervised learning task is not which optimizer to use; it is what kind of output your task requires. That one decision determines the loss function, the output activation, and the label format simultaneously.

This post is a synthesis. No new derivations.


Decision Flowchart

Loss Function Selection Flowchart

What is your output type?

  • Continuous: are outliers present?
      • No → MSE
      • Yes → Huber or MAE
  • Binary → BCE + Sigmoid
  • Imbalanced binary → Focal + Sigmoid
  • Multiclass: are the classes mutually exclusive?
      • Yes → CCE + Softmax
      • Yes, with large K (many classes) → Sparse CCE + Softmax (integer labels, memory efficient)
      • No (multi-label) → BCE per output, Sigmoid per output

Both CCE and Sparse CCE produce the same gradients; label format is the only difference.

Task-Loss-Activation Reference Table

| Task | Loss | Output activation | Label format |
|---|---|---|---|
| Regression | MSE | None (linear) | Float |
| Regression (with outliers) | Huber or MAE | None (linear) | Float |
| Binary classification | BCE | Sigmoid | 0 or 1 |
| Multiclass classification | CCE | Softmax | One-hot vector |
| Multiclass (large K) | Sparse CCE | Softmax | Integer index |
| Multi-label classification | BCE per output | Sigmoid per output | Binary vector |
| Object detection | Focal + L1/L2 | Sigmoid + linear | Class + bounding box |
| Imbalanced binary | Focal | Sigmoid | 0 or 1 |

Common Mistakes

Mistake 1: Using MSE for classification.

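MSE treats predicted probabilities as if they were real numbers and penalizes their distance from the true label. For y = 1 and ŷ = 0.3, the MSE gradient with respect to the prediction is 2(ŷ − y) = −1.4, but with a sigmoid output the gradient that reaches the logit z picks up the sigmoid slope: 2(ŷ − y)·ŷ(1 − ŷ) = −1.4 × 0.21 = −0.294. The BCE gradient with respect to z at the same point is ŷ − y = −0.7, because the slope term cancels. The deeper problem is what happens on confident wrong predictions: as ŷ → 0 with y = 1, the MSE loss is capped at 1 and its logit gradient shrinks toward 0, whereas −log(ŷ) grows without bound and the BCE gradient approaches −1. A model trained with MSE on classification often produces poorly calibrated probabilities and converges more slowly. Fix: use BCE for binary classification and CCE for multiclass.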
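To compare the two losses on equal footing, the sketch below computes both gradients with respect to the pre-sigmoid logit z, so the MSE gradient includes the sigmoid slope ŷ(1 − ŷ). The function names and the chosen logits are illustrative assumptions, not part of the earlier posts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logit(z, y):
    # L = (y_hat - y)^2 with y_hat = sigmoid(z); the chain rule adds the sigmoid slope
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_bce_wrt_logit(z, y):
    # L = -[y log(y_hat) + (1 - y) log(1 - y_hat)]; the sigmoid slope cancels out
    return sigmoid(z) - y

y = 1.0
# sigmoid(-0.8473) is roughly 0.3; the other two logits are confidently wrong
for z in (-0.8473, -4.0, -8.0):
    print(f"z={z:+.2f}  y_hat={sigmoid(z):.4f}  "
          f"MSE grad={grad_mse_wrt_logit(z, y):+.4f}  "
          f"BCE grad={grad_bce_wrt_logit(z, y):+.4f}")
```

As the prediction gets more confidently wrong, the MSE gradient shrinks toward zero while the BCE gradient approaches −1, which is the behavior that makes BCE the safer default for classification.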

Mistake 2: Using CCE with sigmoid outputs instead of softmax.

Softmax ensures the K outputs sum to 1, which is the correct constraint when exactly one class is true. If you apply sigmoid to each logit independently, each output is in (0,1) but they do not sum to 1. A model can output [0.9, 0.8, 0.7] simultaneously — this is mathematically inconsistent for mutually exclusive classes. Fix: always use softmax + CCE for mutually exclusive multiclass.
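A quick way to see the difference is to push the same logits through both activations. The logits below are made up so that the sigmoid outputs land near [0.9, 0.8, 0.7]; this is a minimal NumPy sketch, not framework code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.2, 1.4, 0.85])   # illustrative values only

p_sigmoid = sigmoid(logits)   # each entry in (0, 1), no constraint across entries
p_softmax = softmax(logits)   # entries compete and sum to 1

print("sigmoid:", np.round(p_sigmoid, 3), "sum =", round(float(p_sigmoid.sum()), 3))
print("softmax:", np.round(p_softmax, 3), "sum =", round(float(p_softmax.sum()), 3))
```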

Mistake 3: Using a single BCE output for multi-label classification.

If a sample can belong to multiple classes simultaneously (an image contains both a dog and a cat), a single BCE output cannot represent multiple classes at once. You need one sigmoid + BCE per class, each outputting a binary probability independently. Fix: K output neurons, K sigmoid activations, K BCE losses averaged.
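The fix can be sketched in NumPy as one sigmoid and one BCE term per class, averaged over classes and samples. The helper name `multilabel_bce`, the epsilon, and the toy batch are assumptions made for illustration.

```python
import numpy as np

def multilabel_bce(logits, targets, eps=1e-8):
    """Mean BCE over every (sample, class) pair; each class is scored independently."""
    probs = 1.0 / (1.0 + np.exp(-logits))          # K independent sigmoids per sample
    probs = np.clip(probs, eps, 1.0 - eps)          # keep log() finite
    per_output = -(targets * np.log(probs)
                   + (1.0 - targets) * np.log(1.0 - probs))
    return per_output.mean()

# Toy batch: 2 samples, 3 classes (dog, cat, bird). Sample 0 contains a dog AND a cat.
logits = np.array([[ 3.0,  2.5, -4.0],
                   [-3.0, -2.0,  1.5]])
targets = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

loss = multilabel_bce(logits, targets)
loss_flipped = multilabel_bce(-logits, targets)    # same model, every prediction reversed
print(f"loss={loss:.4f}  loss with flipped predictions={loss_flipped:.4f}")
```

Because the targets are a binary vector rather than a one-hot vector, a row such as [1, 1, 0] is perfectly valid here, which a softmax + CCE setup could not represent.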

Mistake 4: Missing numerical stability clip in log.

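np.log(ŷ) returns −∞ when ŷ = 0. In a training run, a softmax output can underflow to exactly 0.0 for a class if its logit is very negative relative to the others. The cost then becomes inf, or NaN once 0 × (−∞) appears in the sum, and the run either fails or silently continues with NaN weights. Fix: clip before taking the log, np.log(np.clip(ŷ, 1e-8, 1.0)), or add a small epsilon, np.log(ŷ + 1e-8), or use the framework's built-in cross-entropy, which handles this internally.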
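The failure is easy to reproduce. In the sketch below the prediction is actually correct, yet the unclipped loss is NaN because 0 × log(0) appears for the two zero-probability classes. The function names and the epsilon value are illustrative assumptions.

```python
import numpy as np

def cce_naive(y_onehot, y_hat):
    return -np.sum(y_onehot * np.log(y_hat))

def cce_clipped(y_onehot, y_hat, eps=1e-8):
    y_hat = np.clip(y_hat, eps, 1.0)    # floor probabilities so log() stays finite
    return -np.sum(y_onehot * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.0, 1.0, 0.0])       # two class probabilities underflowed to exactly 0

# Suppress NumPy's warnings so the NaN arrives as quietly as it would mid-training
with np.errstate(divide="ignore", invalid="ignore"):
    naive = cce_naive(y, y_hat)
clipped = cce_clipped(y, y_hat)

print("naive:  ", naive)
print("clipped:", clipped)
```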


Framework Defaults

Keras / TensorFlow:

  • BinaryCrossentropy(from_logits=False) — input is already sigmoid output, applies −[y log(ŷ) + ...]
  • BinaryCrossentropy(from_logits=True) — input is raw logit, applies sigmoid internally then BCE. This is numerically more stable because it avoids computing sigmoid then log(sigmoid), which can cause precision loss.
  • CategoricalCrossentropy — CCE with one-hot labels
  • SparseCategoricalCrossentropy — CCE with integer labels

Provided the input matches the flag (raw logits for from_logits=True, probabilities for from_logits=False), the two settings give the same mathematical result and differ only in numerical stability. from_logits=True is preferred because it uses the log-sum-exp trick internally to avoid computing exp then log in sequence.
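To make the stability argument concrete, here is a NumPy sketch of the two routes for binary cross-entropy. The logit-based version uses the standard rearrangement max(z, 0) − y·z + log(1 + exp(−|z|)); it illustrates the idea and is not a copy of Keras's internal code.

```python
import numpy as np

def bce_from_probs(y, z):
    # Naive route: compute sigmoid first, then take logs of the result
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_from_logits(y, z):
    # Same quantity, rearranged so no probability is ever rounded to exactly 0 or 1
    return np.maximum(z, 0.0) - y * z + np.log1p(np.exp(-np.abs(z)))

# Moderate logit: both routes agree
mild_naive = bce_from_probs(1.0, 2.0)
mild_stable = bce_from_logits(1.0, 2.0)

# Extreme logit with the wrong label: sigmoid(40) rounds to 1.0 in float64
with np.errstate(divide="ignore", invalid="ignore"):
    extreme_naive = bce_from_probs(0.0, 40.0)
extreme_stable = bce_from_logits(0.0, 40.0)

print(f"mild:    naive={mild_naive:.6f}  stable={mild_stable:.6f}")
print(f"extreme: naive={extreme_naive}  stable={extreme_stable:.6f}")
```

The naive route reports an infinite loss for the extreme case, while the rearranged form returns the finite value of about 40 that the math actually implies.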

PyTorch:

  • nn.BCELoss() — takes sigmoid output
  • nn.BCEWithLogitsLoss() — takes raw logit (numerically stable, preferred)
  • nn.CrossEntropyLoss() — takes raw logits, applies log-softmax internally (sparse labels)
  • nn.NLLLoss() — takes log-softmax output explicitly
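The relationship between these losses can be sketched without any framework: cross-entropy on raw logits is log-softmax followed by negative log-likelihood, and integer labels versus one-hot labels only change how the correct-class entry is picked out. The NumPy functions below are illustrative stand-ins, not the PyTorch or Keras implementations.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row max before exponentiating (log-sum-exp trick)
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cce_onehot(logits, y_onehot):
    # One-hot labels: multiply and sum, which zeroes out every wrong class
    return -(y_onehot * log_softmax(logits)).sum(axis=1).mean()

def cce_sparse(logits, y_index):
    # Integer labels: index the correct class directly (the NLL step)
    rows = np.arange(logits.shape[0])
    return -log_softmax(logits)[rows, y_index].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
y_index = np.array([0, 2])            # integer class indices
y_onehot = np.eye(3)[y_index]         # the same labels, one-hot encoded

loss_onehot = cce_onehot(logits, y_onehot)
loss_sparse = cce_sparse(logits, y_index)
print(loss_onehot, loss_sparse)
```

Both calls return the same value, so the choice between the two label formats is about memory and convenience rather than the loss itself.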

Code

python
# Lookup table: task type -> recommended loss and output activation.
GUIDE = {
    'regression':        'MSE (or Huber if outliers present)',
    'binary_clf':        'BinaryCrossEntropy + Sigmoid',
    'multiclass':        'CategoricalCrossEntropy + Softmax',
    'multilabel':        'BinaryCrossEntropy per output + Sigmoid per output',
    'imbalanced_binary': 'FocalLoss + Sigmoid',
    'regression_outlier':'Huber or MAE + Linear (no activation)',
}

def select_loss(task):
    """Return the recommended loss/activation pairing for a task type."""
    return GUIDE.get(task, 'Unknown task: check task type first')

# Print the recommendation for every known task type.
for task in GUIDE:
    print(f"{task:<20}: {select_loss(task)}")
text
regression          : MSE (or Huber if outliers present)
binary_clf          : BinaryCrossEntropy + Sigmoid
multiclass          : CategoricalCrossEntropy + Softmax
multilabel          : BinaryCrossEntropy per output + Sigmoid per output
imbalanced_binary   : FocalLoss + Sigmoid
regression_outlier  : Huber or MAE + Linear (no activation)

Where this builds from: Regression losses (02) and classification losses (03) defined the individual loss formulas. This post is the decision layer on top of those formulas.

Where this leads: The choice of loss function affects optimizer behavior — specifically, the gradient signal the optimizer receives. Section 5 (optimizers) will use BCE + the churn network as the primary example; the loss choice established here sets up that example. Section 6 (regularization) adds terms to the cost function: the regularization post modifies J, not L — that distinction comes from 01-loss-vs-cost.


Honest Limitations

The table assumes clean task definitions. Real problems often blur categories: self-supervised learning uses reconstruction loss (MSE) but the downstream task is classification. Generative models (VAEs, GANs) use custom loss formulations that don't fit the table. The table covers the 80% case.

Loss function choice affects convergence speed, not just correctness. Even if you choose a theoretically valid loss (e.g., MSE for regression), a poorly scaled MSE (with unnormalized y values in the millions) will produce huge gradients that destabilize training. Loss and data normalization are coupled: normalize y before applying MSE in regression.
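The scale effect is easy to see by comparing the MSE gradient on raw and standardized targets. The data below is synthetic and the helper names are assumptions; the point is only the difference in gradient magnitude at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
y_raw = rng.uniform(1e6, 5e6, size=1000)     # made-up targets in the millions

# Standardize the targets and keep the statistics to undo the transform later
y_mean, y_std = y_raw.mean(), y_raw.std()
y_norm = (y_raw - y_mean) / y_std

def mean_abs_mse_grad(y_hat, y):
    # Per-sample gradient of (y_hat - y)^2 with respect to y_hat is 2 * (y_hat - y)
    return np.abs(2.0 * (y_hat - y)).mean()

def to_original_units(pred_norm):
    # Map predictions made in normalized space back to the original scale
    return pred_norm * y_std + y_mean

y_hat0 = np.zeros_like(y_raw)                # an untrained model that predicts 0 everywhere
grad_raw = mean_abs_mse_grad(y_hat0, y_raw)
grad_norm = mean_abs_mse_grad(y_hat0, y_norm)

print(f"mean |grad| on raw targets:        {grad_raw:.3e}")
print(f"mean |grad| on normalized targets: {grad_norm:.3e}")
```

At inference time, predictions made in the normalized space go through `to_original_units` to recover values in the original units.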

Framework from_logits confusion causes silent bugs. If you use BinaryCrossentropy(from_logits=False) but pass raw logits (not sigmoid output), the loss will be computed on wrong values. The model will still train but will converge to wrong weights. Always double-check whether your model's last layer applies sigmoid/softmax before the loss.
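One cheap guard is to inspect a batch of model outputs before training and confirm they are in the range the loss expects. The helper below is hypothetical (it is not part of Keras or PyTorch) and only a heuristic: logits that happen to fall inside [0, 1] will pass, so it catches the common case rather than proving correctness.

```python
import numpy as np

def looks_like_probabilities(outputs, atol=1e-6):
    """Heuristic check: True if every value lies in [0, 1] within a small tolerance."""
    outputs = np.asarray(outputs, dtype=float)
    return bool(np.all(outputs >= -atol) and np.all(outputs <= 1.0 + atol))

raw_logits = np.array([-3.2, 0.4, 5.1])               # last layer has no activation
sigmoid_out = 1.0 / (1.0 + np.exp(-raw_logits))       # last layer applies sigmoid

# A loss configured for probabilities (from_logits=False) should only see the second kind
print("raw logits OK for a probability-based loss? ", looks_like_probabilities(raw_logits))
print("sigmoid output OK for a probability-based loss?", looks_like_probabilities(sigmoid_out))
```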


Test Your Understanding

  1. A model is trained to predict movie ratings (1–5 stars). One developer proposes MSE. Another proposes CCE with 5 output neurons (one per star). Both are technically valid. Compare the gradients each would produce for a sample where y=4 and the model outputs y_hat=2 (MSE) vs one-hot y=[0,0,0,1,0] and softmax ŷ=[0.1, 0.2, 0.5, 0.15, 0.05] (CCE). Which loss function better captures the ordinal relationship between star ratings?

  2. You train a sentiment classifier (positive/negative) with BCE and achieve 92% accuracy. You then switch to MSE (treating labels as 0 and 1). The accuracy drops to 87% with all other hyperparameters unchanged. Identify two specific reasons why MSE performs worse for this binary classification task, with reference to the gradient computation.

  3. Object detection loss = L_cls + λ × L_reg where L_cls is Focal loss for class prediction and L_reg is smooth-L1 (Huber) for bounding box regression. Why must the bounding box regression use a regression loss rather than a classification loss? What would go wrong if you discretized bounding box coordinates into classes and used CCE?

  4. Keras from_logits=True and from_logits=False produce the same mathematical result but from_logits=True is numerically more stable. Explain the specific numerical issue that arises when computing sigmoid followed by log: why does log(sigmoid(z)) lose precision for large negative z, and how does the log-sum-exp trick avoid it?

  5. A team builds a multi-label classifier for 1000 medical codes (each patient can have multiple diagnoses). They use BCE per output with sigmoid. After training, they observe that the model almost always predicts near-zero probability for all 1000 codes — the "predicting nothing" solution. What loss function phenomenon causes this? What class-weighting or loss modification would fix it?
