Quantile Loss (Pinball Loss)

Jul 3, 2026 · 7 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

MSE optimizes toward the conditional mean and MAE toward the conditional median, so both produce a single central estimate of the target. But there are many problems where the center of the distribution is the wrong target entirely.

An insurance company pricing policies needs to predict the 90th percentile of claim cost, not the average. Setting premiums at the average means losing money on a large share of policies (about half, when claims are roughly symmetric around the average). An inventory manager stocking warehouses needs to predict the 80th percentile of demand, because stocking at average demand causes stockouts about half the time. A hospital scheduling staff needs to predict the 95th percentile of patient arrivals, because understaffing is catastrophic even if rare.

Quantile loss (also called pinball loss) makes the target percentile a parameter of the loss function. Set q=0.5 and training converges to the median. Set q=0.9 and training converges to the 90th percentile. The formula is asymmetric: underprediction and overprediction are penalized at different rates depending on q.

Anchor: house price prediction, 5 samples.

```python
y_true = [300000, 180000, 450000, 120000, 350000]
y_pred = [320000, 165000, 480000, 135000, 340000]
# error = y_true - y_pred, element by element
error  = [-20000, +15000, -30000, -15000, +10000]
```

Note: error here is y_true − y_pred. Positive = underprediction (we guessed too low). Negative = overprediction (we guessed too high).


The Formula

L_q(y, ŷ) = q · max(y − ŷ, 0) + (1 − q) · max(ŷ − y, 0)

Equivalently as piecewise:

  • If y > ŷ (underprediction): loss = q · (y − ŷ)
  • If y < ŷ (overprediction): loss = (1 − q) · (ŷ − y)
  • If y = ŷ: loss = 0

At q=0.5: both branches have coefficient 0.5. Underpredicting by 10,000 costs the same as overpredicting by 10,000. The minimizer of the expected loss under this symmetric penalty is the median — not the mean.

At q=0.9: underpredicting by 10,000 costs 0.9 × 10,000 = 9,000. Overpredicting by 10,000 costs only 0.1 × 10,000 = 1,000. The model is penalized 9× more for being too low than for being too high. This pushes predictions up until only 10% of the true values exceed the prediction — the 90th percentile.
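
The claim that q selects the percentile can be checked numerically. The sketch below is an illustration rather than part of the original walkthrough: it assumes a toy target made of the integers 1 to 99 and a model that predicts one constant for every sample, then scans candidate constants for the one with the lowest mean quantile loss. The names `pinball` and `best_constant` are made up for this example.

```python
import numpy as np

def pinball(y_true, y_pred, q):
    """Mean quantile loss, using the same formula as L_q above."""
    e = y_true - y_pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

def best_constant(y, q, candidates):
    """Return the candidate constant with the lowest mean quantile loss."""
    losses = [pinball(y, c, q) for c in candidates]
    return candidates[int(np.argmin(losses))]

# Toy data (an assumption for illustration): the integers 1..99.
y = np.arange(1, 100, dtype=float)
candidates = np.arange(1, 100, dtype=float)

for q in (0.5, 0.9):
    c = best_constant(y, q, candidates)
    share_below = np.mean(y <= c)
    print(f"q={q}: best constant = {c:.0f}, share of y at or below it = {share_below:.2f}")
```

On this toy data the best constant is 50 at q=0.5 (the median) and 90 at q=0.9, with about 91% of the values at or below it. A real model predicts a different value per input, but the same pull toward the q-th quantile applies conditionally on the features.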


Trace Table at q=0.5

| Sample | y_true | y_pred | error (y−ŷ) | branch | loss |
|---|---|---|---|---|---|
| 1 | 300000 | 320000 | −20000 | overprediction | 0.5 × 20000 = 10000 |
| 2 | 180000 | 165000 | +15000 | underprediction | 0.5 × 15000 = 7500 |
| 3 | 450000 | 480000 | −30000 | overprediction | 0.5 × 30000 = 15000 |
| 4 | 120000 | 135000 | −15000 | overprediction | 0.5 × 15000 = 7500 |
| 5 | 350000 | 340000 | +10000 | underprediction | 0.5 × 10000 = 5000 |

Total loss at q=0.5: (10000 + 7500 + 15000 + 7500 + 5000) / 5 = 45000 / 5 = 9000


Trace Table at q=0.9

| Sample | error (y−ŷ) | branch | loss |
|---|---|---|---|
| 1 | −20000 | overprediction | (1−0.9) × 20000 = 0.1 × 20000 = 2000 |
| 2 | +15000 | underprediction | 0.9 × 15000 = 13500 |
| 3 | −30000 | overprediction | 0.1 × 30000 = 3000 |
| 4 | −15000 | overprediction | 0.1 × 15000 = 1500 |
| 5 | +10000 | underprediction | 0.9 × 10000 = 9000 |

Total loss at q=0.9: (2000 + 13500 + 3000 + 1500 + 9000) / 5 = 29000 / 5 = 5800

Samples 2 and 5 (underpredictions) now dominate the loss: 13500 and 9000 respectively, together 22500 of the 29000 summed loss (about 78%). The same underpredictions contributed only 7500 and 5000 at q=0.5. The model will be strongly pulled toward higher predictions to reduce these large underprediction penalties.

Total loss at q=0.9 (5800) is lower than at q=0.5 (9000) on this anchor because our predictions already tend to overpredict — and q=0.9 forgives overprediction heavily.


Asymmetric V Shape

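[Figure: Quantile Loss, Asymmetric Penalty. Loss plotted against error with y=ŷ at 0, over-prediction on the left and under-prediction on the right. Curve for q = 0.5 (median): slope 0.5 on both arms. Curve for q = 0.9: slope 0.1 on the over-prediction arm and slope 0.9 on the under-prediction arm, so underprediction is heavily penalized.]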

The q=0.5 V is symmetric — same slope on both arms. The q=0.9 V is tilted strongly right: the left arm (overprediction) is nearly flat (slope 0.1) while the right arm (underprediction) is steep (slope 0.9). The model's optimal prediction shifts rightward — toward higher values — to avoid the steep penalty side.
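
The two V shapes can also be read off numerically. This is a small sketch, not the code behind the original figure: it evaluates the loss at a few errors in arbitrary units and measures the slope of each arm. The helper name `pinball_per_error` is an assumption for this example.

```python
import numpy as np

def pinball_per_error(error, q):
    """Quantile loss as a function of error = y_true - y_pred."""
    # Underprediction (error >= 0) costs q per unit, overprediction costs (1 - q) per unit.
    return np.where(error >= 0, q * error, (q - 1) * error)

# Negative error = overprediction (left arm), positive = underprediction (right arm).
errors = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

for q in (0.5, 0.9):
    losses = pinball_per_error(errors, q)
    left_slope = losses[0] - losses[1]    # extra loss per extra unit of overprediction
    right_slope = losses[4] - losses[3]   # extra loss per extra unit of underprediction
    print(f"q={q}: left arm slope = {left_slope:.1f}, right arm slope = {right_slope:.1f}")
```

Both arms come out at 0.5 for q=0.5, while q=0.9 gives 0.1 on the overprediction arm and 0.9 on the underprediction arm.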


Gradient

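The gradient of the loss with respect to the prediction ŷ is piecewise constant. Gradient descent moves ŷ against the gradient, so a negative gradient raises ŷ: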

  • If y > ŷ (underprediction): gradient = −q (push ŷ up)
  • If y < ŷ (overprediction): gradient = (1−q) (push ŷ down)
  • If y = ŷ: subgradient = 0

Gradients at q=0.9 for anchor:

| Sample | error | direction | gradient |
|---|---|---|---|
| 1 | −20000 | overprediction | +(1−0.9) = +0.1 |
| 2 | +15000 | underprediction | −0.9 = −0.9 |
| 3 | −30000 | overprediction | +0.1 |
| 4 | −15000 | overprediction | +0.1 |
| 5 | +10000 | underprediction | −0.9 |

The underprediction gradients (−0.9) push predictions up. The overprediction gradients (+0.1) push predictions down weakly. On average: (0.1 + −0.9 + 0.1 + 0.1 + −0.9) / 5 = −1.5/5 = −0.3, a net push upward on predictions.
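
The gradient table can be reproduced in a few lines. This is a minimal sketch assuming the gradient is taken with respect to each prediction directly; `quantile_grad` is a name chosen for this example, and the learning rate at the end is a made-up value used only to show the direction of the update.

```python
import numpy as np

def quantile_grad(y_true, y_pred, q):
    """Per-sample (sub)gradient of the quantile loss with respect to y_pred."""
    e = y_true - y_pred
    # -q where we underpredict, (1 - q) where we overpredict, 0 at an exact hit.
    return np.where(e > 0, -q, np.where(e < 0, 1.0 - q, 0.0))

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)

grads = quantile_grad(y_true, y_pred, q=0.9)
print(grads)          # per-sample gradients: +0.1 for overpredictions, -0.9 for underpredictions
print(grads.mean())   # average gradient, about -0.3

# Hypothetical learning rate: if all five predictions shared one bias term,
# a single descent step would move that bias by -lr * mean gradient (upward here).
lr = 1000.0
print(-lr * grads.mean())
```

Note that the gradient depends only on the sign of the error, not its size: sample 3 is off by 30000 and sample 4 by 15000, yet both receive +0.1.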


Code

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    # Mean pinball loss: q * e when e >= 0, (1 - q) * |e| when e < 0.
    e = y_true - y_pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

def quantile_loss_per_sample(y_true, y_pred, q):
    # Same formula, one sample at a time, so each branch is visible.
    losses = []
    for yt, yp in zip(y_true, y_pred):
        e = yt - yp
        loss = q * e if e >= 0 else (q - 1) * e
        losses.append(loss)
    return losses

# Anchor data: house prices, 5 samples.
y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)

for q in [0.5, 0.9]:
    per_sample = quantile_loss_per_sample(y_true, y_pred, q)
    total = quantile_loss(y_true, y_pred, q)  # mean over the 5 samples
    print(f"\nq={q}:")
    print(f"{'Sample':>6} | {'y_true':>8} | {'y_pred':>8} | {'error':>8} | {'loss':>10}")
    for i, (yt, yp, l) in enumerate(zip(y_true, y_pred, per_sample)):
        print(f"{i+1:>6} | {yt:>8.0f} | {yp:>8.0f} | {yt-yp:>8.0f} | {l:>10.1f}")
    print(f"  Total quantile loss (q={q}): {total:.1f}")
```

```text
q=0.5:
Sample |   y_true |   y_pred |    error |       loss
     1 |   300000 |   320000 |   -20000 |    10000.0
     2 |   180000 |   165000 |    15000 |     7500.0
     3 |   450000 |   480000 |   -30000 |    15000.0
     4 |   120000 |   135000 |   -15000 |     7500.0
     5 |   350000 |   340000 |    10000 |     5000.0
  Total quantile loss (q=0.5): 9000.0

q=0.9:
Sample |   y_true |   y_pred |    error |       loss
     1 |   300000 |   320000 |   -20000 |     2000.0
     2 |   180000 |   165000 |    15000 |    13500.0
     3 |   450000 |   480000 |   -30000 |     3000.0
     4 |   120000 |   135000 |   -15000 |     1500.0
     5 |   350000 |   340000 |    10000 |     9000.0
  Total quantile loss (q=0.9): 5800.0
```

At q=0.5, quantile loss reduces to half of MAE (02-regression-losses.md): both arms have slope 0.5, so the per-sample loss is 0.5×|error| and the mean is 0.5×MAE. The minimizer of MAE is the median, and quantile loss at q=0.5 makes this explicit.

For other values of q, gradient boosted tree libraries such as XGBoost and LightGBM provide quantile objectives, and quantile regression forests estimate conditional quantiles from the target values in their leaves rather than from a quantile objective. Multiple quantile predictions (q=0.1, 0.5, 0.9) give prediction intervals without distributional assumptions, and they are a common starting point for conformal methods such as conformalized quantile regression.

Honest Limitations

q must be chosen before training. If you need the 90th percentile but set q=0.8, you train a model that converges to the 80th percentile. There is no feedback loop during training that tells you whether you chose the right q for your business need. Calibration must be verified post-hoc.

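The gradient is discontinuous at y=ŷ (zero error). Standard gradient descent handles this via a subgradient (set the gradient to 0 at zero error), but the gradient magnitude stays at q or 1−q no matter how small the error is, so updates do not shrink as predictions approach the target. This makes quantile regression slightly more sensitive to initialization and learning rate than smooth losses. When many predictions sit near zero error, small changes flip the gradient between −q and 1−q, so the training signal can be noisy.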

Quantile loss alone does not produce calibrated prediction intervals. Predicting both q=0.1 and q=0.9 does not guarantee that the true value falls between them 80% of the time — the two models may be miscalibrated in the same direction. Achieving calibration requires conformal prediction or explicit interval calibration on a held-out set.
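
A held-out coverage check is the simplest way to see whether an interval is calibrated. The sketch below uses made-up numbers (house prices in thousands, with hypothetical q=0.1 and q=0.9 predictions) purely to show the bookkeeping. The function names are assumptions for this example, and the arrays stand in for the outputs of two already trained quantile models.

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Share of held-out targets that fall inside [lower, upper]."""
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))

def share_at_or_below(y_true, y_pred_q):
    """Share of held-out targets at or below a predicted quantile; should be close to q."""
    return float(np.mean(y_true <= y_pred_q))

# Hypothetical held-out data, in thousands (not from the anchor example).
y_holdout = np.array([210., 340., 125., 480., 300., 155., 390., 270., 230., 510.])
pred_q10  = np.array([180., 300., 130., 400., 260., 140., 330., 240., 200., 420.])
pred_q90  = np.array([260., 380., 190., 470., 350., 200., 450., 320., 290., 500.])

print("share at or below q=0.1 prediction:", share_at_or_below(y_holdout, pred_q10))
print("share at or below q=0.9 prediction:", share_at_or_below(y_holdout, pred_q90))
print("coverage of the [q10, q90] interval:", empirical_coverage(y_holdout, pred_q10, pred_q90))
```

With these numbers the interval covers 7 of 10 targets instead of the nominal 80%, because the q=0.9 prediction sits above only 8 of 10 targets. Ten samples are far too few for a real check; the point is only that coverage is measured on held-out data rather than assumed from q.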


Test Your Understanding

  1. At q=0.5, the total quantile loss for the anchor is 9000. Verify that this equals 0.5 × MAE of the anchor predictions. Why does this relationship hold exactly?

  2. A demand forecasting model needs to minimize stockouts (which happen when actual demand exceeds prediction). Should you use q=0.3 or q=0.8? Explain in terms of which side of the loss function gets the steeper slope.

  3. What happens to quantile loss as q → 0? What prediction does the model converge to, and what is the gradient for every underprediction in that limit?

  4. The anchor at q=0.9 shows total loss 5800, which is lower than q=0.5 (9000). Does this mean q=0.9 is a "better" loss function for this dataset? What would the correct interpretation be?

  5. You train a model with q=0.9 on a training set where 90% of predictions are overpredictions (you started with a very high initial bias). How does the gradient update behave in this scenario? Does it self-correct?
