Log-Cosh Loss

Jul 3, 2026 · 7 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

MSE squares the error — a prediction that is 30,000 off on a house price gets 900,000,000 added to the loss. One extreme outlier can dominate the entire training signal. MAE takes the absolute value instead: an error of 30,000 adds 30,000, not 900 million. But MAE has a gradient of ±1 everywhere (except exactly at 0 where the gradient is undefined), which makes it noisy near convergence — the gradient doesn't shrink as you approach the correct prediction.

Log-Cosh sits between them. For small errors it behaves like MSE (gradient shrinks smoothly toward zero). For large errors it behaves like MAE (gradient caps at ±1). No kink, no undefined gradient, no outlier explosion.
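To make the comparison concrete, here is a minimal sketch of what a single 30,000 miss contributes under each loss. The helper name `log_cosh` is an illustrative choice, and it assumes the overflow-safe identity quoted in the Honest Limitations section rather than calling cosh directly.

```python
import numpy as np

def log_cosh(e):
    # Overflow-safe identity: log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2)
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

error = 30_000.0  # the 30,000 miss from the house-price example

# Per-sample contribution of this one error under each loss
contributions = {
    "MSE": error ** 2,
    "MAE": abs(error),
    "Log-Cosh": float(log_cosh(error)),
}

for name, value in contributions.items():
    print(f"{name:>8}: {value:,.2f}")
```

The squared term is four orders of magnitude above the other two, while Log-Cosh lands just under the MAE value.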

Anchor: house price prediction, 5 samples.

python
y_true = [300000, 180000, 450000, 120000, 350000]
y_pred = [320000, 165000, 480000, 135000, 340000]
errors = [20000,  -15000,  30000,  15000, -10000]

The Formula

L = (1/n) Σ log(cosh(ŷᵢ − yᵢ))

where cosh(x) = (eˣ + e⁻ˣ) / 2 — the hyperbolic cosine.

cosh is always ≥ 1, symmetric around 0, and grows exponentially for large |x|. Taking the log tames that growth: log(cosh(x)) grows like x²/2 near zero and like |x| far from zero. That growth profile is what gives Log-Cosh its hybrid behavior.


Small vs Large Error Behavior

For small x: cosh(x) ≈ 1 + x²/2, so log(cosh(x)) ≈ log(1 + x²/2) ≈ x²/2.

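Example: error = 0.1: log(cosh(0.1)) ≈ 0.00499, against (0.1)²/2 = 0.005, which is half the squared error. "Small" here means small in absolute terms (|x| well below 1), not small relative to house prices: a raw error of 1,000 is already deep in the linear regime, where log(cosh(1000)) ≈ 1000 − 0.693 ≈ 999.31.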
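A quick numerical check of the small-error approximation log(cosh(x)) ≈ x²/2, as a minimal sketch. The sample values 0.01, 0.1, and 0.5 are chosen for illustration and are not part of the anchor data.

```python
import numpy as np

# Errors small enough that cosh stays well inside float64 range,
# so the direct formula is safe to evaluate here.
small_errors = np.array([0.01, 0.1, 0.5])

exact = np.log(np.cosh(small_errors))
quadratic = small_errors ** 2 / 2
relative_gap = (quadratic - exact) / exact  # how much x^2/2 overshoots

for e, ex, q, gap in zip(small_errors, exact, quadratic, relative_gap):
    print(f"e={e:<5} log(cosh)={ex:.6f}  e^2/2={q:.6f}  gap={gap:.2%}")
```

The quadratic always overshoots slightly, and the gap widens as the error grows toward 1.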

For large |x|: cosh(x) ≈ e^|x|/2, so log(cosh(x)) ≈ |x| − log(2).

Example: error = 100,000 (large): log(cosh(100000)) ≈ 100,000 − 0.693 ≈ 99,999.31, which is linear in the error, like MAE.

The transition happens around |x| ≈ 1–2. Well below that (|x| ≲ 0.5), Log-Cosh is within about 4% of x²/2. Above |x| ≈ 3, it is within about 0.003 of |x| − log(2).
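The crossover between the two regimes can be read off numerically. The sketch below prints how far log(cosh(x)) sits from the quadratic branch x²/2 and from the linear branch |x| − log(2). The grid of x values is an arbitrary illustrative choice.

```python
import numpy as np

def log_cosh(e):
    # Overflow-safe identity for log(cosh(e))
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

xs = np.array([0.25, 0.5, 1.0, 2.0, 3.0, 5.0])
exact = log_cosh(xs)
quad_err = np.abs(xs ** 2 / 2 - exact)               # distance from the MSE-like branch
lin_err = np.abs(np.abs(xs) - np.log(2.0) - exact)   # distance from the MAE-like branch

print(f"{'x':>5} | {'log(cosh)':>9} | {'|quad - exact|':>14} | {'|lin - exact|':>13}")
for x, ex, qe, le in zip(xs, exact, quad_err, lin_err):
    print(f"{x:>5.2f} | {ex:>9.4f} | {qe:>14.4f} | {le:>13.4f}")
```

Whichever column is smaller at a given x tells you which regime that error belongs to.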


Computing on the Anchor

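| Sample | error | cosh(error) | log(cosh) | grad = tanh(e) |
|---|---|---|---|---|
| 1 | 20000 | ≈ e²⁰⁰⁰⁰/2 (huge) | ≈ 19999.31 | ≈ +1.0000 |
| 2 | −15000 | ≈ e¹⁵⁰⁰⁰/2 (huge) | ≈ 14999.31 | ≈ −1.0000 |
| 3 | 30000 | ≈ e³⁰⁰⁰⁰/2 | ≈ 29999.31 | ≈ +1.0000 |
| 4 | 15000 | ≈ e¹⁵⁰⁰⁰/2 | ≈ 14999.31 | ≈ +1.0000 |
| 5 | −10000 | ≈ e¹⁰⁰⁰⁰/2 | ≈ 9999.31 | ≈ −1.0000 |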

All anchor errors are in the tens of thousands, deep in the large-error regime where Log-Cosh ≈ MAE. The gradients are essentially ±1, capped like MAE. A model trained on normalized errors (dividing by 100,000, which maps these errors to magnitudes between 0.1 and 0.3) would show the small-error regime.

Log-Cosh Loss ≈ (19999.31 + 14999.31 + 29999.31 + 14999.31 + 9999.31) / 5 ≈ 17999.31

At this scale, Log-Cosh ≈ MAE − log(2) per sample (the constant shift from the large-x approximation).
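The constant shift can be verified directly on the anchor data. This is a minimal sketch that applies the large-error approximation |e| − log(2) per sample and compares the mean against MAE.

```python
import numpy as np

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
errors = y_pred - y_true

mae = np.mean(np.abs(errors))
# Large-error approximation applied per sample: log(cosh(e)) ~ |e| - log(2)
approx_log_cosh = np.mean(np.abs(errors) - np.log(2.0))

print(f"MAE:             {mae:.2f}")
print(f"MAE - log(2):    {mae - np.log(2.0):.2f}")
print(f"Approx Log-Cosh: {approx_log_cosh:.2f}")
```

Because the shift is the same for every sample, it has no effect on the gradient; it only offsets the reported loss value.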


Gradient

The gradient of log(cosh(x)) with respect to x is:

d/dx log(cosh(x)) = tanh(x)

This is the key property:

  • For small x: tanh(x) ≈ x → gradient is proportional to error (like MSE)
  • For large x: tanh(x) → ±1 → gradient is bounded (like MAE)

[Figure: Gradient: Log-Cosh vs MSE vs MAE, for errors from −5 to +5. Curves: tanh (Log-Cosh grad), linear (MSE grad), ±1 step (MAE grad). Near 0, tanh ≈ linear; far from 0, tanh → ±1.]
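One way to sanity-check the tanh gradient is a central finite difference. The sketch below is illustrative: the step size `h` and the test points are arbitrary choices, and `log_cosh` uses the overflow-safe identity instead of calling cosh.

```python
import numpy as np

def log_cosh(e):
    # Overflow-safe identity for log(cosh(e))
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

xs = np.array([-5.0, -1.0, -0.1, 0.1, 1.0, 5.0])
h = 1e-6  # finite-difference step (illustrative choice)

numeric_grad = (log_cosh(xs + h) - log_cosh(xs - h)) / (2 * h)
analytic_grad = np.tanh(xs)
max_abs_diff = np.max(np.abs(numeric_grad - analytic_grad))

for x, n, a in zip(xs, numeric_grad, analytic_grad):
    print(f"x={x:>5.1f}  numeric={n:>9.6f}  tanh(x)={a:>9.6f}")
print(f"max |numeric - tanh| = {max_abs_diff:.2e}")
```

At x = ±0.1 the gradient is close to x itself, and at x = ±5 it is already within 0.0001 of ±1.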

Three Loss Curves

[Figure: Log-Cosh vs MSE vs MAE loss curves, for errors from −5 to +5. Near 0, MSE ≈ Log-Cosh; far from 0, MAE ≈ Log-Cosh.]

Near zero: MSE and Log-Cosh are nearly identical. Far from zero: MAE and Log-Cosh are nearly identical (parallel lines, offset by log(2) ≈ 0.693).


Code

python
import numpy as np

def log_cosh(e):
    # Overflow-safe identity: log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2)
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def log_cosh_loss(y_true, y_pred):
    return np.mean(log_cosh(y_pred - y_true))

def log_cosh_grad(y_true, y_pred):
    # Derivative of each sample's loss with respect to its prediction
    return np.tanh(y_pred - y_true)

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
errors = y_pred - y_true

print(f"{'Sample':>6} | {'error':>8} | {'cosh(e)':>12} | {'log(cosh)':>10} | {'grad(tanh)':>10}")
with np.errstate(over="ignore"):  # np.cosh overflows to inf here; printed only to show that
    for i, e in enumerate(errors):
        c = np.cosh(e)
        lc = log_cosh(e)
        g = np.tanh(e)
        print(f"{i+1:>6} | {e:>8.0f} | {c:>12.4f} | {lc:>10.4f} | {g:>10.4f}")
print(f"\nLog-Cosh Loss: {log_cosh_loss(y_true, y_pred):.4f}")
text
Sample |    error |      cosh(e) |  log(cosh) | grad(tanh)
     1 |    20000 |          inf | 19999.3069 |     1.0000
     2 |   -15000 |          inf | 14999.3069 |    -1.0000
     3 |    30000 |          inf | 29999.3069 |     1.0000
     4 |    15000 |          inf | 14999.3069 |     1.0000
     5 |   -10000 |          inf |  9999.3069 |    -1.0000

Log-Cosh Loss: 17999.3069

np.cosh overflows double precision for errors this large: float64 tops out near e^709, so cosh returns inf, and np.log of that would give inf rather than a usable loss. The log(cosh) column is therefore computed with the stable identity log(cosh(x)) = |x| + log(1 + e^(−2|x|)) − log(2), which reduces to |x| − log(2) for large |x|. An equivalent NumPy form is np.logaddexp(x, −x) − log(2).

The gradient is exactly ±1.0000 for all samples — confirming the large-error MAE regime.


Log-Cosh is one of several hybrid approaches to the MSE/MAE trade-off. Huber loss (02-regression-losses.md) is the piecewise version: it is exactly quadratic below a threshold δ and exactly linear above it. Its gradient is continuous, but its second derivative jumps from 1 to 0 at |error| = δ. Log-Cosh is smooth everywhere: the transition from quadratic to linear happens continuously through the tanh gradient, and the second derivative 1 − tanh²(x) is continuous too. For most regression tasks they perform similarly; choose Log-Cosh when you need a twice-differentiable loss (some second-order optimizers require this), and Huber when interpretability of the threshold matters.
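The difference in smoothness shows up in the derivatives. The sketch below compares the two losses just inside and just outside the Huber threshold. It assumes the standard Huber definition (½e² for |e| ≤ δ, δ(|e| − δ/2) otherwise), and δ = 1.0 is an illustrative choice.

```python
import numpy as np

def huber_grad(e, delta=1.0):
    # Huber gradient: e inside [-delta, delta], clipped to +/-delta outside
    return np.clip(e, -delta, delta)

def huber_second_derivative(e, delta=1.0):
    # 1 in the quadratic zone, 0 in the linear zone: a jump at |e| = delta
    return (np.abs(e) < delta).astype(float)

def log_cosh_grad(e):
    return np.tanh(e)

def log_cosh_second_derivative(e):
    # d/de tanh(e) = 1 - tanh(e)^2, continuous everywhere
    return 1.0 - np.tanh(e) ** 2

es = np.array([0.5, 0.99, 1.01, 2.0])  # two points straddle delta = 1.0

print(f"{'e':>5} | {'huber g':>8} | {'huber g2':>8} | {'lc g':>8} | {'lc g2':>8}")
for e, hg, h2, lg, l2 in zip(es, huber_grad(es), huber_second_derivative(es),
                             log_cosh_grad(es), log_cosh_second_derivative(es)):
    print(f"{e:>5.2f} | {hg:>8.4f} | {h2:>8.4f} | {lg:>8.4f} | {l2:>8.4f}")
```

Between 0.99 and 1.01 the Huber second derivative drops from 1 to 0, while the Log-Cosh second derivative changes only slightly.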

Honest Limitations

cosh overflows float64 once |x| exceeds roughly 710, so raw house-price errors in the thousands already break a naive log(cosh(x)); you do not need million-dollar outliers to hit the problem. Use the numerically stable identity log(cosh(x)) = |x| + log(1 + e^{-2|x|}) − log(2), which is exact for every x, not only large ones. PyTorch ships Huber-style losses (nn.HuberLoss, nn.SmoothL1Loss) but no built-in Log-Cosh, so there the stable form has to be written by hand.
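As a minimal sketch of the stability issue, the snippet below evaluates the naive formula next to two overflow-safe forms: the identity above and an equivalent written with `np.logaddexp`. The function names and the test values are illustrative choices.

```python
import numpy as np

def log_cosh_identity(x):
    # |x| + log(1 + exp(-2|x|)) - log(2): the exponent is never positive
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def log_cosh_logaddexp(x):
    # logaddexp(x, -x) = log(e^x + e^-x) = log(2 cosh(x)), computed without overflow
    return np.logaddexp(x, -x) - np.log(2.0)

xs = np.array([0.0, 0.5, 10.0, 1000.0, 500000.0])

with np.errstate(over="ignore"):
    naive = np.log(np.cosh(xs))  # becomes inf once cosh leaves float64 range

for x, n, a, b in zip(xs, naive, log_cosh_identity(xs), log_cosh_logaddexp(xs)):
    print(f"x={x:>9.1f}  naive={n:>12.4f}  identity={a:>12.4f}  logaddexp={b:>12.4f}")
```

The three columns agree for moderate x; for the two largest values only the stable forms return a finite loss.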

Log-Cosh is less interpretable than MSE or MAE. You cannot directly explain "our model minimizes log-cosh loss" to a business stakeholder without defining hyperbolic cosine. For production systems where the loss function must be documented for non-technical audiences, MAE or RMSE is usually preferred.

Log-Cosh is not a first-class loss in every framework. Keras provides it as tf.keras.losses.LogCosh, but PyTorch's built-in losses cover MSE, MAE, and Huber without a Log-Cosh option, so it requires a custom implementation there. This is not a technical barrier, but it adds maintenance burden, and a hand-written loss may not benefit from framework-level optimizations for the backward pass.


Test Your Understanding

  1. For error = 0.5 (small), compute log(cosh(0.5)) exactly. Then compute (0.5)²/2. How close are they? This demonstrates the quadratic approximation for small errors.

  2. For error = 10 (large), compute log(cosh(10)) using the approximation |x| − log(2). Compare to the actual value. What is the error in the approximation?

  3. The anchor errors (15,000–30,000) are all in the large-error regime where Log-Cosh ≈ MAE. If you normalize the data by dividing by 100,000 first, what regime would the errors fall in? Would gradient behavior change?

  4. Log-Cosh gradient is tanh(error). MSE gradient is 2×error. For error = 0.1, which gradient is larger? For error = 5.0, which is larger? At what error value do the two gradients cross?

  5. A regression model predicts house prices and most errors are in [−5000, 5000] but one training sample has an error of 500,000 (a data entry mistake). Compare how MSE, MAE, and Log-Cosh respond to this outlier in terms of (a) loss contribution and (b) gradient magnitude.
