MSE squares the error — a prediction that is 30,000 off on a house price gets 900,000,000 added to the loss. One extreme outlier can dominate the entire training signal. MAE takes the absolute value instead: an error of 30,000 adds 30,000, not 900 million. But MAE has a gradient of ±1 everywhere (except exactly at 0 where the gradient is undefined), which makes it noisy near convergence — the gradient doesn't shrink as you approach the correct prediction.
Log-Cosh sits between them. For small errors it behaves like MSE (gradient shrinks smoothly toward zero). For large errors it behaves like MAE (gradient caps at ±1). No kink, no undefined gradient, no outlier explosion.
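The gradient comparison above can be checked numerically. This is a minimal sketch; the helper name `per_sample_grads` is illustrative and not from any library, and the gradients are taken per sample with respect to the error.

```python
import numpy as np

def per_sample_grads(e):
    """Gradient of each per-sample loss with respect to the error e."""
    e = np.asarray(e, dtype=float)
    return {
        "mse": 2.0 * e,          # d/de of e^2
        "mae": np.sign(e),       # d/de of |e| (np.sign gives 0 at e == 0, where it is undefined)
        "log_cosh": np.tanh(e),  # d/de of log(cosh(e))
    }

errors = np.array([0.01, 0.5, 30000.0])
grads = per_sample_grads(errors)
for name, g in grads.items():
    print(f"{name:>8}: {g}")
```

For the tiny error, the Log-Cosh gradient shrinks along with the MSE gradient while MAE stays at 1; for the 30,000 error, Log-Cosh and MAE are both capped at 1 while the MSE gradient is 60,000.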
Anchor: house price prediction, 5 samples.
y_true = [300000, 180000, 450000, 120000, 350000]
y_pred = [320000, 165000, 480000, 135000, 340000]
errors = [20000, -15000, 30000, 15000, -10000]
The Formula
L = (1/n) Σ log(cosh(ŷᵢ − yᵢ))
where cosh(x) = (eˣ + e⁻ˣ) / 2 — the hyperbolic cosine.
cosh is always ≥ 1 and symmetric around 0. Near 0 it is approximately 1 + x²/2; for large |x| it grows exponentially, like e^|x|/2. Taking the log turns that into quadratic growth near 0 and linear growth far from 0, which is what gives Log-Cosh its hybrid behavior.
Small vs Large Error Behavior
For small x: cosh(x) ≈ 1 + x²/2, so log(cosh(x)) ≈ log(1 + x²/2) ≈ x²/2.
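Example: error = 0.1: log(cosh(0.1)) ≈ 0.004992, while (0.1)²/2 = 0.005. That is half the squared error, so the shape matches MSE up to a factor of 2. Note that an error of 1,000 in raw price units is not small in this sense: log(cosh(1000)) ≈ 999.31, not 500,000.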
For large x: cosh(x) ≈ eˣ/2, so log(cosh(x)) ≈ x − log(2).
Example: error = 100,000 (large): log(cosh(100000)) ≈ 100,000 − 0.693 ≈ 99,999 — linear in the error, like MAE.
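The transition happens around |x| ≈ 1–2, measured in the raw units of the error. For |x| well below 1, Log-Cosh is nearly identical to x²/2 (half the squared error). For |x| above about 3, it is nearly identical to |x| − log(2), the MAE term shifted by a constant.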
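To see where each approximation takes over, a short sketch can tabulate log(cosh(x)) next to x²/2 and |x| − log(2). The sample points are arbitrary, and `log_cosh` is a helper defined here using a standard overflow-safe rewrite of the formula.

```python
import numpy as np

def log_cosh(x):
    # Overflow-safe evaluation: |x| + log(1 + exp(-2|x|)) - log(2)
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

xs = np.array([0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0])
exact = log_cosh(xs)
quadratic = xs**2 / 2.0             # small-x approximation
linear = np.abs(xs) - np.log(2.0)   # large-x approximation

print(f"{'x':>5} | {'log(cosh)':>10} | {'x^2/2':>8} | {'|x|-log2':>9}")
for x, ex, q, lin in zip(xs, exact, quadratic, linear):
    print(f"{x:>5.1f} | {ex:>10.5f} | {q:>8.3f} | {lin:>9.5f}")
```

Reading down the table, the quadratic column tracks the exact value only for the smallest inputs, and the linear column takes over as x grows.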
Computing on the Anchor
| Sample | error | cosh(error) | log(cosh) | grad=tanh(e) |
|---|---|---|---|---|
| 1 | 20000 | ≈ e²⁰⁰⁰⁰/2 (huge) | ≈ 19999.31 | ≈ +1.0000 |
| 2 | −15000 | ≈ e¹⁵⁰⁰⁰/2 (huge) | ≈ 14999.31 | ≈ −1.0000 |
| 3 | 30000 | ≈ e³⁰⁰⁰⁰/2 | ≈ 29999.31 | ≈ +1.0000 |
| 4 | 15000 | ≈ e¹⁵⁰⁰⁰/2 | ≈ 14999.31 | ≈ +1.0000 |
| 5 | −10000 | ≈ e¹⁰⁰⁰⁰/2 | ≈ 9999.31 | ≈ −1.0000 |
All anchor errors are in the tens of thousands, deep in the large-error regime where Log-Cosh ≈ MAE. The gradients are essentially ±1, capped like MAE. If the errors were rescaled (for example, dividing by 100,000 so they fall between 0.1 and 0.3 in magnitude), they would sit in the small-error regime instead.
Log-Cosh Loss ≈ (19999.31 + 14999.31 + 29999.31 + 14999.31 + 9999.31) / 5 ≈ 17999.31
At this scale, Log-Cosh ≈ MAE − log(2) per sample (the constant shift from the large-x approximation).
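The relationship to MAE can be verified directly on the anchor data. This sketch uses an overflow-safe `log_cosh` helper (defined here, not a library function); the division by 100,000 at the end is the illustrative rescaling mentioned earlier, not a recommended preprocessing step.

```python
import numpy as np

def log_cosh(x):
    # Overflow-safe evaluation: |x| + log(1 + exp(-2|x|)) - log(2)
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
errors = y_pred - y_true

mae = np.mean(np.abs(errors))
loss = np.mean(log_cosh(errors))
print(f"MAE           = {mae:.4f}")
print(f"MAE - log(2)  = {mae - np.log(2.0):.4f}")
print(f"Log-Cosh loss = {loss:.4f}")

# Rescaled errors land in the small-error regime, where tanh(e) is close to e
scaled = errors / 100_000.0
print("scaled errors =", scaled)
print("tanh(scaled)  =", np.round(np.tanh(scaled), 4))
```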
Gradient
The gradient of log(cosh(x)) with respect to x is:
d/dx log(cosh(x)) = tanh(x)
This is the key property:
- For small x: tanh(x) ≈ x → gradient is proportional to error (like MSE)
- For large x: tanh(x) → ±1 → gradient is bounded (like MAE)
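The derivative follows from the chain rule, since d/dx cosh(x) = sinh(x) and sinh(x)/cosh(x) = tanh(x). As a sanity check, the sketch below compares tanh against a central finite difference of log(cosh(x)); the step size `h` and the test points are arbitrary choices.

```python
import numpy as np

def log_cosh(x):
    # Overflow-safe evaluation: |x| + log(1 + exp(-2|x|)) - log(2)
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def central_difference(f, x, h=1e-5):
    # Symmetric finite-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = np.array([-5.0, -0.5, 0.0, 0.01, 0.5, 5.0])
numeric = central_difference(log_cosh, xs)
analytic = np.tanh(xs)
max_gap = np.max(np.abs(numeric - analytic))

print("numeric :", np.round(numeric, 6))
print("analytic:", np.round(analytic, 6))
print("max gap :", max_gap)
```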
Three Loss Curves
Near zero: Log-Cosh follows x²/2, the same parabola as MSE up to a factor of 2. Far from zero: MAE and Log-Cosh are nearly identical (parallel lines, offset by log(2) ≈ 0.693).
Code
import numpy as np

def log_cosh(e):
    # Stable form of log(cosh(e)): |e| + log(1 + exp(-2|e|)) - log(2).
    # Avoids the overflow that np.cosh hits once |e| exceeds roughly 710.
    a = np.abs(e)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def log_cosh_loss(y_true, y_pred):
    return np.mean(log_cosh(y_pred - y_true))

def log_cosh_grad(y_true, y_pred):
    # Per-sample derivative with respect to the error (the 1/n factor is omitted).
    return np.tanh(y_pred - y_true)

y_true = np.array([300000, 180000, 450000, 120000, 350000], dtype=float)
y_pred = np.array([320000, 165000, 480000, 135000, 340000], dtype=float)
errors = y_pred - y_true

print(f"{'Sample':>6} | {'error':>8} | {'cosh(e)':>12} | {'log(cosh)':>10} | {'grad(tanh)':>10}")
with np.errstate(over="ignore"):  # naive cosh overflows to inf here; shown on purpose
    naive_cosh = np.cosh(errors)
for i, (e, c) in enumerate(zip(errors, naive_cosh)):
    print(f"{i+1:>6} | {e:>8.0f} | {c:>12.4f} | {log_cosh(e):>10.4f} | {np.tanh(e):>10.4f}")
print(f"\nLog-Cosh Loss: {log_cosh_loss(y_true, y_pred):.4f}")

Sample |    error |      cosh(e) |  log(cosh) | grad(tanh)
     1 |    20000 |          inf | 19999.3069 |     1.0000
     2 |   -15000 |          inf | 14999.3069 |    -1.0000
     3 |    30000 |          inf | 29999.3069 |     1.0000
     4 |    15000 |          inf | 14999.3069 |     1.0000
     5 |   -10000 |          inf |  9999.3069 |    -1.0000

Log-Cosh Loss: 17999.3069

cosh overflows double precision for errors this large: NumPy returns inf, so a naive np.log(np.cosh(e)) would also be inf. The log(cosh) values above come from the stable identity log(cosh(x)) = |x| + log(1 + e^(−2|x|)) − log(2), which reduces to |x| − log(2) for large |x|. An equivalent form is np.logaddexp(x, −x) − log(2).
The gradient is ±1.0000 for every sample, confirming the large-error, MAE-like regime.
Related Concepts
Log-Cosh is one of three hybrid approaches to the MSE/MAE trade-off. Huber loss (02-regression-losses.md) is the piecewise version: it is exactly quadratic below a threshold δ and exactly linear above it. Its gradient is continuous, but its second derivative jumps from 1 to 0 at |error| = δ. Log-Cosh is smooth everywhere: the transition from quadratic to linear happens continuously through the tanh gradient, and the second derivative, 1 − tanh²(x), is continuous too. For most regression tasks they perform similarly; choose Log-Cosh when you need a twice-differentiable loss (some second-order optimizers use the second derivative), and Huber when you want an explicit, tunable threshold δ.
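One way to see the difference at the threshold is to compare first and second derivatives on either side of it. In this sketch δ = 1 is an arbitrary illustrative value, the function names are made up for the example, and Huber is taken in its common form (e²/2 inside the threshold, δ(|e| − δ/2) outside).

```python
import numpy as np

def huber_grad(e, delta=1.0):
    # First derivative of Huber: e inside [-delta, delta], +/-delta outside
    return np.clip(e, -delta, delta)

def huber_hess(e, delta=1.0):
    # Second derivative: 1 inside the threshold, 0 outside (jumps at |e| = delta)
    return (np.abs(e) <= delta).astype(float)

def log_cosh_grad(e):
    return np.tanh(e)

def log_cosh_hess(e):
    # d/de tanh(e) = 1 - tanh(e)^2, which varies continuously with e
    return 1.0 - np.tanh(e) ** 2

e = np.array([0.5, 0.99, 1.01, 2.0])
print("error         :", e)
print("Huber grad    :", huber_grad(e))
print("Huber hess    :", huber_hess(e))
print("Log-Cosh grad :", np.round(log_cosh_grad(e), 4))
print("Log-Cosh hess :", np.round(log_cosh_hess(e), 4))
```

Between 0.99 and 1.01 the Huber gradient barely moves while its second derivative drops from 1 to 0; both Log-Cosh derivatives change only slightly across the same interval.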
Honest Limitations
cosh overflows double precision once |x| exceeds about 710, so with raw house-price errors even the anchor samples overflow, not just extreme outliers. A numerically stable implementation uses the identity log(cosh(x)) = |x| + log(1 + e^{-2|x|}) − log(2), which holds for every x. PyTorch ships Huber-style losses (nn.HuberLoss, nn.SmoothL1Loss) but no Log-Cosh loss; the need for this kind of care may be one reason.
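The overflow and its fix can be sketched side by side. The three function names below are illustrative; `np.logaddexp(x, -x)` computes log(eˣ + e⁻ˣ) stably, which equals log(2·cosh(x)), so subtracting log(2) gives an alternative stable form.

```python
import numpy as np

def log_cosh_naive(x):
    # Direct translation of the formula; cosh overflows for |x| above roughly 710
    return np.log(np.cosh(x))

def log_cosh_stable(x):
    # |x| + log(1 + exp(-2|x|)) - log(2)
    a = np.abs(x)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

def log_cosh_logaddexp(x):
    # log(e^x + e^-x) - log(2), with the log-sum-exp handled by NumPy
    return np.logaddexp(x, -x) - np.log(2.0)

xs = np.array([0.5, 10.0, 700.0, 1000.0, 500000.0])
with np.errstate(over="ignore"):  # silence the expected overflow warning
    naive = log_cosh_naive(xs)
stable = log_cosh_stable(xs)
alt = log_cosh_logaddexp(xs)

for x, n, s, a in zip(xs, naive, stable, alt):
    print(f"x={x:>9.1f} | naive={n:>12.4f} | stable={s:>12.4f} | logaddexp={a:>12.4f}")
```

The naive version agrees with the stable ones up to x = 700 and returns inf beyond the overflow point, while both stable forms stay finite.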
Log-Cosh is less interpretable than MSE or MAE. You cannot directly explain "our model minimizes log-cosh loss" to a business stakeholder without defining hyperbolic cosine. For production systems where the loss function must be documented for non-technical audiences, MAE or RMSE is usually preferred.
Log-Cosh is not a first-class loss in every major framework. Keras includes it as tf.keras.losses.LogCosh, but PyTorch's torch.nn provides MSE, MAE (L1Loss), and Huber without a Log-Cosh counterpart, so there it requires a custom implementation. This is not a technical barrier, but it adds maintenance burden, and a naive custom version inherits the overflow problem described above.
Test Your Understanding
- For error = 0.5 (small), compute log(cosh(0.5)) exactly. Then compute (0.5)²/2. How close are they? This demonstrates the quadratic approximation for small errors.
- For error = 10 (large), compute log(cosh(10)) using the approximation |x| − log(2). Compare to the actual value. What is the error in the approximation?
- The anchor errors (10,000–30,000 in magnitude) are all in the large-error regime where Log-Cosh ≈ MAE. If you normalize the data by dividing by 100,000 first, what regime would the errors fall in? Would gradient behavior change?
- Log-Cosh gradient is tanh(error). MSE gradient is 2×error. For error = 0.1, which gradient is larger? For error = 5.0, which is larger? At what error value do the two gradients cross?
- A regression model predicts house prices and most errors are in [−5000, 5000] but one training sample has an error of 500,000 (a data entry mistake). Compare how MSE, MAE, and Log-Cosh respond to this outlier in terms of (a) loss contribution and (b) gradient magnitude.