Log-Normal Distribution
Most ML training runs finish in similar amounts of time, but a few run for much longer than you'd expect. Response latency in distributed systems has the same pattern: most requests are fast, some are painfully slow, and the slow ones are much slower than intuition suggests. This right-skewed shape, where the bulk of observations cluster at the low end and the tail stretches far to the right, is a common signature of a multiplicative process. The Log-Normal distribution is the mathematical model for that signature, and it's the distribution to try before assuming Normal whenever your data is strictly positive and visibly right-skewed.
The DS/ML anchor
Throughout this post we'll work with model training time in hours. A team logs training duration for 200 experiment runs. The training time distribution is clearly right-skewed: most runs take 2–5 hours, but a few pathological cases take 15–30 hours. After taking the natural log of training times, the distribution looks approximately Normal with μ = 1.35 (log-hours) and σ = 0.48 (log-hours). So training_time ~ LogNormal(μ = 1.35, σ = 0.48).
The Definition
If Y = ln(training_time) ~ N(μ, σ²), then training_time follows a Log-Normal distribution with parameters μ and σ.
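The PDF:

$$f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$

for x > 0.

With actual values: μ = 1.35, σ = 0.48:

$$f(x) = \frac{1}{0.48\, x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - 1.35)^2}{2 \times 0.2304}\right) \approx \frac{1}{1.203\, x} \exp\left(-\frac{(\ln x - 1.35)^2}{0.4608}\right)$$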
The critical distinction: μ and σ are NOT the mean and standard deviation of training_time. They are the mean and standard deviation of ln(training_time).
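To make that distinction concrete, here is a minimal sketch that codes the standard Log-Normal density by hand and integrates it numerically. The helper names (`lognormal_pdf`, `integrate`) and the 100-hour integration cutoff are illustrative assumptions, not part of any library.

```python
import math

MU, SIGMA = 1.35, 0.48  # log-scale parameters from the post

def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Log-Normal density at x, written directly from the formula."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

UPPER = 100.0  # hours; assumed cutoff, the density beyond it is negligible here

total_mass = integrate(lognormal_pdf, 0.0, UPPER)
mean_hours = integrate(lambda x: x * lognormal_pdf(x), 0.0, UPPER)
mean_log = integrate(lambda x: math.log(x) * lognormal_pdf(x), 0.0, UPPER)

print(f"total probability : {total_mass:.4f}")
print(f"E[training_time]  : {mean_hours:.2f} hours")
print(f"E[ln(time)]       : {mean_log:.3f}")
```

The last two numbers differ: the average of ln(training_time) recovers μ, while the average of training_time itself is a different quantity measured in hours.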
Why Log-Normal Appears
Training time is, roughly, a product of many factors: model size × dataset size × hardware utilization × number of gradient accumulation steps × learning rate schedule behavior. When independent positive factors multiply together, their logarithms add. By the Central Limit Theorem, a sum of many independent terms with finite variance is approximately Normal, so the logarithm of training time is approximately Normal. Exponentiating gives a Log-Normal for the original variable.
This "multiplicative CLT" explains the pattern across many DS/ML contexts: model inference latency (each component of the serving stack multiplies in), user session duration, download sizes, and file sizes can all be viewed as products of many independent factors.
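The argument can be checked with a small simulation. This is only a sketch: the number of factors, the Uniform(0.7, 1.4) range for each factor, and the `skewness` helper are all arbitrary assumptions chosen for illustration.

```python
import math
import random
import statistics

def skewness(values):
    """Third standardized moment of a sample (about 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = math.sqrt(statistics.fmean((v - mean) ** 2 for v in values))
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

rng = random.Random(0)
N_RUNS, N_FACTORS = 20_000, 30  # hypothetical simulation sizes

# Each simulated duration is a product of independent positive factors.
durations = [
    math.prod(rng.uniform(0.7, 1.4) for _ in range(N_FACTORS))
    for _ in range(N_RUNS)
]
log_durations = [math.log(d) for d in durations]

print(f"skewness of durations     : {skewness(durations):.2f}")
print(f"skewness of log durations : {skewness(log_durations):.2f}")
```

The products come out strongly right-skewed, while their logarithms are close to symmetric, which is what the sum-of-logs argument predicts.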
Mean, Variance, and Other Moments
With μ = 1.35 and σ = 0.48 for our training times:
Mean: E[X] = e^(μ + σ²/2) = e^(1.35 + 0.1152) = e^1.4652 ≈ 4.33 hours
Median: e^μ = e^1.35 ≈ 3.86 hours
Mode: e^(μ − σ²) = e^(1.35 − 0.2304) = e^1.1196 ≈ 3.06 hours
Variance: (e^(σ²) − 1) × e^(2μ + σ²) = (e^0.2304 − 1) × e^2.9304 ≈ 0.259 × 18.74 ≈ 4.85 hours²
Standard deviation: √4.85 ≈ 2.20 hours
Notice: mean (4.33) > median (3.86) > mode (3.06). This ordering is the signature of right-skewed distributions.
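The formulas above translate directly into code, and the mean and variance formulas can be inverted to recover μ and σ from a mean and standard deviation reported in hours. A minimal sketch follows; the function names `lognormal_moments` and `params_from_mean_sd` are hypothetical helpers, not library functions.

```python
import math

def lognormal_moments(mu, sigma):
    """Return (mean, median, mode, sd) on the original scale."""
    mean = math.exp(mu + sigma**2 / 2)
    median = math.exp(mu)
    mode = math.exp(mu - sigma**2)
    variance = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)
    return mean, median, mode, math.sqrt(variance)

def params_from_mean_sd(mean, sd):
    """Invert the mean and variance formulas to recover (mu, sigma)."""
    sigma_sq = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2
    return mu, math.sqrt(sigma_sq)

mean, median, mode, sd = lognormal_moments(1.35, 0.48)
print(f"mean={mean:.2f}h  median={median:.2f}h  mode={mode:.2f}h  sd={sd:.2f}h")

mu_back, sigma_back = params_from_mean_sd(mean, sd)
print(f"recovered mu={mu_back:.3f}, sigma={sigma_back:.3f}")
```

The inverse direction is useful when a report gives only the mean and standard deviation of training time and you want the log-scale parameters.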
CDF
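The CDF for training time answers: what fraction of runs finish within x hours?

$$F(x) = P(X \le x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right), \quad x > 0$$

where Φ is the standard Normal CDF.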
Trace Table: Training Time Calculations
With training_time ~ LogNormal(μ = 1.35, σ = 0.48):
| Quantity | Formula | Values | Result |
|---|---|---|---|
| P(training_time ≤ 5h) | Φ((ln(5) − μ) / σ) | Φ((1.6094 − 1.35) / 0.48) = Φ(0.5405) | 0.706 |
| P(training_time > 8h) | 1 − Φ((ln(8) − μ) / σ) | 1 − Φ((2.079 − 1.35) / 0.48) = 1 − Φ(1.519) | 0.064 |
| Mean | e^(μ + σ²/2) | e^(1.35 + 0.1152) | 4.33 hours |
| 95th percentile | e^(μ + 1.645σ) | e^(1.35 + 0.790) = e^2.140 | 8.50 hours |
About 6.4% of runs exceed 8 hours. The 95th percentile run takes about 8.5 hours, so the team can plan compute budgets around that figure.
Relationship to Normal
If Y ~ N(μ, σ²), then X = e^Y ~ LogNormal(μ, σ²). Conversely, if X ~ LogNormal(μ, σ²), then ln(X) ~ N(μ, σ²). This duality means every Log-Normal probability calculation reduces to a Normal calculation on the log scale:

$$P(X \le x) = P(\ln X \le \ln x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right)$$
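As a sketch of that reduction, the snippet below never touches a Log-Normal routine at all. It uses only the standard library's `statistics.NormalDist` for Y = ln(training_time) and moves between scales with `math.log` and `math.exp`. The helper names `prob_within` and `quantile_hours` are assumptions for illustration.

```python
import math
from statistics import NormalDist

# Y = ln(training_time) is Normal with the post's parameters.
log_time = NormalDist(mu=1.35, sigma=0.48)

def prob_within(hours):
    """P(training_time <= hours), computed as P(Y <= ln(hours))."""
    return log_time.cdf(math.log(hours))

def quantile_hours(p):
    """p-th quantile of training_time: exponentiate the Normal quantile of Y."""
    return math.exp(log_time.inv_cdf(p))

print(f"P(time <= 5h)   = {prob_within(5):.4f}")
print(f"P(time > 8h)    = {1 - prob_within(8):.4f}")
print(f"95th percentile = {quantile_hours(0.95):.2f} hours")
```

Quantiles work the same way in reverse: find the Normal quantile on the log scale, then exponentiate.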
Python Implementation
from scipy import stats
import numpy as np

# Log-scale parameters: mean and standard deviation of ln(training_time)
mu_log, sigma_log = 1.35, 0.48

# SciPy's parameterization: shape s = sigma, scale = exp(mu)
lognorm_rv = stats.lognorm(s=sigma_log, scale=np.exp(mu_log))

mean_time = lognorm_rv.mean()
median_time = lognorm_rv.median()
mode_time = np.exp(mu_log - sigma_log**2)  # computed from the closed-form formula

print(f"Mean training time : {mean_time:.2f} hours")
print(f"Median training time : {median_time:.2f} hours")
print(f"Mode training time : {mode_time:.2f} hours")
print(f"Std dev : {lognorm_rv.std():.2f} hours")
print(f"\nP(time <= 5h) = {lognorm_rv.cdf(5):.4f}")
print(f"P(time > 8h) = {1 - lognorm_rv.cdf(8):.4f}")
print(f"95th percentile: {lognorm_rv.ppf(0.95):.2f} hours")

# Simulate 200 runs and check that the log of the sample looks Normal
training_times = lognorm_rv.rvs(size=200, random_state=42)
log_times = np.log(training_times)
print(f"\nLog-transformed: mean={log_times.mean():.3f} (expected {mu_log})")
print(f"Log-transformed: std ={log_times.std():.3f} (expected {sigma_log})")
shapiro_stat, shapiro_p = stats.shapiro(log_times)
print(f"Shapiro-Wilk on log(time): p={shapiro_p:.4f} ({'Normal' if shapiro_p > 0.05 else 'Not Normal'})")

Output:

Mean training time : 4.33 hours
Median training time : 3.86 hours
Mode training time : 3.06 hours
Std dev : 2.20 hours

P(time <= 5h) = 0.7056
P(time > 8h) = 0.0643
95th percentile: 8.50 hours

Log-transformed: mean=1.348 (expected 1.35)
Log-transformed: std =0.481 (expected 0.48)
Shapiro-Wilk on log(time): p=0.4821 (Normal)
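In practice μ and σ are estimated from logged durations rather than given. The sketch below shows two routes that should agree: taking the mean and standard deviation of the logs directly, and calling `scipy.stats.lognorm.fit` with the location pinned at zero. The data here is synthetic (an assumption standing in for real logs), and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for 200 logged training times (hours).
rng = np.random.default_rng(0)
times = rng.lognormal(mean=1.35, sigma=0.48, size=200)

# Route 1: estimate on the log scale.
log_t = np.log(times)
mu_hat = log_t.mean()
sigma_hat = log_t.std(ddof=1)  # sample standard deviation

# Route 2: SciPy's fit with loc fixed at 0 returns (shape, loc, scale),
# where shape plays the role of sigma and scale the role of exp(mu).
shape, loc, scale = stats.lognorm.fit(times, floc=0)

print(f"log-scale estimates : mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"scipy fit           : mu={np.log(scale):.3f}, sigma={shape:.3f}")
```

Fixing `floc=0` matters: without it SciPy also fits a shift parameter, and the result no longer corresponds to the two-parameter Log-Normal used in this post.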
Distinguishing Log-Normal from Power Law
Both can produce right-skewed distributions, but they arise differently and behave differently in the extreme tail. On a log-log plot of the density or of the survival function P(X > x), a Log-Normal curves downward, while a Power Law is a straight line. For training times, which result from bounded, independent multiplicative factors, Log-Normal is usually more appropriate. For phenomena driven by preferential attachment (city sizes, wealth), Power Law is usually more appropriate.
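The log-log behavior can be checked numerically without plotting anything. This sketch measures the local slope of log P(X > x) against log x at several points. The Pareto parameters (`ALPHA`, `X_MIN`) are hypothetical values picked only for comparison, and `loglog_slope` is an illustrative helper.

```python
import math
from statistics import NormalDist

MU, SIGMA = 1.35, 0.48   # Log-Normal parameters from the post
ALPHA, X_MIN = 2.0, 1.0  # hypothetical Pareto (power-law) parameters

def lognormal_survival(x):
    """P(X > x) for the Log-Normal, via the standard Normal: Phi(-z)."""
    z = (math.log(x) - MU) / SIGMA
    return NormalDist().cdf(-z)

def pareto_survival(x):
    """P(X > x) for a Pareto distribution, valid for x >= X_MIN."""
    return (X_MIN / x) ** ALPHA

def loglog_slope(survival, x, factor=1.1):
    """Slope of log S against log x, measured between x and factor * x."""
    rise = math.log(survival(factor * x)) - math.log(survival(x))
    return rise / math.log(factor)

xs = [5, 10, 20, 40]
lognormal_slopes = [loglog_slope(lognormal_survival, x) for x in xs]
pareto_slopes = [loglog_slope(pareto_survival, x) for x in xs]

for x, ln_s, pl_s in zip(xs, lognormal_slopes, pareto_slopes):
    print(f"x={x:>2}  Log-Normal slope={ln_s:7.2f}  Pareto slope={pl_s:6.2f}")
```

The Pareto slope stays fixed at −α, while the Log-Normal slope keeps getting steeper as x grows. That steepening is the downward curvature on the log-log plot.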
Related Concepts
Log-Normal builds directly on the Normal distribution from the previous post — it is Normal applied after a logarithmic transformation. The derivation from multiplicative processes parallels the Normal's derivation from additive processes via CLT. Understanding Log-Normal is the prerequisite for log-linear models in regression (Poisson log-link, log-transforming skewed response variables), for pricing options in quantitative finance (Black-Scholes assumes log-normal asset prices), and for survival analysis where event times are often log-normally distributed. In MLOps, log-normal modeling of training time and inference latency directly informs compute budget planning and SLA design.
Honest Limitations
Log-Normal requires all values to be strictly positive. Training times are always positive, but some ML variables have zeros — zero user interactions, zero failed requests. If your data includes zeros, Log-Normal is inapplicable and you'll need a zero-inflated model.
When σ is large (say σ > 1), the mean can be dramatically larger than the median. For σ = 1, the mean is e^(0.5) ≈ 1.65 times the median. This surprises people who use the mean as their "typical" value — reporting the median is more informative for right-skewed distributions.
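Because mean = e^(μ + σ²/2) and median = e^μ, their ratio is e^(σ²/2) and does not depend on μ. A short sketch of how quickly it grows, with arbitrarily chosen σ values and a hypothetical helper name:

```python
import math

def mean_to_median_ratio(sigma):
    """mean / median = exp(mu + sigma^2 / 2) / exp(mu) = exp(sigma^2 / 2)."""
    return math.exp(sigma**2 / 2)

for sigma in (0.25, 0.5, 1.0, 2.0):
    print(f"sigma={sigma:<4}  mean/median = {mean_to_median_ratio(sigma):.2f}")
```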
Also, Log-Normal can look similar to Power Law in the bulk but differs dramatically in the extreme tail. Always check a QQ plot on the log-transformed data before committing to Log-Normal.
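One way to run that check without drawing the plot is to read the correlation coefficient that `scipy.stats.probplot` returns alongside the QQ points. The sketch below compares a synthetic Log-Normal sample with a synthetic Pareto sample. Both samples, the Pareto exponent, and the `qq_correlation` helper are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins; real measurements would replace these.
lognormal_sample = rng.lognormal(mean=1.35, sigma=0.48, size=2000)
pareto_sample = 1.0 + rng.pareto(2.0, size=2000)  # classical Pareto, x_min = 1

def qq_correlation(data):
    """Correlation of the Normal QQ plot of log(data); 1.0 is a perfect line."""
    (_, _), (_, _, r) = stats.probplot(np.log(data), dist="norm")
    return r

r_lognormal = qq_correlation(lognormal_sample)
r_pareto = qq_correlation(pareto_sample)
print(f"QQ correlation, Log-Normal sample : {r_lognormal:.4f}")
print(f"QQ correlation, Pareto sample     : {r_pareto:.4f}")
```

A correlation very close to 1 on the log-transformed data is consistent with Log-Normal. A visibly lower value, as for the Pareto sample, suggests the tail follows a different law. The number is a summary only, so it is still worth looking at the plot itself, especially at the upper end.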
Test Your Understanding
- A team's inference latency follows LogNormal(μ = 2.1, σ = 0.6) milliseconds. Calculate the mean, median, and mode latency. Which would you report to stakeholders and why?
- What fraction of inference requests have latency above 15 ms? (Use μ = 2.1, σ = 0.6.) Show the Z-score calculation step.
- A colleague transforms training times by taking their square root instead of their logarithm, and plots a histogram. Explain why the log transform is theoretically more principled for a multiplicative process, while the square root transform is more ad hoc.
- You fit LogNormal to a dataset and get μ = 0.8, σ = 1.5. Calculate the mean and median. How does the ratio mean/median change as σ increases? What does this tell you about high-variance log-normal distributions?
- A production monitoring system flags runs exceeding the 99th percentile of historical training times as anomalies. With LogNormal(1.35, 0.48), what is the 99th percentile threshold in hours?