Skewness and Kurtosis
Mean and standard deviation describe only two properties of a distribution — center and spread. They say nothing about symmetry or tail weight. Two distributions can share the same mean and SD yet look completely different. Skewness and kurtosis fill that gap.
Anchors used throughout this post:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] # six CV fold scores
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890] # model inference latency (ms)
The two anchors differ sharply in shape: the accuracy data is nearly symmetric, while the latency data is right-skewed, with a long tail created by the 890 ms spike.
Why Mean ± SD Is Not Enough
Consider two sets of six model performance scores. Both have a mean of about 0.84 and a sample standard deviation of similar size (s ≈ 0.051 for Set A, s ≈ 0.065 for Set B):
- Set A: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] — values cluster evenly around the mean
- Set B: [0.80, 0.80, 0.81, 0.82, 0.84, 0.97] — most values are below the mean; one large outlier pulls it up
Which model would you prefer in production? Set A is more predictable. Set B's mean is the same, but you cannot trust it — there is hidden asymmetry. That asymmetry is skewness. The heavier tail risk is kurtosis.
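To put a number on that hidden asymmetry, here is a minimal sketch that computes the mean and a plain population skewness (third central moment divided by σ³) for both sets. The helper `skewness` is an illustrative function written for this post, not a library call.

```python
from statistics import mean

def skewness(xs):
    """Population skewness: third central moment divided by sigma cubed."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

set_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
set_b = [0.80, 0.80, 0.81, 0.82, 0.84, 0.97]

for name, xs in (("Set A", set_a), ("Set B", set_b)):
    print(f"{name}: mean={mean(xs):.3f}, skewness={skewness(xs):+.2f}")
```

The means come out nearly identical, while Set B's skewness is several times larger than Set A's.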
[Figure: three distributions, all centered at the same mean.]
The Four Moments
Skewness and kurtosis are the 3rd and 4th standardized moments of a distribution. Every moment captures a different property:
| Moment | Name | Formula | What It Measures |
|---|---|---|---|
| 1st | Mean | μ = E[X] | Center |
| 2nd | Variance | σ² = E[(X−μ)²] | Spread |
| 3rd | Skewness | γ₁ = E[(X−μ)³] / σ³ | Asymmetry |
| 4th | Kurtosis | E[(X−μ)⁴] / σ⁴ (excess kurtosis: γ₂ = E[(X−μ)⁴] / σ⁴ − 3) | Tail weight |
Why cube for skewness: Cubing preserves the sign of the deviation. A positive deviation cubed stays positive; a negative one stays negative. A distribution with a heavier right tail produces many large positive cubed deviations that are not cancelled by the smaller negative ones → net positive skewness.
Why 4th power for kurtosis: The 4th power amplifies large deviations enormously — an outlier at 3σ contributes 81 times more than a point at σ. A distribution with heavy tails (many large deviations) accumulates a large 4th moment → high kurtosis.
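The table translates almost line for line into code. The sketch below computes the population versions of all four quantities for the latency anchor; `standardized_moments` is a hypothetical helper name, and the kurtosis it returns is the raw value (subtract 3 for excess kurtosis).

```python
def standardized_moments(xs):
    """Return mean, variance, skewness and raw kurtosis (population versions)."""
    n = len(xs)
    mu = sum(xs) / n

    def central(k):
        # k-th central moment: average of (x - mu)^k
        return sum((x - mu) ** k for x in xs) / n

    var = central(2)
    sigma = var ** 0.5
    return {
        "mean": mu,
        "variance": var,
        "skewness": central(3) / sigma ** 3,
        "kurtosis": central(4) / sigma ** 4,  # raw; subtract 3 for excess
    }

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
for name, value in standardized_moments(latency_ms).items():
    print(f"{name:>8}: {value:.3f}")

# The 4th power at work: a 3-sigma point versus a 1-sigma point
print(3 ** 4 / 1 ** 4)  # 81.0
```

These are the uncorrected (population) formulas; the sample versions with small-sample correction factors give somewhat different numbers for n this small.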
Skewness
Sample skewness formula:
γ₁ = [n / ((n−1)(n−2))] × Σ[(xᵢ − x̄)/s]³
The correction factor n/((n−1)(n−2)) adjusts for small-sample bias in the same spirit as Bessel's correction.
Phase 1 — Compute x̄ and s
Using the accuracy anchor:
x̄ = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88) / 6 = 5.03 / 6 = 0.8383
s = √[Σ(xᵢ − x̄)² / (n−1)] = √(0.013083 / 5) = 0.0512
Phase 2 — Standardize each value
Compute (xᵢ − x̄) / s for each fold:
| Fold | xᵢ | xᵢ − x̄ | (xᵢ − x̄)/s | [(xᵢ − x̄)/s]³ |
|---|---|---|---|---|
| 1 | 0.82 | −0.0183 | −0.358 | −0.0460 |
| 2 | 0.79 | −0.0483 | −0.945 | −0.8436 |
| 3 | 0.91 | +0.0717 | +1.401 | +2.7500 |
| 4 | 0.85 | +0.0117 | +0.228 | +0.0119 |
| 5 | 0.78 | −0.0583 | −1.140 | −1.4830 |
| 6 | 0.88 | +0.0417 | +0.815 | +0.5404 |
| Sum | | | | 0.9297 |
Phase 3 — Apply correction factor
n / ((n−1)(n−2)) = 6 / (5 × 4) = 6/20 = 0.30
γ₁ = 0.30 × 0.9297 = 0.2789 ≈ 0.28
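The three phases can be sketched as a single function. This is a minimal stdlib version of the sample skewness formula above; `sample_skewness` is an illustrative name, and `statistics.stdev` supplies the n−1 standard deviation.

```python
from statistics import mean, stdev

def sample_skewness(xs):
    """Sample skewness with the n / ((n-1)(n-2)) correction factor."""
    n = len(xs)
    if n < 3:
        raise ValueError("sample skewness needs at least 3 values")
    x_bar = mean(xs)                      # Phase 1: mean
    s = stdev(xs)                         # Phase 1: sample SD (n-1 denominator)
    z_cubed = sum(((x - x_bar) / s) ** 3 for x in xs)  # Phase 2: standardize, cube, sum
    return n / ((n - 1) * (n - 2)) * z_cubed           # Phase 3: correction factor

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
print(f"{sample_skewness(accuracy):.2f}")  # 0.28
```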
[Figure: both datasets on a number line.] For accuracy, values cluster nearly symmetrically around the mean. For latency, values pile up on the left with a long right tail.
Notice that in the latency data, the mean (261.4) is pulled far to the right of the median (151.5) by the 890 ms spike. In right-skewed data, the mean usually exceeds the median.
Interpretation Scale
| \|γ₁\| | Label | Example |
|---|---|---|
| < 0.5 | Approximately symmetric | CV accuracy scores |
| 0.5 – 1.0 | Moderately skewed | Daily website traffic |
| ≥ 1.0 | Highly skewed | Model latency, income, error counts |
Positive skewness (right-skewed): long tail to the right, typically mean > median > mode.
Negative skewness (left-skewed): long tail to the left, typically mean < median < mode.
Examples of left-skewed data: exam scores on an easy test, survival times after effective treatment.
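The mean/median ordering is a tendency of typical unimodal data, not a guarantee. As a small check, the sketch below uses a hypothetical eight-value dataset, constructed purely for illustration, whose third central moment is positive even though its mean sits below its median.

```python
from statistics import mean, median

def third_central_moment(xs):
    """Average cubed deviation from the mean; its sign is the sign of the skewness."""
    m = mean(xs)
    return sum((x - m) ** 3 for x in xs) / len(xs)

# Hypothetical data: a cluster at 0, two values below it, one larger value above it
data = [-2, -2, 0, 0, 0, 0, 0, 3]

print("mean   =", mean(data))                             # -0.125
print("median =", median(data))                           # 0.0
print("third central moment =", third_central_moment(data))  # positive
```

The single value at 3 dominates the cubed deviations, so the skewness is positive, while the two values at −2 pull the mean just below the median of 0.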
The Mean-Median Rule — and Its Limits
"In right-skewed data, mean > median" is often stated as if it were a theorem. It is not. It is an approximation (Pearson's rule) that holds for common unimodal distributions. Counterexamples exist: a distribution can have positive skewness but mean < median if the peak is asymmetric in the other direction. The correct statement is: skewness measures the asymmetry of the distribution via signed cubed deviations, not the mean-median gap.
When Skewness Matters for ML
- Log-normal features (latency, income): apply a log transform before modeling to reduce right skew and make the distribution closer to normal, which helps linear models and distance-based methods.
- Right-skewed targets: MSE loss squares errors, so the large squared errors from the right tail dominate the gradient. Huber loss or MAE is more robust when the target is skewed.
- Skewed features in linear models: high skewness distorts coefficient estimates and feature importance. PCA is sensitive to skewness because the variance calculation is dominated by the stretched tail.
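To illustrate the log-transform point from the first bullet, here is a minimal sketch on the latency anchor. It uses the plain population skewness; the helper name `skewness` is illustrative. On a sample this small the transform reduces the skew rather than removing it.

```python
import math

def skewness(xs):
    """Population skewness: third central moment divided by sigma cubed."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
log_latency = [math.log(x) for x in latency_ms]  # natural log compresses the right tail

print(f"raw skewness: {skewness(latency_ms):.2f}")
print(f"log skewness: {skewness(log_latency):.2f}")
```

The 890 ms value is still the largest point after the transform, but its distance from the rest shrinks, so its cubed deviation no longer dominates as heavily.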
Kurtosis and Excess Kurtosis
Kurtosis measures how much probability mass sits in the tails relative to a normal distribution with the same mean and variance.
Sample kurtosis formula:
γ₂ = [n(n+1) / ((n−1)(n−2)(n−3))] × Σ[(xᵢ − x̄)/s]⁴ − 3(n−1)² / ((n−2)(n−3))
The −3 term: the normal distribution has a raw kurtosis of exactly 3. Subtracting 3 gives excess kurtosis, so the normal has excess kurtosis = 0. This is the convention used by scipy, pandas, and most statistical software. Always confirm which convention a library uses before interpreting the number.
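The sample formula above can be written out directly. This is a minimal stdlib sketch; `sample_excess_kurtosis` is an illustrative name, and the two terms are split into `scale` and `shift` only for readability.

```python
from statistics import mean, stdev

def sample_excess_kurtosis(xs):
    """Sample excess kurtosis with the small-sample correction terms."""
    n = len(xs)
    if n < 4:
        raise ValueError("sample kurtosis needs at least 4 values")
    x_bar = mean(xs)
    s = stdev(xs)  # sample SD (n-1 denominator)
    z_fourth = sum(((x - x_bar) / s) ** 4 for x in xs)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    shift = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))  # tends to 3 as n grows
    return scale * z_fourth - shift

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
print(f"accuracy:   {sample_excess_kurtosis(accuracy):.2f}")
print(f"latency_ms: {sample_excess_kurtosis(latency_ms):.2f}")
```

The accuracy scores come out negative (lighter tails than a normal) and the latency data strongly positive.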
Three Categories
| Category | Excess Kurtosis | Tails | Shape | Example |
|---|---|---|---|---|
| Mesokurtic | ≈ 0 | Normal weight | Standard bell curve | Normal distribution |
| Leptokurtic | > 0 | Heavier than normal | Fat tails, often with a sharper peak | Financial returns, t-distribution |
| Platykurtic | < 0 | Lighter than normal | Flatter peak, thin tails | Uniform distribution |
Kurtosis ≠ Peakedness
The common description — "kurtosis measures how peaked a distribution is" — is wrong. Kurtosis measures tail weight. A leptokurtic distribution can have a lower peak than the normal distribution while still having heavier tails and higher kurtosis. The excess probability mass moves from the shoulders of the distribution into the tails; the peak height is a byproduct, not the cause. Describing kurtosis as peakedness leads to misinterpretation — you should watch the tails, not the top.
When Kurtosis Matters for ML
- Heavy tails (leptokurtic) during training: gradient updates from outlier batches can be explosive. Loss distributions during training instability are often leptokurtic. Gradient clipping addresses this.
- t-distribution: leptokurtic, with excess kurtosis = 6 / (df − 4) for df > 4. As df → ∞, it converges to the normal (excess kurtosis → 0). These heavier tails are why small-sample t-tests use wider critical values than the normal.
- Normality tests: Jarque-Bera and D'Agostino's K² combine skewness and kurtosis to detect departures from normality. Shapiro-Wilk is built on order statistics instead, but it is also sensitive to both.
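The t-distribution formula from the list above is easy to tabulate. A minimal sketch, with `t_excess_kurtosis` as an illustrative helper name:

```python
def t_excess_kurtosis(df):
    """Excess kurtosis of Student's t, defined by this formula only for df > 4."""
    if df <= 4:
        raise ValueError("the formula 6 / (df - 4) requires df > 4")
    return 6 / (df - 4)

for df in (5, 10, 30, 100):
    print(f"df={df:>3}: excess kurtosis = {t_excess_kurtosis(df):.3f}")
```

At df = 5 the excess kurtosis is 6; by df = 100 it has dropped below 0.1, which is the convergence to the normal described above.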
Jarque-Bera Normality Test
Both skewness and kurtosis are combined into a single normality test:
JB = (n/6) × [γ₁² + (γ₂/2)²]
Under H₀ (data is normally distributed), skewness = 0 and excess kurtosis = 0, so JB ≈ 0. JB follows a chi-square distribution with 2 degrees of freedom asymptotically. A large JB → small p-value → reject normality.
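The statistic and its asymptotic p-value can be computed by hand. The sketch below plugs in the uncorrected (population) skewness and excess kurtosis, which is an assumption about which estimators to use; corrected estimators give a different number. The p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(−x/2). The function name `jarque_bera_by_hand` is illustrative.

```python
import math

def jarque_bera_by_hand(xs):
    """JB statistic from population moments, with the chi-square(2) p-value."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3
    jb = n / 6 * (skew ** 2 + (excess_kurt / 2) ** 2)
    p_value = math.exp(-jb / 2)  # survival function of chi-square with 2 df
    return jb, p_value

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
for name, xs in (("accuracy", accuracy), ("latency_ms", latency_ms)):
    jb, p = jarque_bera_by_hand(xs)
    print(f"{name}: JB={jb:.2f}, p={p:.2f}")
```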
Applied to both anchors:
from scipy import stats

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

# jarque_bera uses the biased (population) moment estimates internally
jb_acc, p_acc = stats.jarque_bera(accuracy)
jb_lat, p_lat = stats.jarque_bera(latency_ms)
print(f"accuracy: JB={jb_acc:.3f}, p={p_acc:.3f}")
print(f"latency_ms: JB={jb_lat:.3f}, p={p_lat:.3f}")

# skewness and excess kurtosis separately; bias=False applies the
# small-sample corrections used in the hand calculations
print(f"\naccuracy skewness={stats.skew(accuracy, bias=False):.3f}, "
      f"kurtosis(excess)={stats.kurtosis(accuracy, bias=False):.3f}")
print(f"latency_ms skewness={stats.skew(latency_ms, bias=False):.3f}, "
      f"kurtosis(excess)={stats.kurtosis(latency_ms, bias=False):.3f}")

Output:
accuracy: JB=0.509, p=0.775
latency_ms: JB=7.724, p=0.021

accuracy skewness=0.279, kurtosis(excess)=-1.490
latency_ms skewness=2.556, kurtosis(excess)=6.701

Interpret:
- accuracy: JB=0.51, p=0.78, so normality is not rejected. Skewness (0.28) is well within the symmetric range; the negative excess kurtosis (−1.49) reflects lighter tails than a normal, which makes sense for six bounded CV scores.
- latency_ms: JB=7.72, p≈0.02, so normality is rejected at the 5% level. Skewness=2.56 (highly right-skewed), excess kurtosis=6.70 (very heavy tails driven by the 890 ms spike). Apply a log transform or use nonparametric methods before any normality-assuming analysis.

Note: scipy's stats.skew() and stats.kurtosis() default to bias=True (population moments); pass bias=False to get the small-sample corrections from the formulas above. stats.kurtosis() returns excess kurtosis (Fisher's definition, subtracting 3). Pandas .kurt() also returns excess kurtosis, with the small-sample correction applied. stats.jarque_bera() always uses the biased moments, so its statistic differs from what you get by plugging the corrected γ₁ and γ₂ into the JB formula.
Computing Skewness for the Latency Anchor
For completeness, here is the latency skewness breakdown:
x̄_lat = (120+135+142+148+155+189+312+890) / 8 = 2091 / 8 = 261.4
s_lat = √(Σ(xᵢ − 261.4)² / 7) = 261.1
median = (148 + 155) / 2 = 151.5
| xᵢ | xᵢ − x̄ | (xᵢ − x̄)/s | [(xᵢ − x̄)/s]³ |
|---|---|---|---|
| 120 | −141.4 | −0.541 | −0.1587 |
| 135 | −126.4 | −0.484 | −0.1133 |
| 142 | −119.4 | −0.457 | −0.0955 |
| 148 | −113.4 | −0.434 | −0.0818 |
| 155 | −106.4 | −0.407 | −0.0676 |
| 189 | −72.4 | −0.277 | −0.0213 |
| 312 | +50.6 | +0.194 | +0.0073 |
| 890 | +628.6 | +2.407 | +13.950 |
| Sum | | | 13.419 |
Correction = 8 / (7 × 6) = 0.1905
γ₁ = 0.1905 × 13.419 = 2.556 ≈ 2.56
The 890 ms entry alone contributes 13.950 / 13.419 ≈ 104% of the total; the other seven values net out slightly negative. Removing it does not make the data symmetric, though: the 312 ms value then becomes the extreme point, and the corrected skewness of the remaining seven values is still about 2.1.
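To see how much each observation contributes, the sketch below recomputes the cubed standardized deviations and reports each one as a share of their sum. It uses only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
x_bar = mean(latency_ms)
s = stdev(latency_ms)  # sample SD (n-1 denominator)

# Cubed standardized deviation for every observation (values are unique, so a dict works)
cubes = {x: ((x - x_bar) / s) ** 3 for x in latency_ms}
total = sum(cubes.values())

for x, c in cubes.items():
    print(f"{x:>4} ms: {c:+8.4f}  ({c / total:+.1%} of the sum)")
```

A share above 100% is possible because the remaining terms are mostly negative and partly cancel the largest one.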
Practical Workflow
When you receive a new dataset, follow this sequence before choosing any statistical method: plot first (skewness is a confirmation, not a substitute for visual inspection) → quantify asymmetry → quantify tail weight → run a formal normality test → choose the right method.
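The last steps of the workflow can be sketched as a small decision helper. Everything here is illustrative: `suggest_next_step` is a hypothetical function, the skewness thresholds come from the interpretation scale earlier in the post, and the 5% significance level is an assumption. It takes already-computed statistics and does not replace the plot.

```python
def suggest_next_step(skewness, excess_kurtosis, normality_p_value, alpha=0.05):
    """Turn shape statistics into plain-language notes (illustrative thresholds)."""
    notes = []

    abs_skew = abs(skewness)
    if abs_skew < 0.5:
        notes.append("approximately symmetric")
    elif abs_skew < 1.0:
        notes.append("moderately skewed")
    else:
        notes.append("highly skewed: consider a log transform or a robust loss")

    if excess_kurtosis > 0:
        notes.append("heavier tails than normal: inspect outliers")
    elif excess_kurtosis < 0:
        notes.append("lighter tails than normal")

    if normality_p_value < alpha:
        notes.append("normality rejected: prefer nonparametric methods")
    else:
        notes.append("normality not rejected")
    return notes

# Rounded, illustrative inputs in the spirit of the two anchors
print(suggest_next_step(0.28, -1.5, 0.75))
print(suggest_next_step(2.56, 6.7, 0.02))
```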
Related Concepts and Honest Limitations
Related: The moments table connects directly to the moment-generating function used in probability theory. Skewness also affects how quickly the central limit theorem takes hold: the more skewed the data, the more samples are generally needed before the distribution of the sample mean is approximately normal. Kurtosis connects to value-at-risk in finance and gradient stability in deep learning.
Limitations:
- With small samples (n < 30), skewness and kurtosis estimates have high variance. The Jarque-Bera test is asymptotic and unreliable for small n — prefer Shapiro-Wilk when n < 50.
- Skewness and kurtosis describe the marginal distribution of one variable. They say nothing about multivariate structure, dependencies, or conditional distributions.
- A skewness of 0 does not guarantee symmetry: a distribution can be asymmetric yet have cubed deviations that cancel. Similarly, zero excess kurtosis does not mean the distribution is normal.
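The last limitation is easy to demonstrate. The sketch below uses a hypothetical ten-value dataset, built for illustration, whose cubed deviations cancel exactly even though the data is clearly not symmetric around its mean.

```python
# Hypothetical data: four values at -2, five at 1, one at 3
data = [-2] * 4 + [1] * 5 + [3]

n = len(data)
mu = sum(data) / n                              # 0.0
m3 = sum((x - mu) ** 3 for x in data) / n       # third central moment: 0.0

# A symmetric dataset looks the same when reflected around its mean
reflected = sorted(2 * mu - x for x in data)
is_symmetric = sorted(data) == reflected

print("mean =", mu)
print("third central moment =", m3)
print("symmetric around the mean?", is_symmetric)  # False
```

The left side is a short, heavy cluster at −2 and the right side stretches out to 3, yet −32 + 5 + 27 = 0, so the skewness is exactly zero.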
Test Your Understanding
- You have model latency data with x̄ = 200 ms and median = 140 ms. Without computing skewness, what sign do you expect it to have, and why?
- You remove the 890 ms outlier from the latency data, leaving seven values. How would you expect the skewness and excess kurtosis to change? Would the Jarque-Bera test still reject normality?
- Two distributions have γ₁ = 0.0 (zero skewness). Does this guarantee they are identical in shape? Construct a counterexample or explain why not.
- A colleague says: "This loss distribution has very high kurtosis, so it has a very sharp peak." What is wrong with this statement, and what does high kurtosis actually tell you about the loss distribution?
- You are training a model and notice that the gradient distribution has excess kurtosis of 8.5. What does this suggest about the training dynamics, and what practical step would you take to address it?