
Skewness and Kurtosis

Jun 21, 2026•14 min read•By Mohammed Vasim

Statistics • Math • Data Science

Mean and standard deviation describe only two properties of a distribution — center and spread. They say nothing about symmetry or tail weight. Two distributions can share the same mean and SD yet look completely different. Skewness and kurtosis fill that gap.

Anchors used throughout this post:

```python
accuracy   = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]  # six CV fold scores
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]  # model inference latency (ms)
```

The two anchors live on very different scales, and their shapes differ just as much: the accuracy data is near-symmetric, while the latency data is right-skewed with a long tail driven by the 890 ms spike.

Why Mean ± SD Is Not Enough

Consider two sets of six model performance scores. Both have a mean of about 0.84 and a sample standard deviation between 0.05 and 0.07:

Set A: [0.82, 0.79, 0.91, 0.85, 0.78, 0.88] (x̄ = 0.838, s = 0.051): values cluster evenly around the mean
Set B: [0.80, 0.80, 0.81, 0.82, 0.84, 0.97] (x̄ = 0.840, s = 0.065): most values are below the mean; one large outlier pulls it up

Which model would you prefer in production? Set A is more predictable. Set B's mean is almost the same, but you cannot trust it as a typical value: four of its six scores sit below it. That hidden asymmetry is skewness. The heavier tail risk is kurtosis.
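The comparison can be checked directly. The following is a minimal sketch using only Python's standard library; the names `set_a`, `set_b`, and `summarize` are illustrative choices, not part of any library.

```python
import statistics

set_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
set_b = [0.80, 0.80, 0.81, 0.82, 0.84, 0.97]

def summarize(xs):
    """Return (mean, sample SD, median) for a list of numbers."""
    return statistics.mean(xs), statistics.stdev(xs), statistics.median(xs)

for name, xs in (("Set A", set_a), ("Set B", set_b)):
    mean, sd, med = summarize(xs)
    # A mean sitting well above the median hints at a right tail.
    print(f"{name}: mean={mean:.3f}  sd={sd:.3f}  "
          f"median={med:.3f}  mean-median={mean - med:+.3f}")
```

The mean-minus-median gap is a quick first hint of asymmetry; the skewness statistic quantifies it properly.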

[Figure: three distributions, all centered at the same mean. Caption: Mean ± SD cannot distinguish these three shapes.]

The Four Moments

Skewness and kurtosis are the 3rd and 4th standardized moments of a distribution. Every moment captures a different property:

Moment	Name	Formula	What It Measures
1st	Mean	μ = E[X]	Center
2nd	Variance	σ² = E[(X−μ)²]	Spread
3rd	Skewness	γ₁ = E[(X−μ)³] / σ³	Asymmetry
4th	Kurtosis	Kurt[X] = E[(X−μ)⁴] / σ⁴ (excess kurtosis γ₂ = Kurt[X] − 3)	Tail weight

Why cube for skewness: Cubing preserves the sign of the deviation. A positive deviation cubed stays positive; a negative one stays negative. A distribution with a heavier right tail produces many large positive cubed deviations that are not cancelled by the smaller negative ones → net positive skewness.

Why 4th power for kurtosis: The 4th power amplifies large deviations enormously: an outlier at 3σ contributes 3⁴ = 81 times as much as a point at σ. A distribution with heavy tails (many large deviations) accumulates a large 4th moment → high kurtosis.
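As a rough illustration of the table, the sketch below computes standardized moments in their population (divide-by-n) form. The helper name `standardized_moment` is hypothetical. Because no small-sample correction is applied here, the values differ slightly from the corrected sample formulas used later.

```python
import math

def standardized_moment(xs, k):
    """k-th standardized moment: mean of ((x - mean) / sigma) ** k, population form."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sigma) ** k for x in xs) / n

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

print(f"3rd standardized moment (skewness): {standardized_moment(accuracy, 3):.4f}")
print(f"4th standardized moment (kurtosis): {standardized_moment(accuracy, 4):.4f}")

# Odd powers keep the sign of a deviation; even powers discard it.
print((-2) ** 3, (-2) ** 4)   # -8 16
# A 3-sigma point contributes 3**4 times as much to the 4th moment as a 1-sigma point.
print(3 ** 4)                 # 81
```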

Skewness

Sample skewness formula:

γ₁ = [n / ((n−1)(n−2))] × Σ[(xᵢ − x̄)/s]³

The correction factor n/((n−1)(n−2)) adjusts for small-sample bias in the same spirit as Bessel's correction.

Phase 1 — Compute x̄ and s

Using the accuracy anchor:

x̄ = (0.82 + 0.79 + 0.91 + 0.85 + 0.78 + 0.88) / 6 = 5.03 / 6 = 0.8383
s  = √[Σ(xᵢ − x̄)² / (n−1)] = √(0.013083 / 5) = 0.0512

Phase 2 — Standardize each value

Compute (xᵢ − x̄) / s for each fold:

Fold	xᵢ	xᵢ − x̄	(xᵢ − x̄)/s	[(xᵢ − x̄)/s]³
1	0.82	−0.0183	−0.358	−0.0460
2	0.79	−0.0483	−0.945	−0.8436
3	0.91	+0.0717	+1.401	+2.7500
4	0.85	+0.0117	+0.228	+0.0119
5	0.78	−0.0583	−1.140	−1.4830
6	0.88	+0.0417	+0.815	+0.5404
			Sum	0.9297

The cubes are computed from the unrounded standardized values, so they can differ in the last digit from cubing the three-decimal figures shown.

Phase 3 — Apply correction factor

n / ((n−1)(n−2)) = 6 / (5 × 4) = 6/20 = 0.30
γ₁ = 0.30 × 0.9297 = 0.2789 ≈ 0.28
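The three phases can be sketched as one small function. This is a minimal standard-library version, assuming the bias-adjusted formula given above; the name `sample_skewness` is illustrative.

```python
import math

def sample_skewness(xs):
    """Bias-adjusted sample skewness: n / ((n-1)(n-2)) * sum(z ** 3)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(xs) / n                                          # Phase 1: mean
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))   # Phase 1: sample SD
    cubed_sum = sum(((x - mean) / s) ** 3 for x in xs)          # Phase 2: standardize, cube
    return n / ((n - 1) * (n - 2)) * cubed_sum                  # Phase 3: correction factor

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
print(round(sample_skewness(accuracy), 2))  # 0.28
```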

The figure below shows both datasets on a number line. For accuracy, values cluster nearly symmetrically around the mean. For latency, values pile up on the left with a long right tail:

[Figure: number-line plots of both anchors. accuracy (axis 0.78 to 0.91): x̄ = 0.838, median = 0.835. latency_ms, right-skewed (axis 120 to 890): x̄ = 261, median = 151, with the 890 ms spike marked as an extreme outlier. Legend: values below mean, values above mean, extreme outlier, mean, median.]

Notice that in the latency data, the mean (261) is pulled far right of the median (151) by the 890 ms spike. In right-skewed data, the mean usually sits above the median.

Interpretation Scale

|γ₁|	Label	Example
< 0.5	Approximately symmetric	CV accuracy scores
0.5 – 1.0	Moderately skewed	Daily website traffic
≥ 1.0	Highly skewed	Model latency, income, error counts

Positive skewness (right-skewed): long tail to the right, typically mean > median > mode.
Negative skewness (left-skewed): long tail to the left, typically mean < median < mode.
Examples of left-skewed data: exam scores on an easy test, survival times after effective treatment.
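These orderings describe the typical case, not a guarantee. The sketch below uses a small made-up dataset (the values are purely illustrative, chosen to make the point) whose third central moment is positive even though its mean lies below its median.

```python
import statistics

# Hypothetical data: a short, heavy left side and one long right-tail value.
data = [-1] * 4 + [0] * 5 + [3]

mean = statistics.mean(data)
median = statistics.median(data)
# The sign of the third central moment is the sign of the skewness.
third_central = sum((x - mean) ** 3 for x in data) / len(data)

print(f"mean={mean}, median={median}")
print(f"positive skew: {third_central > 0}, mean below median: {mean < median}")
```

The single value at 3 dominates the cubed deviations, while the four values at −1 are enough to drag the mean below the median.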

The Mean-Median Rule — and Its Limits

"In right-skewed data, mean > median" is often stated as if it were a theorem. It is not. It is a rule of thumb (associated with Pearson) that holds for most common unimodal, continuous distributions. Counterexamples exist: a distribution can have positive skewness but mean < median, for example when a long but thin right tail is outweighed by a short, heavy left side, or in multimodal and discrete distributions. The correct statement is: skewness measures the asymmetry of the distribution via signed cubed deviations, not the mean-median gap.

When Skewness Matters for ML

Log-normal features (latency, income): apply a log transform before modeling to reduce right skew and make the distribution closer to normal, which helps linear models and distance-based methods.
Right-skewed targets: MSE loss squares errors, so the large squared errors from the right tail dominate the gradient. Huber loss or MAE is more robust when the target is skewed.
Skewed features in linear models: high skewness distorts coefficient estimates and feature importance. PCA is sensitive to skewness because the variance calculation is dominated by the stretched tail.
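To make the log-transform point concrete, here is a minimal sketch that compares the bias-adjusted skewness of the latency anchor before and after taking natural logs. The helper `sample_skewness` is an illustrative name, and the natural log is one common choice among several possible transforms.

```python
import math

def sample_skewness(xs):
    """Bias-adjusted sample skewness (same formula as in the Skewness section)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
log_latency = [math.log(x) for x in latency_ms]   # requires strictly positive values

raw_skew = sample_skewness(latency_ms)
log_skew = sample_skewness(log_latency)
print(f"raw skewness: {raw_skew:.2f}")
print(f"log skewness: {log_skew:.2f}")
```

With only eight points and one extreme spike, the transform reduces the skew rather than removing it, so it is worth re-checking the statistic after transforming instead of assuming the result is symmetric.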

Kurtosis and Excess Kurtosis

Kurtosis measures how much probability mass sits in the tails relative to a normal distribution with the same mean and variance.

Sample kurtosis formula:

γ₂ = [n(n+1) / ((n−1)(n−2)(n−3))] × Σ[(xᵢ − x̄)/s]⁴ − 3(n−1)² / ((n−2)(n−3))

The −3 term: the normal distribution has a raw kurtosis of exactly 3. Subtracting 3 gives excess kurtosis, so the normal has excess kurtosis = 0. In the sample formula, the subtracted term 3(n−1)² / ((n−2)(n−3)) plays this role and approaches 3 as n grows. Reporting excess kurtosis is the default in scipy, pandas, and most statistical software. Always confirm which convention a library uses before interpreting the number.
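A direct transcription of the sample formula above, as a minimal standard-library sketch. The function name `sample_excess_kurtosis` is illustrative.

```python
import math

def sample_excess_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis, following the formula term by term."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))      # sample SD
    fourth_sum = sum(((x - mean) / s) ** 4 for x in xs)            # sum of z ** 4
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    offset = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))                # tends to 3 as n grows
    return scale * fourth_sum - offset

accuracy   = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

print(f"accuracy:   {sample_excess_kurtosis(accuracy):.2f}")
print(f"latency_ms: {sample_excess_kurtosis(latency_ms):.2f}")
```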

Three Categories

Category	Excess Kurtosis	Tails	Shape	Example
Mesokurtic	≈ 0	Normal weight	Standard bell curve	Normal distribution
Leptokurtic	> 0	Heavier than normal	Tall peak, fat tails	Financial returns, t-distribution
Platykurtic	< 0	Lighter than normal	Flatter peak, thin tails	Uniform distribution

[Figure: three density curves with the same mean and variance. Leptokurtic (excess kurtosis > 0): heavy tails, labeled "fat tail" on both sides. Mesokurtic (excess kurtosis ≈ 0): normal. Platykurtic (excess kurtosis < 0): light tails.]
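The three categories can be seen in simulated data. The sketch below draws from a uniform, a normal, and a Laplace distribution (theoretical excess kurtosis −1.2, 0, and 3 respectively) and estimates excess kurtosis from each sample. The seed, the sample size, and the choice of Laplace as the heavy-tailed example are assumptions made for illustration.

```python
import random

def excess_kurtosis(xs):
    """Population-form excess kurtosis: m4 / m2**2 - 3."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

rng = random.Random(0)   # fixed seed so repeated runs give the same draws
n = 50_000

uniform = [rng.uniform(-1, 1) for _ in range(n)]
normal = [rng.gauss(0, 1) for _ in range(n)]
# The difference of two independent Exp(1) draws follows a Laplace distribution.
laplace = [rng.expovariate(1) - rng.expovariate(1) for _ in range(n)]

k_uniform = excess_kurtosis(uniform)
k_normal = excess_kurtosis(normal)
k_laplace = excess_kurtosis(laplace)

print(f"platykurtic (uniform): {k_uniform:+.2f}")
print(f"mesokurtic  (normal):  {k_normal:+.2f}")
print(f"leptokurtic (Laplace): {k_laplace:+.2f}")
```

The estimates fluctuate around the theoretical values from run to run if the seed is changed, with the heavy-tailed sample fluctuating the most.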

Kurtosis ≠ Peakedness

The common description — "kurtosis measures how peaked a distribution is" — is wrong. Kurtosis measures tail weight. A leptokurtic distribution can have a lower peak than the normal distribution while still having heavier tails and higher kurtosis. The excess probability mass moves from the shoulders of the distribution into the tails; the peak height is a byproduct, not the cause. Describing kurtosis as peakedness leads to misinterpretation — you should watch the tails, not the top.

When Kurtosis Matters for ML

Heavy tails (leptokurtic) during training: gradient updates from outlier batches can be explosive. Loss distributions during training instability are often leptokurtic. Gradient clipping addresses this.
t-distribution: leptokurtic with excess kurtosis = 6 / (df − 4) for df > 4. As df → ∞, it converges to the normal (excess kurtosis → 0). Small-sample t-tests rely on this.
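Normality tests: Jarque-Bera and D'Agostino's K² combine skewness and kurtosis to detect departures from normality. Shapiro-Wilk takes a different route, comparing the ordered sample with the order statistics expected under normality.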
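The t-distribution formula in the list above is easy to tabulate. A minimal sketch, with `t_excess_kurtosis` as an illustrative name:

```python
def t_excess_kurtosis(df):
    """Excess kurtosis of Student's t-distribution, 6 / (df - 4), valid for df > 4."""
    if df <= 4:
        raise ValueError("excess kurtosis of the t-distribution is finite only for df > 4")
    return 6 / (df - 4)

# The tails thin out toward the normal (excess kurtosis 0) as df grows.
for df in (5, 10, 30, 100, 1000):
    print(f"df={df:>4}: excess kurtosis = {t_excess_kurtosis(df):.4f}")
```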

Jarque-Bera Normality Test

Both skewness and kurtosis are combined into a single normality test:

JB = (n/6) × [γ₁² + (γ₂/2)²]

Under H₀ (data is normally distributed), skewness = 0 and excess kurtosis = 0, so JB ≈ 0. JB follows a chi-square distribution with 2 degrees of freedom asymptotically. A large JB → small p-value → reject normality.
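The formula can be evaluated by hand before reaching for a library. This is a minimal standard-library sketch, assuming the uncorrected (divide-by-n) moment estimators with which the Jarque-Bera statistic is conventionally defined; `jarque_bera_manual` is an illustrative name. For 2 degrees of freedom the chi-square tail probability has the closed form exp(−JB/2), so no statistics library is needed.

```python
import math

def jarque_bera_manual(xs):
    """Return (JB, p) with JB = n/6 * (S**2 + (K/2)**2), S and K from divide-by-n moments."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3
    jb = n / 6 * (skew ** 2 + (excess_kurt / 2) ** 2)
    p = math.exp(-jb / 2)   # chi-square survival function for 2 degrees of freedom
    return jb, p

accuracy   = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

for name, xs in (("accuracy", accuracy), ("latency_ms", latency_ms)):
    jb, p = jarque_bera_manual(xs)
    print(f"{name}: JB={jb:.4f}, p={p:.4f}")
```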

Applied to both anchors:

```python
from scipy import stats

accuracy   = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

# jarque_bera() is built on the uncorrected (divide-by-n) skewness and kurtosis
jb_acc, p_acc = stats.jarque_bera(accuracy)
jb_lat, p_lat = stats.jarque_bera(latency_ms)

print(f"accuracy:   JB={jb_acc:.4f}, p={p_acc:.4f}")
print(f"latency_ms: JB={jb_lat:.4f}, p={p_lat:.4f}")

# skewness and excess kurtosis separately;
# bias=False applies the small-sample corrections from the formulas above
print(f"\naccuracy   skewness={stats.skew(accuracy, bias=False):.4f}, "
      f"kurtosis(excess)={stats.kurtosis(accuracy, bias=False):.4f}")
print(f"latency_ms skewness={stats.skew(latency_ms, bias=False):.4f}, "
      f"kurtosis(excess)={stats.kurtosis(latency_ms, bias=False):.4f}")
```

```
accuracy:   JB=0.5093, p=0.7752
latency_ms: JB=7.7236, p=0.0210

accuracy   skewness=0.2789, kurtosis(excess)=-1.4898
latency_ms skewness=2.5560, kurtosis(excess)=6.7008
```

Interpret:

accuracy: JB=0.51, p=0.78, so do not reject normality. Skewness (0.28) is well within the symmetric range; the negative excess kurtosis (−1.49) reflects fewer tail events than a normal, which makes sense for six bounded CV scores.
latency_ms: JB=7.72, p=0.02, so reject normality at the 5% level. Skewness=2.56 (highly right-skewed), excess kurtosis=6.70 (very heavy tails driven by the 890 ms spike). Apply a log transform or use nonparametric methods before any normality-assuming analysis.

Note: scipy's stats.kurtosis() returns excess kurtosis (Fisher's definition, subtracting 3). By default, stats.skew() and stats.kurtosis() use the uncorrected estimators (bias=True), and stats.jarque_bera() is built on those. Passing bias=False applies the small-sample corrections, which is why the skewness printed here agrees with the hand calculation (≈ 0.28). Pandas .skew() and .kurt() return the bias-corrected values by default, and .kurt() reports excess kurtosis.

Computing Skewness for the Latency Anchor

For completeness, here is the latency skewness breakdown:

x̄_lat = (120+135+142+148+155+189+312+890) / 8 = 2091 / 8 = 261.4
s_lat  = √(Σ(xᵢ − 261.4)² / 7) = 261.1
median = (148 + 155) / 2 = 151.5

xᵢ	xᵢ − x̄	(xᵢ − x̄)/s	[(xᵢ − x̄)/s]³
120	−141.4	−0.541	−0.1587
135	−126.4	−0.484	−0.1133
142	−119.4	−0.457	−0.0955
148	−113.4	−0.434	−0.0818
155	−106.4	−0.407	−0.0676
189	−72.4	−0.277	−0.0213
312	+50.6	+0.194	+0.0073
890	+628.6	+2.407	+13.950
		Sum	13.419

Correction = 8 / (7 × 6) = 0.1905
γ₁ = 0.1905 × 13.419 = 2.56

The 890 ms entry alone contributes 13.950 / 13.419 ≈ 104% of the total. Its share exceeds 100% because the other seven terms sum to a small negative number. Removing it lowers the skewness, but the remaining seven values are still clearly right-skewed, because the 312 ms entry then becomes the extreme point.
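A minimal sketch for checking that sensitivity directly: it recomputes the bias-adjusted skewness with and without the 890 ms entry. The helper name `sample_skewness` is illustrative.

```python
import math

def sample_skewness(xs):
    """Bias-adjusted sample skewness: n / ((n-1)(n-2)) * sum(z ** 3)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
without_spike = [x for x in latency_ms if x != 890]

skew_with = sample_skewness(latency_ms)
skew_without = sample_skewness(without_spike)
print(f"with 890 ms:    {skew_with:.2f}")
print(f"without 890 ms: {skew_without:.2f}")
```

Comparing the two printed values shows how far a single point moves the statistic. Note that once the spike is gone, 312 ms is the most extreme remaining value.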

Practical Workflow

When you receive a new dataset, follow this sequence before choosing any statistical method:

The workflow: plot first (skewness is a confirmation, not a substitute for visual inspection) → quantify asymmetry → quantify tail weight → run a formal normality test → choose the right method.
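The numeric steps of this workflow (everything after the plot) can be sketched as one helper. This is a minimal standard-library version under stated assumptions: the name `shape_report` and the returned keys are hypothetical, the labels follow the interpretation scale given earlier, skewness and kurtosis use the small-sample corrected forms, and the Jarque-Bera statistic uses the uncorrected moments.

```python
import math

def shape_report(xs):
    """Quantify asymmetry and tail weight, then run a Jarque-Bera check."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(xs) / n
    m2, m3, m4 = (sum((x - mean) ** k for x in xs) / n for k in (2, 3, 4))
    g1 = m3 / m2 ** 1.5        # uncorrected skewness
    g2 = m4 / m2 ** 2 - 3      # uncorrected excess kurtosis

    # Small-sample corrected versions (algebraically equal to the sample formulas)
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

    jb = n / 6 * (g1 ** 2 + (g2 / 2) ** 2)
    p_value = math.exp(-jb / 2)   # chi-square tail probability, 2 degrees of freedom

    if abs(skew) < 0.5:
        label = "approximately symmetric"
    elif abs(skew) < 1.0:
        label = "moderately skewed"
    else:
        label = "highly skewed"

    return {"skewness": skew, "label": label,
            "excess_kurtosis": kurt, "jb": jb, "p_value": p_value}

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
for key, value in shape_report(latency_ms).items():
    print(f"{key}: {value}")
```

The report is a complement to the plot, not a replacement: with samples this small, the labels and the p-value are rough guides only.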

Related Concepts and Honest Limitations

Related: The moments table connects directly to the moment-generating function used in probability theory. Skewness also affects how quickly the central limit theorem takes hold: the more skewed the population, the more samples you need before the distribution of the sample mean is approximately normal. Kurtosis connects to value-at-risk in finance and gradient stability in deep learning.

Limitations:

With small samples (n < 30), skewness and kurtosis estimates have high variance. The Jarque-Bera test is asymptotic and unreliable for small n — prefer Shapiro-Wilk when n < 50.
Skewness and kurtosis describe the marginal distribution of one variable. They say nothing about multivariate structure, dependencies, or conditional distributions.
A skewness of 0 does not guarantee symmetry: a distribution can be asymmetric yet have cubed deviations that cancel. Similarly, zero excess kurtosis does not mean the distribution is normal.
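The last limitation is easy to demonstrate. The sketch below uses a small constructed dataset (hypothetical values, chosen so that the cubed deviations cancel exactly) that has zero skewness but is not symmetric about its mean.

```python
# Hypothetical data: four values at -2, five at 1, one at 3.
data = [-2] * 4 + [1] * 5 + [3]

n = len(data)
mean = sum(data) / n
third_central = sum((x - mean) ** 3 for x in data) / n

# A symmetric dataset is unchanged when every point is reflected through the mean.
mirrored = sorted(2 * mean - x for x in data)
is_symmetric = mirrored == sorted(data)

print(f"mean={mean}, third central moment={third_central}, symmetric={is_symmetric}")
```

Because the third central moment is zero, the skewness is zero as well, yet the values run from −2 on one side to 3 on the other.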

Test Your Understanding

You have model latency data with x̄ = 200 ms and median = 140 ms. Without computing skewness, what sign do you expect it to have, and why?
You remove the 890 ms outlier from the latency data, leaving seven values. How would you expect the skewness and excess kurtosis to change? Would the Jarque-Bera test still reject normality?
Two distributions have γ₁ = 0.0 (zero skewness). Does this guarantee they are identical in shape? Construct a counterexample or explain why not.
A colleague says: "This loss distribution has very high kurtosis, so it has a very sharp peak." What is wrong with this statement, and what does high kurtosis actually tell you about the loss distribution?
You are training a model and notice that the gradient distribution has excess kurtosis of 8.5. What does this suggest about the training dynamics, and what practical step would you take to address it?
