Power Law Distribution

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Statistics • Math • Data Science

Most words in any language are rare. A handful of words — "the", "of", "and" — appear in nearly every sentence, while tens of thousands of words appear only a few times per million tokens. This concentration pattern isn't a quirk — it's a mathematical law, and it governs phenomena from web traffic to neural network weight magnitudes. The power law distribution models it.

The DS/ML Anchor

Top word frequencies from a 10M-token training corpus (representing Zipf's law):

Rank	Frequency	Rank	Frequency
1	450,000	11	40,000
2	280,000	12	35,000
3	190,000	13	31,000
4	140,000	14	28,000
5	110,000	15	25,000
6	88,000	16	22,000
7	73,000	17	20,000
8	62,000	18	18,000
9	53,000	19	16,000
10	46,000	20	15,000

The most frequent word (rank 1) appears ~30× more often than the 20th most frequent word.

What a Power Law Is

A power law describes a relationship where one quantity varies as a power of another:

Discrete (rank-frequency): f(r) ∝ r^{-α} — Zipf's law

Continuous PDF: f(x) = ((α−1)/x_min) × (x/x_min)^{-α} for x ≥ x_min

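The exponent α controls how fast the distribution decays. For the continuous form, values reported for natural phenomena typically fall in α ∈ (2, 4); the rank-frequency exponent in Zipf's law is a different quantity and is usually close to 1.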

The key property: scale invariance. Scaling x by a constant c changes the density only by a constant factor that does not depend on x:

f(cx) ∝ (cx)^{-α} = c^{-α} × x^{-α}, so f(cx) = c^{-α} × f(x)

The shape looks the same at every scale — there is no characteristic scale. This is fundamentally different from exponential or Normal distributions:

Exponential: event sizes are set by the mean 1/λ, and events far above it are exponentially rare. The mean is a natural scale.
Power law: there is no typical scale. An event 10× larger is only 10^α times rarer, whatever size you start from, rather than exponentially rarer. Very large events happen with non-trivial probability.
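The contrast can be checked numerically. The sketch below is a minimal illustration, assuming α = 2.5, x_min = 1, and λ = 1 (values picked for the example, not taken from the anchor data). It evaluates the density ratio f(10x)/f(x) at several x: for the power law the ratio stays at 10^{-α}, while for the exponential it collapses as x grows.

```python
import numpy as np

ALPHA, X_MIN, LAM = 2.5, 1.0, 1.0  # illustrative values, not fitted to any data


def power_law_pdf(x, alpha=ALPHA, x_min=X_MIN):
    """Continuous power-law density for x >= x_min."""
    return (alpha - 1) / x_min * (x / x_min) ** (-alpha)


def exponential_pdf(x, lam=LAM):
    """Exponential density with rate lam."""
    return lam * np.exp(-lam * x)


x = np.array([1.0, 5.0, 25.0])
c = 10.0

# Ratio of densities after rescaling x by the constant c
power_ratio = power_law_pdf(c * x) / power_law_pdf(x)
exp_ratio = exponential_pdf(c * x) / exponential_pdf(x)

for xi, pr, er in zip(x, power_ratio, exp_ratio):
    print(f"x = {xi:5.1f}  power law: {pr:.5f}  exponential: {er:.3e}")
print(f"c^(-alpha) = {c ** -ALPHA:.5f}")
```

The power-law column is identical on every row, which is what "no characteristic scale" means in practice.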

Power Laws in DS/ML

1. Word frequency (Zipf's law). Frequency of the rth most common word ∝ 1/r^α with α≈1. The most common word is ~2× more frequent than the 2nd, ~10× more frequent than the 10th. This drives tokenizer design, embedding training, and class imbalance in NLP.

2. LLM weight magnitudes. After training, weight magnitudes in large language models are often heavy-tailed, and the tail is sometimes modeled as a power law: most weights are near zero and a few are large. Quantization and pruning strategies exploit this structure.

3. User activity distributions. Most users make few requests; a tiny fraction make enormously many. With α = 2.5, for example, the top 1% of users generate about 22% of expected traffic, and that share grows as α approaches 2. This matters for capacity planning, rate limiting, and recommender systems.

4. Error counts in distributed systems. Cascading failures produce heavy-tailed error counts. A small number of failure events cause most of the damage. Designing for only "average" failure rates misses the rare catastrophic events that actually cause outages.

5. Graph degree distributions. In social networks, citation networks, and the web, node degrees (number of connections) follow a power law. A few "hubs" have vastly more connections than average — the Barabási-Albert model of preferential attachment produces this.
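To put a number on the traffic concentration in item 3: under a continuous power law with α > 2, the heaviest fraction p of users accounts for a share p^{(α−2)/(α−1)} of expected traffic. The sketch below evaluates that expression and compares it with simulated request counts. The function name, α = 2.5, and the sample size are assumptions made for illustration.

```python
import numpy as np


def top_share(p, alpha):
    """Expected share of the total held by the top fraction p (requires alpha > 2)."""
    if alpha <= 2:
        raise ValueError("the mean is infinite for alpha <= 2, so the share is undefined")
    return p ** ((alpha - 2) / (alpha - 1))


alpha = 2.5   # assumed exponent, for illustration only
p = 0.01      # top 1% of users

# Simulate per-user request counts by inverse-transform sampling (x_min = 1)
rng = np.random.default_rng(0)
u = 1.0 - rng.random(1_000_000)            # uniform on (0, 1]
requests = u ** (-1.0 / (alpha - 1))

cutoff = np.quantile(requests, 1 - p)
empirical = requests[requests >= cutoff].sum() / requests.sum()

print(f"theoretical top-1% share: {top_share(p, alpha):.3f}")
print(f"simulated top-1% share:   {empirical:.3f}")
```

Because the variance is infinite at α = 2.5, the simulated share moves noticeably from seed to seed even with a million users, which is itself a useful warning for capacity planning.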

Zipf's Law on the Anchor

Zipf's law: frequency ∝ 1/rank^α. For exact Zipf (α=1): freq(r=1)/freq(r=k) = k.

Ratio verification on anchor:

Ratio	Expected (exact Zipf)	Observed
freq(1) / freq(2)	2.00	450k/280k = 1.61
freq(1) / freq(5)	5.00	450k/110k = 4.09
freq(1) / freq(10)	10.00	450k/46k = 9.78

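The low-rank ratios fall slightly short of exact Zipf, but the tail decays faster: freq(1)/freq(20) = 30, where exact Zipf predicts 20. Fitting log(freq) ~ slope × log(rank) across all 20 ranks gives slope ≈ −1.19, so the estimated α ≈ 1.19 for this corpus.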

The Log-Log Test: Hallmark of a Power Law

In log-log space, a power law becomes a straight line:

log(frequency) = log(C) − α × log(rank)

This is a linear equation in log-log coordinates. The slope = −α. Log-log linearity is necessary but not sufficient for a power law — many distributions look linear over a limited range.
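As a minimal illustration of why linearity alone is weak evidence, the sketch below fits a straight line to the log-log density of a log-normal distribution over two decades of x. The parameters (σ = 3, range 1 to 100) are assumptions chosen for the example. The straight-line fit is very good even though the distribution is not a power law.

```python
import numpy as np
from scipy import stats

# Log-normal density evaluated on a log-spaced grid covering two decades
x = np.logspace(0, 2, 200)
log_x = np.log10(x)
log_pdf = stats.lognorm.logpdf(x, s=3.0) / np.log(10)   # convert natural log to log10

fit = stats.linregress(log_x, log_pdf)
print(f"apparent exponent: {-fit.slope:.2f}")
print(f"R^2 of the straight-line fit: {fit.rvalue ** 2:.4f}")
```

The curvature only becomes visible when the range is extended by several more decades, which real datasets rarely allow.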

Mean and Variance — When They Exist

This is the most important and frequently misunderstood feature of power laws:

α value	Mean	Variance	CLT applies?	Approach
α ≤ 2	Infinite	Infinite	No	Extreme value theory
2 < α ≤ 3	Finite	Infinite	Not in classical form	Robust statistics
α > 3	Finite	Finite	Yes	Standard statistics

For a continuous power law with x ≥ x_min:

$E[X] = \frac{x_{\min}(\alpha - 1)}{\alpha - 2}$, finite only if $\alpha > 2$

$\mathrm{Var}(X) = \frac{x_{\min}^{2}(\alpha - 1)}{(\alpha - 2)^{2}(\alpha - 3)}$, finite only if $\alpha > 3$
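The two formulas and their validity conditions can be wrapped in a small helper. This is a sketch, not a library function: the name power_law_moments is hypothetical, and it returns math.inf wherever the corresponding moment diverges.

```python
import math


def power_law_moments(alpha, x_min=1.0):
    """Return (mean, variance) of a continuous power law; math.inf where a moment diverges."""
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for the density to normalize")
    mean = x_min * (alpha - 1) / (alpha - 2) if alpha > 2 else math.inf
    if alpha > 3:
        variance = x_min**2 * (alpha - 1) / ((alpha - 2) ** 2 * (alpha - 3))
    else:
        variance = math.inf
    return mean, variance


for alpha in (1.8, 2.5, 3.5):
    mean, variance = power_law_moments(alpha)
    print(f"alpha = {alpha}: mean = {mean:.3f}, variance = {variance:.3f}")
```

The three exponents correspond to the three rows of the table: no finite moments, a finite mean only, and both moments finite.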

Implication for ML and statistics:

When α ≤ 2 (some web traffic distributions): the sample mean does not converge to a stable value. Every new batch of data can produce a new extreme event that shifts the mean. Classical statistics breaks down entirely.

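When 2 < α ≤ 3: the mean exists, but the variance does not. The sample mean converges only slowly and the sample standard deviation never settles, so the usual s/√n standard error is unreliable. Confidence intervals on the mean may be meaningless even with millions of observations.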

For the token frequency anchor, the fitted α ≈ 1.19 is the Zipf rank exponent, which is not the same as the continuous power-law α for counts. The distribution of frequencies itself has a heavy tail that makes the arithmetic mean a poor summary statistic.
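The two exponents are linked, which is worth working through once. A short derivation, assuming an idealized rank-frequency law and writing s for the rank exponent (a symbol introduced here only to keep it apart from the density exponent α):

$f(r) = C r^{-s} \;\Rightarrow\; r(f) = (f / C)^{-1/s}$

The rank r(f) counts the words with frequency at least f, so the tail of the count distribution and its density are

$P(F \ge f) \propto f^{-1/s}, \qquad p(f) \propto f^{-(1 + 1/s)}$

$\alpha = 1 + \frac{1}{s}$

Exact Zipf (s = 1) therefore corresponds to α = 2 for the counts, right at the boundary where the mean stops being finite.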

Estimating α — MLE, Not Log-Log Regression

Do NOT use the slope of a log-log regression to estimate α. It gives biased estimates. Use the Clauset-Shalizi-Newman (CSN) maximum likelihood estimator:

$\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}$

This requires first estimating x_min — the threshold above which the power law holds. x_min is chosen by minimizing the KS distance between the empirical distribution and the fitted power law.
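The estimator itself is a one-liner once x_min is fixed. The sketch below is a simplified illustration that assumes x_min is already known, so it skips the KS-based threshold search. It draws synthetic data with a known exponent and checks that the MLE recovers it; the helper names and the true α = 2.5 are assumptions for the example.

```python
import numpy as np


def sample_power_law(n, alpha, x_min, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    u = 1.0 - rng.random(n)                      # uniform on (0, 1], avoids division by zero
    return x_min * u ** (-1.0 / (alpha - 1))


def alpha_mle(x, x_min):
    """Continuous power-law MLE for alpha, using only observations >= x_min."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= x_min]
    alpha_hat = 1.0 + tail.size / np.sum(np.log(tail / x_min))
    std_err = (alpha_hat - 1.0) / np.sqrt(tail.size)   # asymptotic standard error
    return alpha_hat, std_err


rng = np.random.default_rng(42)
data = sample_power_law(50_000, alpha=2.5, x_min=1.0, rng=rng)   # synthetic, true alpha = 2.5
alpha_hat, std_err = alpha_mle(data, x_min=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {std_err:.3f}")
```

On real data the threshold is unknown, and a wrong x_min biases the estimate, which is why the threshold search matters.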

Use the powerlaw Python package for proper estimation:

python

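# pip install powerlaw
import numpy as np
import powerlaw

# Synthetic stand-in for real observations: power law with α = 2.5, x_min = 1
data = (1.0 - np.random.default_rng(0).random(5000)) ** (-1.0 / 1.5)
results = powerlaw.Fit(data)  # picks x_min by KS minimization, then fits α by MLE
print(results.power_law.alpha, results.power_law.xmin)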

Testing Power Law vs Alternatives

A straight line in log-log space is necessary but not sufficient. Many distributions look linear in log-log over a limited range.

Alternatives to test:

Log-normal: looks like a power law over a range, but eventually curves down (concave) in the extreme tail
Stretched exponential (Weibull with k<1): heavier tail than exponential, lighter than power law
Exponential: looks roughly linear in log-log only over a narrow range, then falls off far faster than any power law

Clauset et al. (2009) procedure:

1. Estimate α and x_min using MLE
2. Compute the KS statistic between the data and the fitted power law
3. Generate 1000 synthetic power-law datasets from the fit; refit each one and compute its KS statistic
4. p-value = fraction of synthetic datasets with KS > observed KS
5. If p > 0.1, the power law is plausible. That does NOT mean it's correct, only that it is not ruled out
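The procedure above can be sketched in a few lines. This is a simplified version under stated assumptions, not the full Clauset et al. method: x_min is treated as known instead of being re-estimated for every synthetic dataset, and only 200 synthetic datasets are drawn to keep it quick. All function names are hypothetical.

```python
import numpy as np


def fit_alpha(x, x_min):
    """Continuous power-law MLE for a fixed x_min."""
    return 1.0 + x.size / np.sum(np.log(x / x_min))


def ks_distance(x, alpha, x_min):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = np.sort(x)
    n = xs.size
    model_cdf = 1.0 - (xs / x_min) ** (-(alpha - 1))
    upper = np.arange(1, n + 1) / n - model_cdf     # empirical CDF just after each point
    lower = model_cdf - np.arange(0, n) / n         # empirical CDF just before each point
    return max(upper.max(), lower.max())


def gof_p_value(x, x_min, n_synthetic=200, seed=0):
    """Fraction of synthetic power-law datasets whose KS distance exceeds the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x[x >= x_min]
    alpha = fit_alpha(x, x_min)
    observed = ks_distance(x, alpha, x_min)
    exceed = 0
    for _ in range(n_synthetic):
        u = 1.0 - rng.random(x.size)
        synthetic = x_min * u ** (-1.0 / (alpha - 1))
        # Refit each synthetic dataset before measuring its KS distance
        if ks_distance(synthetic, fit_alpha(synthetic, x_min), x_min) > observed:
            exceed += 1
    return exceed / n_synthetic


rng = np.random.default_rng(1)
power_law_data = (1.0 - rng.random(1000)) ** (-1.0 / 1.5)    # true power law, alpha = 2.5
shifted_exp_data = 1.0 + rng.exponential(1.0, size=1000)      # not a power law

p_power = gof_p_value(power_law_data, x_min=1.0)
p_exp = gof_p_value(shifted_exp_data, x_min=1.0)
print(f"p-value, power-law data:   {p_power:.2f}")
print(f"p-value, exponential data: {p_exp:.2f}")
```

The exponential sample should be rejected with a p-value near zero. For a genuine power-law sample the p-value is roughly uniform on [0, 1], so it usually, but not always, lands above 0.1.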

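Likelihood ratio test to compare power law vs log-normal: compute the log-likelihood ratio R between the two fits. R > 0 means the power law fits better, R < 0 the log-normal; the accompanying p-value, from a z-test on the normalized ratio, says whether the sign of R is statistically significant. The powerlaw package implements this as a method on a fitted object: R, p = fit.distribution_compare('power_law', 'lognormal'), where fit = powerlaw.Fit(data).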
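To show the mechanics of the test without the package, here is a minimal sketch of the normalized likelihood ratio. Note the swap: the alternative here is an exponential shifted to start at x_min, not the log-normal, because both models then have closed-form MLEs. A log-normal comparison needs a numerically fitted truncated log-normal and is better left to the package. The function name is hypothetical.

```python
import math
import numpy as np


def vuong_power_law_vs_exponential(x, x_min):
    """Log-likelihood ratio R (power law minus exponential) and a two-sided p-value."""
    x = np.asarray(x, dtype=float)
    x = x[x >= x_min]
    n = x.size

    # Power law: closed-form MLE and pointwise log-likelihoods
    alpha = 1.0 + n / np.sum(np.log(x / x_min))
    ll_power = np.log((alpha - 1) / x_min) - alpha * np.log(x / x_min)

    # Exponential shifted to start at x_min: closed-form MLE for the rate
    lam = 1.0 / np.mean(x - x_min)
    ll_exp = np.log(lam) - lam * (x - x_min)

    diff = ll_power - ll_exp
    ratio = diff.sum()                               # R > 0 favors the power law
    z = ratio / (diff.std() * math.sqrt(n))
    p_value = math.erfc(abs(z) / math.sqrt(2))       # significance of the sign of R
    return ratio, p_value


rng = np.random.default_rng(7)
data = (1.0 - rng.random(2000)) ** (-1.0 / 1.5)      # synthetic power law, alpha = 2.5
ratio, p_value = vuong_power_law_vs_exponential(data, x_min=1.0)
print(f"R = {ratio:.1f}, p = {p_value:.3g}")
```

A positive R with a small p-value favors the power law over this particular alternative only; it says nothing about other candidates such as the log-normal.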

Heavy Tail vs Thin Tail
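One way to see the difference is to compare tail probabilities at multiples of the mean. The sketch below assumes α = 2.5 and x_min = 1 for the power law, and matches an exponential and a Normal distribution to the same mean. The Normal's standard deviation is set equal to the mean, which is an arbitrary choice made only for this comparison.

```python
from scipy import stats

ALPHA = 2.5                                   # assumed exponent, x_min = 1
power_law = stats.pareto(b=ALPHA - 1)         # scipy's shape b equals alpha - 1
mean = power_law.mean()                       # 3.0 for alpha = 2.5

# Thin-tailed comparisons matched to the same mean
exponential = stats.expon(scale=mean)
normal = stats.norm(loc=mean, scale=mean)     # standard deviation equal to the mean (arbitrary)

for k in (2, 5, 10):
    threshold = k * mean
    print(
        f"P(X > {k:2d} x mean):  "
        f"power law {power_law.sf(threshold):.2e}   "
        f"exponential {exponential.sf(threshold):.2e}   "
        f"normal {normal.sf(threshold):.2e}"
    )
```

Close to the mean the power law is not the riskiest of the three. The heavy tail shows up further out: at 10× the mean it assigns over a hundred times more probability than the exponential, and the Normal is negligible by comparison.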

Code

python

import numpy as np
from scipy import stats

rank = np.arange(1, 21)
frequency = np.array([450000, 280000, 190000, 140000, 110000, 88000, 73000, 62000, 53000, 46000,
                       40000, 35000, 31000, 28000, 25000, 22000, 20000, 18000, 16000, 15000])
corpus_tokens = 10_000_000  # total size of the anchor corpus

# Log-log transformation
log_rank = np.log10(rank)
log_freq = np.log10(frequency)

# Linear fit in log-log space (Zipf's law check)
slope, intercept, r_value, p_value, se = stats.linregress(log_rank, log_freq)
alpha_estimate = -slope
print("Zipf's law fit on log-log scale:")
print(f"  Estimated α = {alpha_estimate:.2f} (expected ≈ 1 for Zipf's law)")
print(f"  R² = {r_value**2:.2f}")

# Verify power law ratios
print("\nRank ratio verification:")
print(f"  freq(rank=1)/freq(rank=2) = {frequency[0]/frequency[1]:.2f} (expected: 2.0 for exact Zipf)")
print(f"  freq(rank=1)/freq(rank=10) = {frequency[0]/frequency[9]:.2f} (expected: 10.0 for exact Zipf)")

# Cumulative share of the whole corpus (the remaining tokens sit in the long tail)
cum_share = np.cumsum(frequency) / corpus_tokens
print("\nCumulative frequency share:")
for k in [1, 5, 10, 20]:
    print(f"  Top {k} words: {cum_share[k-1]*100:.1f}% of all tokens")

# Simplified MLE estimate of alpha (for a continuous power law), taking x_min as the
# smallest listed frequency. The top-20 counts are not a random sample, so this is illustrative.
x_min = frequency.min()
n = len(frequency)
alpha_mle = 1 + n / np.sum(np.log(frequency / x_min))
print(f"\nSimplified MLE estimate of α: {alpha_mle:.2f}")
print("  (Use the 'powerlaw' package for proper x_min estimation)")

text

Zipf's law fit on log-log scale:
  Estimated α = 1.19 (expected ≈ 1 for Zipf's law)
  R² = 0.98

Rank ratio verification:
  freq(rank=1)/freq(rank=2) = 1.61 (expected: 2.0 for exact Zipf)
  freq(rank=1)/freq(rank=10) = 9.78 (expected: 10.0 for exact Zipf)

Cumulative frequency share:
  Top 1 words: 4.5% of all tokens
  Top 5 words: 11.7% of all tokens
  Top 10 words: 14.9% of all tokens
  Top 20 words: 17.4% of all tokens

Simplified MLE estimate of α: 1.80
  (Use the 'powerlaw' package for proper x_min estimation)

Property Comparison

Property	Power Law	Exponential	Normal
Decay shape	x^{-α}	e^{-λx}	e^{-x²/2σ²}
Tail weight	Very heavy	Light	Very light
Characteristic scale	None (scale-free)	1/λ (mean)	μ (mean)
Mean	Finite iff α > 2	Always 1/λ	Always μ
Variance	Finite iff α > 3	Always 1/λ²	Always σ²
CLT applies?	Only if α > 3	Yes	Yes
Log-log plot	Straight line	Curved	Very curved

Related Concepts

Pareto distribution: the canonical parametric power law (see the next post). A Pareto distribution with shape α − 1 and scale x_min is the same as the continuous power law above for x ≥ x_min
Log-normal distribution: also right-skewed, but its extreme tail eventually decays faster than any power law, so its log-log plot curves downward
Preferential attachment: the stochastic process (Barabási-Albert model) that generates power-law degree distributions in networks

Limitations

Log-log linearity is not proof. Many distributions look linear in log-log over a limited range. Always test over the full observed range and use formal tests (Clauset et al.).
When α ≤ 2, classical statistics fails. Sample means don't converge; CLT doesn't apply. Any analysis that assumes finite variance is wrong.
Power laws have bounded validity ranges. The power law typically holds only above some x_min. Below x_min, the distribution is different. Misidentifying x_min leads to wrong α estimates and wrong probability calculations.
Confusing rank-frequency α with count-distribution α. The α ≈ 1.19 fitted to the anchor is the rank-frequency (Zipf) exponent. The corresponding count distribution's α is different. They should not be used interchangeably.

Test Your Understanding

For a power law with continuous PDF and α=2.3, x_min=1: compute P(X > 10). What is P(X > 20) / P(X > 10)? Interpret this ratio using scale invariance.
The anchor corpus has the top 20 words accounting for about 17.4% of all tokens (1.74M of 10M). The full vocabulary has ~100,000 unique words. What does this imply about the vocabulary distribution for NLP tokenizer design? Should a tokenizer optimize for common or rare words?
A web server's logs show user request counts following a power law with α=1.8. Does the mean number of requests per user exist? Does the variance? What does this imply for rate-limiting policy design?
You observe log-log linearity over 2 decades of x (say 10 to 1000). A colleague concludes the data is power-law distributed. What additional checks would you require before accepting this conclusion?
In the Clauset et al. test for power laws, what does a p-value of 0.05 mean — does it confirm the power law, reject it, or neither? Explain what it actually measures.
