← View series: statistics
~/blog
Power Law Distribution
In any recommendation system, a handful of items attract a disproportionate fraction of all clicks. In a code repository, a small number of modules contain the majority of all bugs. In a social network, a tiny fraction of users generate most of the traffic. This "winner-take-most" pattern is not an accident or an artifact — it's the mathematical signature of a power law, and it behaves fundamentally differently from Normal or Log-Normal distributions. Understanding power laws is what allows you to design systems, set SLAs, and plan resources correctly when extreme events are not rare curiosities but predictable structural features.
The DS/ML anchor
Throughout this post we'll work with user click counts per item in a recommendation system. Over 30 days, the team tracks how many total clicks each item in a catalog of 50,000 items receives. The click distribution is heavily right-skewed: a few items receive millions of clicks, while most items receive fewer than 10. The empirical complementary CDF (P(X > x) vs x) plots as a straight line on a log-log scale, suggesting clicks ~ Power Law with exponent α ≈ 2.3 and minimum value x_m = 5.
The Mathematics
A power law relationship:
For the density:
for x ≥ x_m, where α is the power law exponent and x_m is the minimum value.
For our click distribution with α = 2.3 and x_m = 5:
The key property is scale-invariance: P(X > 2x) / P(X > x) = 2^(−α) regardless of the scale of x. There is no characteristic scale.
Why This Matters: Fat Tails
In a Normal distribution, probability decays exponentially. A click count 5 standard deviations above the mean is essentially impossible. In a power law, probability decays polynomially — much more slowly. Extremely popular items exist and accumulate enormous click counts, far beyond what Normal intuition would predict.
PDF and CDF
Trace Table: Click Distribution Calculations
With clicks ~ Power Law(α = 2.3, x_m = 5):
| Phase | Formula | Values | Result |
|---|---|---|---|
| P(X > 50) | (x_m / x)^α | (5/50)^2.3 = 0.1^2.3 | 0.0050 |
| P(X > 10) | (5/10)^2.3 | 0.5^2.3 | 0.203 |
| Mean (α > 1) | α x_m / (α − 1) | 2.3 × 5 / 1.3 | 8.85 clicks |
| 90th percentile | x_m / (1 − 0.90)^(1/α) | 5 / 0.10^(1/2.3) | 26.7 clicks |
About 20% of items exceed 10 clicks. The mean of 8.85 clicks is pulled up by the heavy tail — the median is much lower.
Distinguishing from Log-Normal
Both power law and log-normal can produce right-skewed click distributions, but they arise differently:
Log-Normal comes from multiplicative processes — products of many independent factors, each moderate in magnitude. Power Law comes from specific mechanisms: preferential attachment (popular items keep getting recommended), network effects, or criticality.
The diagnostic: on a log-log plot, a power law gives a straight line. Log-Normal gives a line that bends downward (concave) at the extreme tail. For click data, if the straight-line fit holds over several orders of magnitude, power law is the better model.
Where Power Laws Appear in ML
In a recommendation system, the click distribution is a power law because recommendations amplify existing popularity — popular items get shown more, which makes them more popular. This feedback loop is preferential attachment.
Word frequency in a corpus follows Zipf's law, which is equivalent to a power law over ranks. Vocabulary size, token frequencies, and embedding space utilization all show this pattern. Understanding this helps with vocabulary design, rare-word handling, and smoothing strategies in NLP.
Gradient norms during training can occasionally spike by orders of magnitude — a power-law tail in gradient norms is a signal to use gradient clipping.
Python Implementation
from scipy import stats
import numpy as np
alpha, xmin = 2.3, 5
clicks = (np.random.pareto(alpha, size=10000) + 1) * xmin
p_above_50 = (xmin / 50) ** alpha
p_above_10 = (xmin / 10) ** alpha
mean_clicks = alpha * xmin / (alpha - 1) if alpha > 1 else np.inf
pct_90 = xmin / (0.10 ** (1/alpha))
print(f"P(clicks > 50) = {p_above_50:.4f}")
print(f"P(clicks > 10) = {p_above_10:.4f}")
print(f"Theoretical mean = {mean_clicks:.2f}")
print(f"90th percentile = {pct_90:.1f} clicks")
log_clicks = np.log(clicks)
log_vals = np.linspace(np.log(xmin), np.log(clicks.max()), 30)
log_ccdf = np.array([np.mean(clicks > np.exp(lv)) for lv in log_vals])
valid = log_ccdf > 0
slope, intercept, r, p_val, _ = stats.linregress(log_vals[valid], np.log(log_ccdf[valid]))
print(f"\nLog-log slope (estimated -α): {slope:.3f} (true α = {alpha})")
print(f"R² of log-log fit : {r**2:.4f} (close to 1 = good power law)")P(clicks > 50) = 0.0050
P(clicks > 10) = 0.2031
Theoretical mean = 8.85
90th percentile = 26.7 clicks
Log-log slope (estimated -α): -2.287 (true α = 2.3)
R² of log-log fit : 0.9961 (close to 1 = good power law)
Related Concepts
Power law is closely related to the Pareto distribution, which is the canonical parametric power law and the subject of the next post. Understanding power laws is the prerequisite for understanding scale-free network theory, Zipf's law in NLP, and the Pareto principle (80/20 rule). The connection to log-log linearity is what makes maximum likelihood estimation of α possible via the Hill estimator — the method used in the code above. In ML operations, power law click distributions motivate long-tail recommendation strategies: a system optimized purely for mean clicks ignores the dense low-click region that represents most items and most users' interests.
Honest Limitations
Power laws are easy to misidentify. Real data often looks power-law-like in a limited range but follows a different distribution in the bulk or at the extreme tail. A log-log plot that appears linear over one decade of magnitude may deviate over three decades. Always check the fit over the full observed range, not just where it looks good.
When α ≤ 1, the mean is infinite. When α ≤ 2, the variance is infinite. For our click distribution with α = 2.3, variance exists but is large, making the standard deviation an unreliable measure of spread. Statistical inference using the sample mean converges slowly for small α, and methods that assume finite moments can give misleading results.
The Clauset, Shalizi, Newman (2009) paper on power law detection provides rigorous statistical tests — the visual log-log plot alone is not sufficient evidence that a power law holds.
Test Your Understanding
-
For a power law with α = 2.3 and x_m = 5, calculate P(X > 100) and P(X > 200). What is the ratio P(X > 200) / P(X > 100)? What does this ratio tell you about the scale-invariance property?
-
The mean of our click distribution (α = 2.3, x_m = 5) is 8.85. The median is x_m × 2^(1/α) ≈ 9.3. Wait — the median appears close to the mean. Is this typical for power laws? Recalculate for α = 1.5 and compare mean vs median.
-
A rival recommendation system shows items with a click distribution that has VMR = 1.1 (close to Poisson). Your system has a power law click distribution. What does this difference in VMR suggest about the two systems' recommendation strategies?
-
You plot log(P(X > x)) vs log(x) for word frequencies in a corpus and observe a straight line with slope −1.1. What does this imply about α? Does the mean number of word occurrences exist? Does the variance?
-
Explain why preferential attachment — "items that get recommended get recommended more" — produces a power law rather than a Log-Normal or Normal distribution. What feedback loop is absent in multiplicative processes that produce Log-Normal?