Pareto Distribution

Apr 11, 2026•10 min read•By Mohammed Vasim

Statistics • Math • Data Science

Most API requests are tiny. A few are enormous. This asymmetry — where a small fraction of observations accounts for a disproportionate share of the total — is quantified by the Pareto distribution. Named after Vilfredo Pareto (1848–1923), who observed that ~80% of Italy's land was owned by ~20% of the population, it is the canonical parametric power law and the foundation for understanding resource concentration in ML systems.

What the Pareto Distribution Models

Three conditions characterize Pareto-distributed data:

Strict lower bound: values cannot go below x_min (there is a minimum possible value)
Heavy tail: large values occur more often than exponential decay would predict
Power-law tail: P(X > x) ∝ x^{-α} — the survival function follows a power law

DS/ML contexts where Pareto arises:

API request payload sizes (the anchor): small requests dominate, occasional huge ones stress infrastructure
User session durations: most sessions are short; a few last many hours — affects server resource allocation
Bug severity: a small fraction of bugs cause most downtime — motivates severity-weighted prioritization
Feature importance in models: a few features drive most predictive power — motivates feature selection
Training example difficulty: most examples are easy; a small fraction require many epochs to learn

Historical origin: Pareto discovered this pattern in land ownership. The "80/20 rule" (80% of outcomes from 20% of causes) is a special case — it corresponds to α ≈ 1.161, not a universal constant.

The DS/ML Anchor

API request payload sizes from a web service (in KB), x_min = 1.0 KB (protocol minimum):

text

request_sizes = [1.2, 1.8, 2.1, 1.5, 3.2, 4.7, 1.9, 2.8, 1.3, 6.1,
                 1.1, 2.5, 1.4, 8.9, 1.7, 3.8, 1.6, 15.4, 2.2, 45.3]

The 45.3 KB request is about 41× the smallest (1.1 KB). The sample mean (about 5.5 KB) is pulled far above the median (2.15 KB) by the tail.
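These summary figures are easy to check directly. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original post.

```python
import statistics

# Anchor sample from the text (request payload sizes in KB)
request_sizes = [1.2, 1.8, 2.1, 1.5, 3.2, 4.7, 1.9, 2.8, 1.3, 6.1,
                 1.1, 2.5, 1.4, 8.9, 1.7, 3.8, 1.6, 15.4, 2.2, 45.3]

sample_mean = statistics.mean(request_sizes)
sample_median = statistics.median(request_sizes)
max_to_min = max(request_sizes) / min(request_sizes)

print(f"mean      = {sample_mean:.3f} KB")
print(f"median    = {sample_median:.3f} KB")
print(f"max / min = {max_to_min:.1f}")
```

The mean comes out at more than twice the median, which is the usual signature of a heavy right tail.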

PDF and CDF

PDF: f(x; α, x_m) = α × x_m^α / x^{α+1} for x ≥ x_m > 0

Unpacking each component:

Component	Role
α × x_m^α	Normalization constant ensuring ∫f(x)dx = 1 over [x_m, ∞)
x^{α+1} in denominator	Power-law decay — larger x → smaller density
α (shape)	Tail weight: larger α = faster decay = lighter tail

CDF: F(x) = 1 − (x_m/x)^α for x ≥ x_m, and 0 below x_m. The closed form inverts directly, which makes quantiles and inverse-transform sampling straightforward.

Survival function (CCDF): S(x) = P(X > x) = (x_m/x)^α

The survival function is the most natural expression for heavy-tailed distributions — it directly answers "what is the probability of observing a value larger than x?"
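The three functions above, plus the quantile function obtained by inverting the CDF, fit in a few lines. This is a minimal sketch with hypothetical function names; α = 2 and x_m = 1 are illustrative values chosen only to make the numbers easy to read.

```python
import math

def pareto_pdf(x, alpha, x_m):
    """Density alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m**alpha / x**(alpha + 1)

def pareto_sf(x, alpha, x_m):
    """Survival function P(X > x) = (x_m / x)**alpha for x >= x_m, else 1."""
    if x < x_m:
        return 1.0
    return (x_m / x) ** alpha

def pareto_cdf(x, alpha, x_m):
    """CDF F(x) = 1 - S(x)."""
    return 1.0 - pareto_sf(x, alpha, x_m)

def pareto_quantile(p, alpha, x_m):
    """Inverse CDF: the x with F(x) = p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return x_m * (1.0 - p) ** (-1.0 / alpha)

alpha, x_m = 2.0, 1.0  # illustrative parameters
for x in (1.0, 2.0, 10.0):
    print(f"x = {x:>4}: pdf = {pareto_pdf(x, alpha, x_m):.4f}, "
          f"cdf = {pareto_cdf(x, alpha, x_m):.4f}, sf = {pareto_sf(x, alpha, x_m):.4f}")
print(f"median = {pareto_quantile(0.5, alpha, x_m):.4f}")
```

The quantile function is what you would use for provisioning questions such as "what size covers 99% of requests?".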

The 80/20 Rule Is a Special Case

The Pareto principle: 80% of outcomes come from 20% of causes. This is not a universal law — it is a consequence of a specific α.

Derivation: what fraction of the total is contributed by the top fraction p of the population (i.e., those above the 1−p quantile)?

For a Pareto distribution with α > 1 (so that the mean is finite), the share of the total contributed by the top fraction p is:

$\text{share}(p) = p^{(α - 1) / α}$

Setting the top 20% (p = 0.2) to contribute 80% (share = 0.8):

0.8 = 0.2^{(α−1)/α} → (α−1)/α = log(0.8)/log(0.2) ≈ 0.1386 → α = 1/(1 − 0.1386) = log(5)/log(4) ≈ 1.161

α	Top 20% contributes
1.0 (limiting case)	100% (the mean diverges, so the tail dominates)
1.161	80% (the 80/20 rule)
1.387	~64%
2.0	~45%
3.0	~34%
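The share calculation is simple to evaluate numerically. The sketch below assumes the standard Pareto result that, for α > 1, the top fraction p of the population holds p^((α−1)/α) of the total; the function names are hypothetical.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p of a Pareto population."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return p ** ((alpha - 1) / alpha)

def alpha_for_rule(p, share):
    """Solve share = p**((alpha - 1) / alpha) for alpha."""
    k = math.log(share) / math.log(p)  # k = (alpha - 1) / alpha
    return 1.0 / (1.0 - k)

alpha_80_20 = alpha_for_rule(0.2, 0.8)
print(f"alpha for the 80/20 rule: {alpha_80_20:.3f}")
for a in (alpha_80_20, 1.387, 2.0, 3.0):
    print(f"alpha = {a:.3f}: top 20% holds {top_share(0.2, a):.1%}")
```

The same two functions answer other concentration questions, for example `alpha_for_rule(0.1, 0.9)` for a hypothetical "90/10" rule.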

Mean, Variance, and the Moments Table

Mean: E[X] = α × x_m / (α − 1), finite only when α > 1

Variance: Var[X] = x_m² × α / [(α−1)² × (α−2)], finite only when α > 2

α range	Mean	Variance	Skewness	CLT applies?
0 < α ≤ 1	∞	∞	∞	No — extreme value theory
1 < α ≤ 2	Finite	∞	∞	No — robust statistics
2 < α ≤ 3	Finite	Finite	∞	Yes, but convergence is slow
α > 3	Finite	Finite	Finite	Yes — standard statistics

Implication: when the estimated α falls in (1, 2), the theoretical variance is infinite. The sample variance you compute will not stabilize as you collect more data; it jumps upward whenever a new extreme request appears. Standard errors and normal-theory confidence intervals built from that sample variance are therefore unreliable for such data.
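One way to see this instability is to simulate it. The sketch below is an illustration rather than part of the original analysis: it draws Pareto samples by inverse-transform sampling and prints the sample variance at growing sample sizes. The shape values 1.5 and 3.5, the seed, and the helper names are all assumptions chosen for the demonstration.

```python
import math
import random

def pareto_mean(alpha, x_m):
    """E[X] = alpha * x_m / (alpha - 1); infinite when alpha <= 1."""
    return alpha * x_m / (alpha - 1) if alpha > 1 else math.inf

def pareto_variance(alpha, x_m):
    """Var[X] = x_m^2 * alpha / ((alpha - 1)^2 * (alpha - 2)); infinite when alpha <= 2."""
    if alpha <= 2:
        return math.inf
    return x_m**2 * alpha / ((alpha - 1) ** 2 * (alpha - 2))

def sample_pareto(n, alpha, x_m, rng):
    """Inverse-transform sampling: x = x_m * (1 - u)**(-1/alpha), u in [0, 1)."""
    return [x_m * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def sample_variance(xs):
    """Unbiased sample variance (ddof = 1)."""
    m = math.fsum(xs) / len(xs)
    return math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
for alpha in (1.5, 3.5):  # infinite-variance case vs finite-variance case
    draws = sample_pareto(100_000, alpha, 1.0, rng)
    print(f"alpha = {alpha}: theoretical variance = {pareto_variance(alpha, 1.0)}")
    for n in (100, 1_000, 10_000, 100_000):
        print(f"  n = {n:>7}: sample variance = {sample_variance(draws[:n]):.3f}")
```

For the finite-variance case the printed values should settle near the theoretical figure; for α = 1.5 there is no finite value for them to settle on.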

MLE for α

Exact closed-form MLE (given known x_min):

$\hat{\alpha} = \frac{n}{\sum_{i=1}^{n} \ln(x_i / x_{\min})}$

Step-by-step on anchor (n=20, x_min=1.0):

xᵢ	ln(xᵢ/1.0)
1.2	0.182
1.8	0.588
45.3	3.813
...	...

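Sum of all 20 log ratios ≈ 21.58

α̂ = 20 / 21.58 ≈ 0.927

Since α̂ < 1: under this fit neither the mean nor the variance of the request size distribution is finite. With only n = 20 observations the estimate is noisy (its asymptotic standard error is α̂/√n ≈ 0.21), so a true α slightly above 1 cannot be ruled out.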
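The table above lists only three of the twenty log ratios. The sketch below computes all of them with the standard library and applies the closed-form estimator. The standard-error line uses the usual large-sample approximation α̂/√n, which is an addition here and only a rough guide at n = 20.

```python
import math

# Anchor sample from the text (KB) and the known lower bound
request_sizes = [1.2, 1.8, 2.1, 1.5, 3.2, 4.7, 1.9, 2.8, 1.3, 6.1,
                 1.1, 2.5, 1.4, 8.9, 1.7, 3.8, 1.6, 15.4, 2.2, 45.3]
x_min = 1.0

# One log ratio per observation
log_ratios = [math.log(x / x_min) for x in request_sizes]
for x, lr in zip(request_sizes, log_ratios):
    print(f"{x:>5.1f}  {lr:.3f}")

n = len(request_sizes)
log_sum = math.fsum(log_ratios)
alpha_hat = n / log_sum
std_err = alpha_hat / math.sqrt(n)  # asymptotic approximation

print(f"sum of log ratios = {log_sum:.2f}")
print(f"alpha_hat = {alpha_hat:.3f} (approx. std. error {std_err:.2f})")
```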

scipy convention: scipy.stats.pareto uses parameters b (= α) and scale (= x_min). Always specify scale=x_min explicitly (the default is scale=1) and leave loc at its default of 0 so that the support starts at x_min.

Pareto vs Exponential — The Key Distinction

Both model positive, right-skewed data. The choice determines whether extreme values are plausible or essentially impossible.

Property	Pareto	Exponential
PDF	α×x_m^α/x^{α+1}	λe^{-λx}
Tail decay	Power law (slow)	Exponential (fast)
Memoryless	No	Yes
Mean	Finite iff α>1	Always 1/λ
Variance	Finite iff α>2	Always 1/λ²
Use when	File sizes, wealth, request sizes	Inter-arrival times at constant rate
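To make the tail-decay and memorylessness rows concrete, the sketch below compares a Pareto and an exponential distribution that share the same mean. The parameters (α = 1.5, x_m = 1, hence mean 3 and λ = 1/3) are illustrative assumptions, not values fitted to the anchor data.

```python
import math

def pareto_sf(x, alpha, x_m):
    """P(X > x) for a Pareto distribution."""
    return 1.0 if x < x_m else (x_m / x) ** alpha

def exponential_sf(x, lam):
    """P(X > x) for an exponential distribution with rate lam."""
    return math.exp(-lam * x) if x > 0 else 1.0

alpha, x_m = 1.5, 1.0                    # illustrative Pareto parameters
mean = alpha * x_m / (alpha - 1)         # shared mean of both models
lam = 1.0 / mean                         # exponential rate with the same mean

print(f"{'x':>5}  {'Pareto P(X>x)':>15}  {'Exponential P(X>x)':>20}")
for x in (5, 10, 50, 100):
    print(f"{x:>5}  {pareto_sf(x, alpha, x_m):>15.3e}  {exponential_sf(x, lam):>20.3e}")

# Memorylessness check: compare P(X > s + t | X > s) with P(X > t)
s, t = 10.0, 5.0
exp_cond = exponential_sf(s + t, lam) / exponential_sf(s, lam)
par_cond = pareto_sf(s + t, alpha, x_m) / pareto_sf(s, alpha, x_m)
print(f"Exponential: conditional = {exp_cond:.4f}, unconditional = {exponential_sf(t, lam):.4f}")
print(f"Pareto:      conditional = {par_cond:.4f}, unconditional = {pareto_sf(t, alpha, x_m):.4f}")
```

Far in the tail the two survival probabilities differ by many orders of magnitude even though the means match. For the Pareto model, having already exceeded s makes exceeding s + t more likely than exceeding t was at the start.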

Diagnostic test:

If log(S(x)) vs log(x) is linear → Pareto (power law tail)
If log(S(x)) vs x is linear → Exponential (exponential tail)
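The two checks can be roughed out numerically by comparing how well a straight line fits each pair of axes. The sketch below uses the empirical survival function of the anchor sample and R² as a crude stand-in for eyeballing the plots. It is an informal screen under those assumptions, not a formal goodness-of-fit test, and the helper name is hypothetical.

```python
import math

request_sizes = [1.2, 1.8, 2.1, 1.5, 3.2, 4.7, 1.9, 2.8, 1.3, 6.1,
                 1.1, 2.5, 1.4, 8.9, 1.7, 3.8, 1.6, 15.4, 2.2, 45.3]

def r_squared(xs, ys):
    """R^2 of the least-squares line through (xs, ys), via squared correlation."""
    n = len(xs)
    mx, my = math.fsum(xs) / n, math.fsum(ys) / n
    sxx = math.fsum((x - mx) ** 2 for x in xs)
    syy = math.fsum((y - my) ** 2 for y in ys)
    sxy = math.fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

xs = sorted(request_sizes)
n = len(xs)
# Empirical survival evaluated as P(X >= x_(i)) = (n - i) / n for i = 0..n-1,
# which avoids log(0) at the largest observation (the sample has no ties).
log_s = [math.log((n - i) / n) for i in range(n)]
log_x = [math.log(x) for x in xs]

r2_loglog = r_squared(log_x, log_s)   # linear here suggests a power-law tail
r2_semilog = r_squared(xs, log_s)     # linear here suggests an exponential tail

print(f"R^2, log S vs log x: {r2_loglog:.3f}")
print(f"R^2, log S vs x:     {r2_semilog:.3f}")
```

On this sample the log-log fit is clearly the straighter of the two, which is consistent with a power-law tail but does not prove it.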

Assumption Violations

1. x_min is wrong. If x_min is set too low, you fit a power law to the body of the data, where the distribution is not Pareto. Fix: use the Clauset-Shalizi-Newman procedure, which estimates x_min by minimizing the KS distance between the data above x_min and the fitted power law.

2. Bounded range. If there is a hard upper limit (e.g., request size cannot exceed 10 MB), the Pareto model is misspecified. Use a truncated Pareto or another bounded distribution.

3. Mixed populations. If two request types (GET vs POST) have different α values, fitting a single α to the combined data gives a blended estimate that describes neither population. Fit each type separately.

4. Log-log linearity is not proof. A straight line over 1–2 decades is necessary but not sufficient. Use KS goodness-of-fit testing with the Clauset et al. procedure.
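The x_min scan mentioned in item 1 can be sketched as follows. This is a simplified illustration of the idea (try each observed value as a threshold, fit α on the tail, keep the threshold with the smallest KS distance) and omits the bootstrap goodness-of-fit step of the full procedure. The function names, the `min_tail` cutoff, and the synthetic data are all assumptions made for the example.

```python
import math
import random

def fit_alpha(tail, x_min):
    """Closed-form MLE for alpha on observations >= x_min; None if undefined."""
    log_sum = math.fsum(math.log(x / x_min) for x in tail)
    return len(tail) / log_sum if log_sum > 0 else None

def ks_distance(tail, alpha, x_min):
    """Largest gap between the tail's empirical CDF and the fitted Pareto CDF."""
    xs = sorted(tail)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - (x_min / x) ** alpha
        # Compare with the empirical CDF just before and just after its jump at x
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst

def estimate_x_min(data, min_tail=10):
    """Return (x_min, alpha, ks) for the threshold with the smallest KS distance."""
    best = None
    for candidate in sorted(set(data)):
        tail = [x for x in data if x >= candidate]
        if len(tail) < min_tail:
            break  # too few points left for a meaningful fit
        alpha = fit_alpha(tail, candidate)
        if alpha is None:
            continue
        distance = ks_distance(tail, alpha, candidate)
        if best is None or distance < best[2]:
            best = (candidate, alpha, distance)
    return best

# Synthetic data: a non-power-law body below 2.0 and a Pareto(2.5) tail above it
rng = random.Random(42)
body = [rng.uniform(0.5, 2.0) for _ in range(300)]
tail = [2.0 * (1.0 - rng.random()) ** (-1.0 / 2.5) for _ in range(300)]
data = body + tail

x_min_hat, alpha_hat, ks = estimate_x_min(data)
print(f"estimated x_min = {x_min_hat:.3f}, alpha = {alpha_hat:.3f}, KS = {ks:.4f}")
```

With samples as small as the 20-point anchor, the scan has little to work with, so treat its output there with caution.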

Code

python

import numpy as np
from scipy import stats

request_sizes = np.array([1.2, 1.8, 2.1, 1.5, 3.2, 4.7, 1.9, 2.8, 1.3, 6.1,
                           1.1, 2.5, 1.4, 8.9, 1.7, 3.8, 1.6, 15.4, 2.2, 45.3])
x_min = 1.0  # protocol minimum

# MLE for alpha
log_ratios = np.log(request_sizes / x_min)
alpha_mle = len(request_sizes) / log_ratios.sum()
print(f"MLE estimate: α = {alpha_mle:.3f}")
print("(α > 1: finite mean; α > 2: finite variance)")

# Theoretical statistics
if alpha_mle > 1:
    mean_theory = alpha_mle * x_min / (alpha_mle - 1)
    print(f"Theoretical mean (α>1): {mean_theory:.3f} KB")
else:
    print("Theoretical mean: infinite (α ≤ 1)")

if alpha_mle > 2:
    var_theory = x_min**2 * alpha_mle / ((alpha_mle - 1)**2 * (alpha_mle - 2))
    print(f"Theoretical variance (α>2): {var_theory:.3f}")
else:
    print("Theoretical variance: infinite (α ≤ 2)")

# Sample statistics
print(f"\nSample mean: {request_sizes.mean():.3f} KB")
print(f"Sample std: {request_sizes.std(ddof=1):.3f} KB")
print(f"Sample max: {request_sizes.max():.1f} KB")

# 80/20 check: what fraction of size comes from top 20% of requests?
n20 = int(0.20 * len(request_sizes))
top20_sorted = np.sort(request_sizes)[-n20:]
pct_size = top20_sorted.sum() / request_sizes.sum()
print(f"\nTop 20% of requests ({n20} requests) account for {pct_size*100:.1f}% of total size")

# Survival function: P(X > x) = 1 - CDF
x_query = 10.0
surv = (x_min / x_query) ** alpha_mle
print(f"\nP(request size > {x_query} KB) = ({x_min}/{x_query})^{alpha_mle:.3f} = {surv:.4f}")

# Scipy Pareto (b = alpha, scale = x_min)
pareto_dist = stats.pareto(b=alpha_mle, scale=x_min)
print(f"Scipy verification: P(X > {x_query}) = {pareto_dist.sf(x_query):.4f}")

text

MLE estimate: α = 0.927
(α > 1: finite mean; α > 2: finite variance)
Theoretical mean: infinite (α ≤ 1)
Theoretical variance: infinite (α ≤ 2)

Sample mean: 5.525 KB
Sample std: 9.963 KB
Sample max: 45.3 KB

Top 20% of requests (4 requests) account for 68.5% of total size

P(request size > 10.0 KB) = (1.0/10.0)^0.927 = 0.1183
Scipy verification: P(X > 10.0) = 0.1183

Parameter Reference

Parameter	Symbol	Role	Effect of Increasing
Shape (tail index)	α	Controls tail weight	Lighter tail, more concentrated near x_min
Scale (minimum)	x_m	Lower bound of support	Shifts entire distribution right

Related Concepts

Power law distribution: the general concept; Pareto is its canonical parametric form
Lorenz curve / Gini coefficient: summary statistics for inequality derived directly from Pareto α
Extreme value theory: the mathematical framework for analyzing distributions when α ≤ 2 and classical statistics breaks down
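For the Lorenz curve and Gini coefficient, the standard closed forms for a Pareto distribution with α > 1 are L(F) = 1 − (1 − F)^(1 − 1/α) and G = 1/(2α − 1). The sketch below simply evaluates them. The formulas are textbook results rather than something derived in this post, and the function names are hypothetical.

```python
import math

def lorenz(F, alpha):
    """Share of the total held by the bottom fraction F (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("Lorenz curve needs a finite mean, i.e. alpha > 1")
    if not 0.0 <= F <= 1.0:
        raise ValueError("F must be in [0, 1]")
    return 1.0 - (1.0 - F) ** (1.0 - 1.0 / alpha)

def gini(alpha):
    """Gini coefficient of a Pareto distribution (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("Gini coefficient needs a finite mean, i.e. alpha > 1")
    return 1.0 / (2.0 * alpha - 1.0)

alpha_80_20 = math.log(5) / math.log(4)  # the shape behind the 80/20 rule
for a in (alpha_80_20, 1.5, 3.0):
    print(f"alpha = {a:.3f}: bottom 80% holds {lorenz(0.8, a):.1%}, Gini = {gini(a):.3f}")
```

Larger α pushes the Gini coefficient toward 0, matching the "lighter tail, more concentrated near x_min" entry in the parameter table.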

Limitations

When α ≤ 2, sample variance never stabilizes. As more data arrives, larger extreme values appear and inflate the variance. Standard errors computed from such data are meaningless.
The 80/20 rule is not universal. In the anchor sample, the top 20% of requests account for about 68.5% of total size, not 80%. For a Pareto distribution with a finite mean, the specific ratio depends entirely on α.
Pareto requires exact power-law tails. Real distributions are often only approximately Pareto in a range. Always validate with formal goodness-of-fit testing before relying on Pareto-derived tail probabilities.

Test Your Understanding

For the anchor (α̂ ≈ 0.927, x_min = 1.0 KB): compute P(request size > 20 KB) and the 99th percentile request size. What does the 99th percentile tell you about infrastructure provisioning?
The theoretical variance is infinite for α̂ ≈ 0.927. You compute the sample variance from the 20 observations and get about 99.3 KB². Is this estimate stable? What happens to this estimate as you collect more data?
A colleague computes the sample average and reports "the mean request size is 5.5 KB." Another colleague says "the mean is meaningless for this data." Under what conditions would the second colleague be right? For our α̂ ≈ 0.927, is the mean meaningful?
Derive: for Pareto with α=2 and x_min=1, what fraction of total size is contributed by the top 10% of requests? Use the Lorenz curve formula.
You observe request sizes and fit two models: Pareto(α=0.927, x_min=1) and Exponential(λ=0.181), where λ is the reciprocal of the sample mean (~5.5 KB). How would you decide which fits better? Describe two specific diagnostic tests.
