Types of Probability Distributions

Apr 11, 2026 • 14 min read • By Mohammed Vasim

Statistics, Math, Data Science

Every time you choose a model in ML (a loss function, a likelihood, a prior), you're implicitly choosing a distribution. A squared-error loss assumes your residuals are normally distributed. A binary cross-entropy loss assumes Bernoulli outputs. A Poisson regression assumes counts with mean equal to variance. Making that choice explicitly, rather than accepting defaults, is what separates a practitioner who understands their model from one who debugs blindly.

This post is a mental map: which distribution to reach for, why, and how the distributions connect.

Anchor datasets used throughout the post (one per data type):

python

# Discrete: API errors per 10-request batch
errors_per_batch = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1]  # 10 batches, mean = 0.8

# Continuous: model inference latency (right-skewed, ms)
latency_ms = [22, 25, 27, 23, 29, 31, 26, 24, 28, 67]  # 10 measurements

# Binary: single binary predictions
predictions = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # correct=1, wrong=0

The Organizing Taxonomy

Distributions organize along two axes:

Discrete vs Continuous — the type of values they model (counts vs measurements)
Bounded vs Unbounded — whether the support has finite limits

The fundamental split: discrete or continuous? Counting API errors per batch → discrete. Measuring inference latency in ms → continuous. This first question determines which mathematical tools apply.

Discrete Distributions

Bernoulli(p)

Models a single binary trial: success (1) or failure (0) with probability p.

Mean = p, Var = p(1−p)
ML example: one inference call is correct (1) or wrong (0). From the predictions anchor: p = 7/10 = 0.70
Special case: Binomial with n=1

Binomial(n, p)

Models the number of successes in n independent Bernoulli trials. You know the total number of trials in advance.

Mean = np, Var = np(1−p)
ML example: out of n=100 test examples, how many does the model predict correctly? With p=0.70, expected correct = 70
Relationship: as n→∞, p→0 with np=λ → Poisson(λ). By CLT, approaches Normal(np, np(1−p)) for large n
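
As a quick check of the Binomial moments and the Poisson limit, here is a minimal sketch using scipy.stats. The rare-event values (n=500, p=0.004) are illustrative assumptions, not numbers from the anchors.

```python
from scipy import stats

# Binomial(n=100, p=0.70): correct predictions on 100 test examples
n, p = 100, 0.70
binom_mean = stats.binom.mean(n, p)   # n * p
binom_var = stats.binom.var(n, p)     # n * p * (1 - p)
print(f"mean={binom_mean:.1f}, var={binom_var:.1f}")

# Poisson limit: large n, small p, lambda = n * p (illustrative values)
n_rare, p_rare = 500, 0.004
lam = n_rare * p_rare
max_gap = max(
    abs(stats.binom.pmf(k, n_rare, p_rare) - stats.poisson.pmf(k, lam))
    for k in range(11)
)
print(f"largest PMF gap for k=0..10: {max_gap:.5f}")
```

The gap shrinks further as n grows and p falls with np held fixed, which is the regime where swapping Binomial for Poisson is safe.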

Poisson(λ)

Models the number of events in a fixed interval when events occur independently at constant rate λ.

Mean = λ, Var = λ (mean equals variance — the key diagnostic)
ML example: API errors per 10-request batch. From the errors_per_batch anchor: λ̂ = mean = (0+1+0+2+0+1+0+0+3+1)/10 = 0.8
Relationship: limit of Binomial as n→∞, p→0, np=λ. Poisson(λ) → Normal(λ, λ) for large λ
When to use vs Binomial: when the number of trials is not fixed — you're counting events in a time or space interval

Geometric(p)

Models the number of trials until the first success.

Mean = 1/p, Var = (1−p)/p²
ML example: hyperparameter search trials until finding one that beats baseline. With p=0.15 chance of improvement per trial: expected trials = 1/0.15 ≈ 7
Memoryless property: P(X > m+n | X > m) = P(X > n). Past failures carry no information about when the next success arrives
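
The memoryless property is easy to verify numerically. This sketch assumes SciPy's geom, which uses the same "trials up to and including the first success" convention as above. The values m=5 and n=3 are arbitrary.

```python
from scipy import stats

p = 0.15  # chance that a single search trial beats the baseline

# Expected number of trials until the first success: 1 / p
expected_trials = stats.geom.mean(p)

# sf(k) is the survival function P(X > k) = (1 - p) ** k
m, n = 5, 3
lhs = stats.geom.sf(m + n, p) / stats.geom.sf(m, p)  # P(X > m+n | X > m)
rhs = stats.geom.sf(n, p)                            # P(X > n)

print(f"expected trials = {expected_trials:.2f}")
print(f"P(X > m+n | X > m) = {lhs:.6f}, P(X > n) = {rhs:.6f}")
```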

Negative Binomial(r, p)

Models the number of trials until r successes. Generalization of Geometric (which is Negative Binomial with r=1).

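Mean = r/p, Var = r(1−p)/p² (counting trials; the failures-only convention used in count regression and by many libraries has mean r(1−p)/p and the same variance)
ML example: how many mini-batch updates you run until r of them improve the loss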
Key use case: overdispersed count data where Var > Mean. Poisson requires Var = Mean exactly. When counts (comments per post, purchases per user) show Var > Mean, Negative Binomial fits better
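
One practical wrinkle: SciPy's nbinom counts failures before the r-th success, not total trials, so the two conventions differ by the constant r. A minimal sketch, with r=3 and p=0.15 chosen purely for illustration:

```python
from scipy import stats

r, p = 3, 0.15  # illustrative: wait for 3 improvements at a 15% chance each

# scipy.stats.nbinom models the number of FAILURES before the r-th success
failures_mean = stats.nbinom.mean(r, p)   # r * (1 - p) / p
failures_var = stats.nbinom.var(r, p)     # r * (1 - p) / p**2

# Total trials = failures + r, so the mean shifts by r and the variance is unchanged
trials_mean = failures_mean + r           # r / p

# In the failures form, variance / mean = 1 / p, which always exceeds 1
dispersion_ratio = failures_var / failures_mean

print(f"failures: mean={failures_mean:.2f}, var={failures_var:.2f}")
print(f"trials:   mean={trials_mean:.2f}")
print(f"var/mean = {dispersion_ratio:.2f}")
```

That ratio above 1 is why the failures form is the one used for overdispersed count data.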

Hypergeometric(N, K, n)

Sampling without replacement from a population of N items where K are "successes." Unlike Binomial, the probability changes with each draw.

ML example: sampling 50 validation examples from a test set of 200 where 40 are rare positives. The probability of getting exactly k positives depends on what's already been drawn
When to use vs Binomial: finite population + sampling without replacement. As N grows large relative to n, Hypergeometric ≈ Binomial (each draw barely changes the remaining proportion of successes)
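
The validation-set example can be sketched with scipy.stats.hypergeom. Note SciPy's argument order, which differs from the (N, K, n) heading: population size first, then successes in the population, then number of draws. The Binomial comparison uses p = 40/200 and is only an approximation.

```python
from scipy import stats

population, positives, draws = 200, 40, 50

# SciPy order: hypergeom(M=population size, n=successes in population, N=draws)
hyper = stats.hypergeom(population, positives, draws)
binom = stats.binom(draws, positives / population)

print(f"Hypergeometric: mean={hyper.mean():.2f}, var={hyper.var():.3f}")
print(f"Binomial:       mean={binom.mean():.2f}, var={binom.var():.3f}")
print(f"P(exactly 10 positives): hyper={hyper.pmf(10):.4f}, binom={binom.pmf(10):.4f}")
```

Both have the same mean, but drawing without replacement shrinks the variance by the finite-population factor (N−n)/(N−1).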

Continuous Distributions

Uniform(a, b)

All values in [a, b] equally likely. f(x) = 1/(b−a).

Mean = (a+b)/2, Var = (b−a)²/12
ML example: uniform hyperparameter search — sampling learning rate uniformly from [1e-5, 1e-2]
Maximum entropy choice when you know only the range and nothing else about the shape

Normal(μ, σ²)

The bell curve. Symmetric, unbounded in both directions.

Mean = μ, Var = σ²
ML example: model prediction residuals from a well-fit linear regression; feature values after standardization; gradient noise in SGD (approximately Normal for large batches by CLT)
When NOT to use: skewed data (use Log-Normal), count data (use Poisson), strictly positive data (use Exponential, Gamma)
Relationships: by the CLT, the standardized sum of many i.i.d. RVs with finite variance approaches Normal regardless of the source distribution. t-distribution → Normal as degrees of freedom → ∞
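
A small simulation makes the CLT claim concrete. This is a sketch under assumed settings (Exponential source with mean 1, sums of 50 draws, fixed seed), not a proof: the skewness of the sums should sit far closer to 0 than the skewness of a single draw.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Source distribution: Exponential with mean 1 (strongly right-skewed)
draws = rng.exponential(scale=1.0, size=(20_000, 50))

single = draws[:, 0]        # one draw per row
sums = draws.sum(axis=1)    # sum of 50 i.i.d. draws per row

skew_single = stats.skew(single)
skew_sums = stats.skew(sums)

print(f"skewness of single draws: {skew_single:.2f}")
print(f"skewness of sums of 50:   {skew_sums:.2f}")
```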

Exponential(λ)

Models time between events in a Poisson process with rate λ.

Mean = 1/λ, Var = 1/λ²
ML example: time between model API errors; user session duration; time between purchases
Memoryless: P(X > s+t | X > s) = P(X > t). Waiting time has no "memory" of how long you've already waited
Dual relationship: inter-arrival times are Exponential(λ) ↔ event counts per unit time are Poisson(λ) (counts over an interval of length t are Poisson(λt))
When to use: time-to-event with constant hazard rate (failure rate doesn't increase with age)
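
Both the memoryless property and the Poisson duality can be checked directly. This sketch assumes a rate of 0.8 events per unit time (borrowed from the batch anchor for illustration) and arbitrary times s and t. SciPy parameterizes Exponential by scale = 1/λ.

```python
from scipy import stats

lam = 0.8                          # rate: events per unit time (illustrative)
wait = stats.expon(scale=1 / lam)  # SciPy uses scale = 1 / rate

# Memoryless: P(X > s+t | X > s) equals P(X > t)
s, t = 2.0, 1.5
memoryless_lhs = wait.sf(s + t) / wait.sf(s)
memoryless_rhs = wait.sf(t)

# Duality: "first event arrives after t" is the same event as "zero events in [0, t]"
p_wait_longer = wait.sf(t)
p_zero_events = stats.poisson.pmf(0, lam * t)

print(f"memoryless: {memoryless_lhs:.6f} vs {memoryless_rhs:.6f}")
print(f"duality:    {p_wait_longer:.6f} vs {p_zero_events:.6f}")
```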

Gamma(k, θ)

Generalizes Exponential. For integer k, models the sum of k independent Exponential variables with rate 1/θ (mean θ each).

Mean = kθ, Var = kθ²
ML example: total time for k sequential API calls, each following Exponential
Special cases: k=1 → Exponential(1/θ). k=n/2, θ=2 → Chi-square(n) — directly used in hypothesis testing
When to use: positively skewed continuous data with a known lower bound of 0, with no fixed upper bound
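
The two special cases can be confirmed by comparing densities on a grid. A minimal sketch, with θ=2 and n=5 degrees of freedom picked arbitrarily:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10, 50)
theta = 2.0

# k = 1: Gamma(1, theta) is Exponential with rate 1/theta (SciPy scale = theta)
gap_expon = np.max(np.abs(
    stats.gamma.pdf(x, a=1, scale=theta) - stats.expon.pdf(x, scale=theta)
))

# k = n/2, theta = 2: Gamma is Chi-square with n degrees of freedom
n = 5
gap_chi2 = np.max(np.abs(
    stats.gamma.pdf(x, a=n / 2, scale=2) - stats.chi2.pdf(x, df=n)
))

# Moments for an illustrative k = 3: mean = k * theta, var = k * theta**2
gamma_mean = stats.gamma.mean(a=3, scale=theta)
gamma_var = stats.gamma.var(a=3, scale=theta)

print(f"max PDF gap vs Exponential: {gap_expon:.2e}")
print(f"max PDF gap vs Chi-square:  {gap_chi2:.2e}")
```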

Log-Normal(μ, σ²)

If log(X) ~ Normal(μ, σ²), then X ~ Log-Normal.

Mean = exp(μ + σ²/2), Var = (exp(σ²)−1) × exp(2μ + σ²)
ML example: inference latency (right-skewed, positive-only); file sizes; salary distributions. The latency_ms anchor suggests Log-Normal: most values cluster 22–31ms but one outlier at 67ms creates a right tail
Key property: multiplicative processes → Log-Normal. Additive processes → Normal (CLT)
Relationship: log transform makes it Normal — fit Normal models after log-transforming the data
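
SciPy's lognorm takes s = σ and scale = exp(μ), which is easy to get wrong. The sketch below checks that parameterization against the closed-form moments above. The values μ=3.3 and σ=0.3 are illustrative assumptions, not fitted to the anchor.

```python
import numpy as np
from scipy import stats

mu, sigma = 3.3, 0.3  # illustrative log-scale parameters, not fitted values

# SciPy parameterization: s is sigma, scale is exp(mu)
dist = stats.lognorm(s=sigma, scale=np.exp(mu))

mean_formula = np.exp(mu + sigma**2 / 2)
var_formula = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)

print(f"mean:   scipy={dist.mean():.3f}, formula={mean_formula:.3f}")
print(f"var:    scipy={dist.var():.3f}, formula={var_formula:.3f}")
print(f"median: scipy={dist.median():.3f}, exp(mu)={np.exp(mu):.3f}")
```

The mean sitting above the median is the right tail at work: a few large values pull the average up.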

Beta(α, β)

Defined on [0, 1]. Models probabilities, proportions, and rates.

Mean = α/(α+β), Var = αβ/((α+β)²(α+β+1))
ML example: Bayesian prior for a model's accuracy (bounded between 0 and 1). Conjugate prior for the Binomial likelihood — if prior is Beta(α,β) and you observe k successes in n trials, posterior is Beta(α+k, β+n−k)
Shape: α=β=1 → Uniform[0,1]. α>β → mass shifts toward 1. α<β → mass shifts toward 0. α,β large → approximately Normal
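
The conjugate update takes a few lines of code. This sketch applies it to the predictions anchor, assuming a flat Beta(1, 1) prior. That prior is an illustrative choice, not a recommendation.

```python
from scipy import stats

predictions = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # correct=1, wrong=0
k, n = sum(predictions), len(predictions)

# Assumed flat prior: Beta(1, 1) is Uniform[0, 1]
alpha_prior, beta_prior = 1, 1

# Conjugate update: add successes to alpha, failures to beta
alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)

posterior = stats.beta(alpha_post, beta_post)
post_mean = posterior.mean()
lo, hi = posterior.interval(0.95)  # central 95% credible interval

print(f"posterior: Beta({alpha_post}, {beta_post})")
print(f"posterior mean accuracy = {post_mean:.3f}")
print(f"95% credible interval = [{lo:.3f}, {hi:.3f}]")
```

With only 10 observations the interval is wide, which the point estimate p = 0.70 alone would hide.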

Weibull(k, λ)

Generalization of Exponential with a varying hazard rate.

Mean = λ·Γ(1+1/k), Var = λ²[Γ(1+2/k) − (Γ(1+1/k))²]
ML example: time to model degradation in production. Component failure modeling when wear increases over time
Shape parameter k: k=1 → Exponential (constant hazard). k>1 → increasing hazard (things get worse over time). k<1 → decreasing hazard (infant mortality — early failures but survivors are robust)
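
The three hazard regimes can be seen by computing h(t) = pdf(t) / survival(t) directly. A minimal sketch using scipy.stats.weibull_min, where the helper name hazard, the scale λ=1, and the time grid are all assumptions for illustration:

```python
import numpy as np
from scipy import stats

def hazard(t, k, lam=1.0):
    """Hazard rate h(t) = pdf(t) / sf(t) for Weibull with shape k and scale lam."""
    dist = stats.weibull_min(c=k, scale=lam)
    return dist.pdf(t) / dist.sf(t)

t = np.array([0.5, 1.0, 2.0])

h_decreasing = hazard(t, k=0.5)  # k < 1: early failures, robust survivors
h_constant = hazard(t, k=1.0)    # k = 1: Exponential, constant hazard
h_increasing = hazard(t, k=2.0)  # k > 1: wear-out, hazard grows with age

print("k=0.5:", np.round(h_decreasing, 3))
print("k=1.0:", np.round(h_constant, 3))
print("k=2.0:", np.round(h_increasing, 3))
```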

How Distributions Connect

How to Choose a Distribution

Distribution Reference Table

Distribution	Type	Parameters	Mean	Variance	ML Use Case
Bernoulli	Discrete	p	p	p(1−p)	Single binary prediction
Binomial	Discrete	n, p	np	np(1−p)	Correct predictions in n test examples
Poisson	Discrete	λ	λ	λ	API errors per batch
Geometric	Discrete	p	1/p	(1−p)/p²	Trials until first improvement
Neg. Binomial	Discrete	r, p	r/p	r(1−p)/p²	Overdispersed counts
Uniform	Continuous	a, b	(a+b)/2	(b−a)²/12	Hyperparameter sampling
Normal	Continuous	μ, σ²	μ	σ²	Residuals, standardized features
Exponential	Continuous	λ	1/λ	1/λ²	Inter-event waiting times
Gamma	Continuous	k, θ	kθ	kθ²	Sum of k exponential waits
Log-Normal	Continuous	μ, σ²	exp(μ+σ²/2)	(exp(σ²)−1)×exp(2μ+σ²)	Latency, salary
Beta	Continuous	α, β	α/(α+β)	αβ/((α+β)²(α+β+1))	Accuracy priors, CTR
Weibull	Continuous	k, λ	λΓ(1+1/k)	λ²[Γ(1+2/k)−Γ(1+1/k)²]	Time to model degradation

Code: Computing PMF and PDF Probabilities

python

from scipy import stats
import numpy as np

# --- Discrete: Poisson ---
# errors_per_batch anchor: lambda = 0.8
lam = 0.8
print("Poisson(λ=0.8) PMF:")
for k in range(5):
    print(f"  P(errors={k}) = {stats.poisson.pmf(k, lam):.4f}")
print(f"  P(errors >= 2) = {1 - stats.poisson.cdf(1, lam):.4f}")
print()

# --- Continuous: Log-Normal ---
# latency_ms anchor: fit a log-normal
lat = np.array([22, 25, 27, 23, 29, 31, 26, 24, 28, 67])
mu_log = np.mean(np.log(lat))
sigma_log = np.std(np.log(lat), ddof=1)
print(f"Log-Normal fit: mu_log={mu_log:.3f}, sigma_log={sigma_log:.3f}")
p_under_30 = stats.lognorm.cdf(30, s=sigma_log, scale=np.exp(mu_log))
print(f"P(latency < 30ms) = {p_under_30:.3f}")
print()

# --- Poisson assumption check: mean ≈ variance? ---
errors = np.array([0, 1, 0, 2, 0, 1, 0, 0, 3, 1])
print(f"Error counts — mean: {errors.mean():.2f}, variance: {errors.var(ddof=1):.2f}")
print(f"Equidispersion ratio (var/mean): {errors.var(ddof=1)/errors.mean():.2f}  (1.0 = perfect Poisson)")

text

Poisson(λ=0.8) PMF:
  P(errors=0) = 0.4493
  P(errors=1) = 0.3595
  P(errors=2) = 0.1438
  P(errors=3) = 0.0383
  P(errors=4) = 0.0077
  P(errors >= 2) = 0.1912

Log-Normal fit: mu_log=3.352, sigma_log=0.318
P(latency < 30ms) = 0.562

Error counts — mean: 0.80, variance: 1.07
Equidispersion ratio (var/mean): 1.33  (1.0 = perfect Poisson)

The variance/mean ratio of 1.33 (vs ideal 1.0) suggests mild overdispersion. With only 10 batches that estimate is noisy, but Negative Binomial is worth considering if the ratio persists with more data.

The previous post covers PMF, PDF, and CDF mechanics in detail — the tools you apply once you've chosen a family. Each distribution listed here has its own dedicated post with full derivations, parameter fitting, assumption checking, and ML applications. The relationships shown in the diagram become practically important when: you're approximating a Binomial with Poisson (for rare events at large n), using the Gamma-Chi-square connection in hypothesis tests, or exploiting Beta-Binomial conjugacy in Bayesian inference.

Honest Limitations

Real data rarely fits a single standard distribution perfectly. Zero-inflated count data (many exact zeros plus some large counts) fits neither Poisson nor Negative Binomial — it needs a zero-inflated variant. Bimodal data (two distinct clusters) needs a mixture model, not a single distribution. The distributions covered here are the first choices; when EDA shows they don't fit, the correct move is to test alternatives or use non-parametric methods rather than forcing a fit.

Test Your Understanding

Your team tracks model inference failures per hour. You observe values {0, 0, 1, 0, 3, 0, 1, 2, 0, 1}. Which distribution family does this suggest? Compute the sample mean and variance. Is the equidispersion assumption satisfied?
The latency_ms anchor has one outlier at 67ms with the rest clustered between 22–31ms. Why does Log-Normal fit better than Normal? What transformation would you apply to check the fit?
A colleague says "I'll model accuracy scores from 50 training runs with a Uniform distribution because accuracy is bounded between 0 and 1." What's wrong with this reasoning? What would you use instead and why?
The Binomial(n=1000, p=0.003) should behave like Poisson(λ) for some λ. What is λ? Under what conditions does this approximation work well?
A model's time-to-failure in production is monitored. Early deployments fail frequently; later deployments are more robust. Which distribution models this pattern better than Exponential, and what parameter value explains the early-failure pattern?
