← View series: statistics
~/blog
Gamma Distribution
The Exponential distribution models the time until the first event. When you need the time until the k-th event — the total duration of k consecutive training jobs, or the wait for k errors to accumulate — you need the Gamma distribution. It is the direct generalization of the Exponential and the foundation for the Chi-squared and Erlang distributions.
Motivation: Sum of Exponentials
Suppose each training job takes a random amount of time T_i ~ Exponential(λ=0.5 jobs/hour) — meaning an average of 1/λ = 2 hours per job. You want to know the total time for 5 consecutive jobs:
T = T₁ + T₂ + T₃ + T₄ + T₅
Each T_i independently follows the Exponential distribution. Their sum follows the Gamma distribution:
T ~ Gamma(k=5, λ=0.5)
More generally: if T₁, T₂, ..., T_k ~ Exponential(λ) independently, then T₁+...+T_k ~ Gamma(k, λ).
Parameter Conventions
The Gamma distribution has two common parameterizations. Always know which is in use:
| Convention | Parameters | Mean | Variance | Used in |
|---|---|---|---|---|
| Shape-Rate | k, λ (rate) | k/λ | k/λ² | Bayesian statistics, Poisson conjugacy |
| Shape-Scale | α, θ=1/λ (scale) | αθ | αθ² | Engineering, scipy.stats.gamma |
scipy uses Shape-Scale: scipy.stats.gamma(a=k, scale=θ) where θ = 1/λ.
Anchor: k=5, λ=0.5, so θ = 1/0.5 = 2 hours per job. Use scipy.stats.gamma(a=5, scale=2).
The Gamma Function
The PDF involves Γ(k) — the Gamma function, which generalizes factorial to non-integers:
Γ(k+1) = k × Γ(k) (recurrence relation)
Γ(n) = (n−1)! for positive integers
Γ(1/2) = √π (special value)
| k | Γ(k) |
|---|---|
| 1 | Γ(1) = 1 = 0! |
| 2 | Γ(2) = 1 = 1! |
| 3 | Γ(3) = 2 = 2! |
| 4 | Γ(4) = 6 = 3! |
| 5 | Γ(5) = 24 = 4! |
For the anchor: Γ(5) = 24. This is the normalization constant in the PDF denominator.
f(x; k, λ) = λ^k × x^{k−1} × e^{−λx} / Γ(k) for x ≥ 0
Unpacking each component:
- λ^k / Γ(k): normalization constant ensuring ∫₀^∞ f(x)dx = 1
- x^{k−1}: polynomial factor — at k=1 this is x⁰=1 (no polynomial), at k=5 it is x⁴ (creates a peak)
- e^{−λx}: exponential decay ensuring the integral is finite
Connection to Exponential: when k=1, x^{k−1}=1 and Γ(1)=1, so f(x) = λe^{−λx} — exactly the Exponential PDF. Gamma(1, λ) = Exponential(λ).
Compute f(10) on anchor (k=5, λ=0.5):
f(10) = 0.5⁵ × 10⁴ × e^{−5} / 24 = 0.03125 × 10000 × 0.00674 / 24 = 312.5 × 0.00674 / 24 ≈ 0.0878
Compute f(15):
f(15) = 0.03125 × 15⁴ × e^{−7.5} / 24 = 0.03125 × 50625 × 0.000553 / 24 ≈ 0.0366
CDF
No closed form — uses the regularized incomplete Gamma function:
F(x; k, λ) = γ(k, λx) / Γ(k)
where γ(k, λx) = ∫₀^{λx} t^{k−1}e^{−t}dt is the lower incomplete Gamma function. Computed numerically in scipy.
Anchor queries (k=5, λ=0.5, scale=2):
| Query | Result | Interpretation |
|---|---|---|
| P(T ≤ 10) | 0.5595 | 56% chance all 5 jobs finish within 10 hours |
| P(T > 12) | 0.2851 | 29% chance it takes more than 12 hours |
| Median | 9.67 h | Half the time, all 5 jobs finish within 9.67 h |
| 95th percentile | 16.81 h | Need to budget 16.8 h to cover 95% of scenarios |
Mean, Variance, and Skewness
Mean: E[T] = k/λ = k × (mean of Exponential)
Derivation from sum structure: T is the sum of k independent Exponentials, each with mean 1/λ. By linearity of expectation:
E[T] = E[T₁] + ... + E[T_k] = k × (1/λ) = k/λ
Anchor: E[T] = 5/0.5 = 10 hours. This is the peak of the mode? No — the mode is at (k−1)/λ = 4/0.5 = 8 hours. The mean is pulled right by the positive tail.
Variance: Var(T) = k/λ²
Derivation from independence: variance of a sum of independent variables is the sum of variances:
Var(T) = Var(T₁) + ... + Var(T_k) = k × (1/λ²) = k/λ²
Anchor: Var(T) = 5/0.25 = 20, SD = √20 ≈ 4.47 hours
Skewness: 2/√k
Anchor (k=5): skewness = 2/√5 ≈ 0.894 — moderately right-skewed.
For k=20: skewness = 2/√20 ≈ 0.447 — nearly symmetric.
As k → ∞: skewness → 0. This is the CLT in action — Gamma(k, λ) is the sum of k IID Exponentials, so it converges to Normal(k/λ, k/λ²) by the CLT. The Gamma PDF panels above show this convergence.
Special Cases
| Parameters | Special distribution | Relationship |
|---|---|---|
| Gamma(1, λ) | Exponential(λ) | Simplest case: single event |
| Gamma(k, 1/2) | Chi-squared(2k) | Chi-sq is a scaled Gamma |
| Gamma(k/2, 1/2) | Chi-squared(k) | Standard Chi-sq parameterization |
| Gamma(k=integer, λ) | Erlang(k, λ) | Integer-shape special case |
Chi-squared connection (preview for spec 34): X₁², X₂², ..., X_k² where Xᵢ ~ N(0,1). Their sum X₁²+...+X_k² ~ χ²(k). This χ²(k) distribution is exactly Gamma(k/2, 1/2). Equivalently: Gamma(α, β) = (2/λ) × χ²(2α) in distribution. The Gamma is the parent distribution of the Chi-squared.
Erlang connection: the Erlang distribution is Gamma with integer shape, used in queuing theory. For our anchor, T ~ Erlang(5, 0.5) — an integer number of processing stages.
Gamma Function Deep Dive
Three properties worth knowing:
-
Recurrence: Γ(k+1) = k × Γ(k). This lets you compute any Γ by stepping down to Γ(1)=1.
-
Non-integer values: Γ(1/2) = √π ≈ 1.772. This is what appears in the Normal distribution's normalization constant: 1/√(2π) comes from the Gamma function evaluated at 1/2.
-
Stirling's approximation: log Γ(k) ≈ k log k − k for large k — used when computing the PDF in log space to avoid overflow.
For the anchor normalization: Γ(5) = 4! = 24. The PDF integrates to:
∫₀^∞ λ^k x^{k−1} e^{−λx}/Γ(k) dx = (λ^k/Γ(k)) × Γ(k)/λ^k = 1 ✓
Conjugate Prior for Poisson Rate
The Gamma distribution is the conjugate prior for the Poisson rate parameter λ:
Prior: λ ~ Gamma(α, β) Likelihood: X₁, ..., X_n ~ Poisson(λ) — observing event counts Posterior: λ | data ~ Gamma(α + Σxᵢ, β + n)
Update rule (using shape-rate parameterization):
- α_posterior = α_prior + total observed events
- β_posterior = β_prior + number of time periods observed
- Posterior mean = α_posterior / β_posterior
Concrete example: ML training cluster has an error rate. Prior belief: λ ~ Gamma(3, 1) — like having seen 3 errors in 1 hour (mean = 3 errors/hour). Observe 15 errors over 4 hours.
Posterior: λ | data ~ Gamma(3+15, 1+4) = Gamma(18, 5)
Posterior mean = 18/5 = 3.6 errors/hour
Compare to frequentist estimate: λ̂ = total errors / total time = 15/4 = 3.75 errors/hour. The Bayesian estimate shrinks toward the prior mean of 3.0, weighted by the relative strength of prior vs data.
Code
from scipy import stats
import numpy as np
# Anchor: k=5 jobs, lambda=0.5 jobs/hour → scale=2 hours/job
k = 5
lam = 0.5
scale = 1 / lam # = 2.0
dist = stats.gamma(a=k, scale=scale)
# Mean, variance, skewness
mean_theory = k / lam
var_theory = k / lam**2
skewness = 2 / np.sqrt(k)
print(f"Gamma({k}, λ={lam}) → scale={scale}")
print(f"Mean: {mean_theory:.2f} h (E[T]=k/λ)")
print(f"Variance: {var_theory:.2f} (Var=k/λ²) SD={var_theory**0.5:.4f}")
print(f"Skewness: {skewness:.4f} (= 2/√k)")
# CDF queries
print(f"\nP(T ≤ 10) = {dist.cdf(10):.4f}")
print(f"P(T > 12) = {dist.sf(12):.4f}")
print(f"Median: {dist.median():.2f} h")
print(f"95th pct: {dist.ppf(0.95):.2f} h")
# PDF at specific points
print(f"\nf(10) = {dist.pdf(10):.4f}")
print(f"f(15) = {dist.pdf(15):.4f}")
# Special case: Gamma(1, 0.5) = Exponential(0.5)
exp_via_gamma = stats.gamma(a=1, scale=2)
exp_direct = stats.expon(scale=2)
print(f"\nGamma(1,2) CDF at x=4: {exp_via_gamma.cdf(4):.4f}")
print(f"Expon(2) CDF at x=4: {exp_direct.cdf(4):.4f} (should match)")
# Conjugate prior for Poisson rate
alpha_prior, beta_prior = 3, 1 # Gamma(3,1) prior (shape-rate)
observed_events, periods = 15, 4
alpha_post = alpha_prior + observed_events
beta_post = beta_prior + periods
print(f"\nPoisson conjugacy:")
print(f"Prior: Gamma({alpha_prior},{beta_prior}) → mean rate = {alpha_prior/beta_prior:.1f} errors/h")
print(f"After {observed_events} errors in {periods} hours:")
print(f"Posterior: Gamma({alpha_post},{beta_post}) → mean rate = {alpha_post/beta_post:.2f} errors/h")
print(f"Frequentist MLE: {observed_events/periods:.2f} errors/h")Gamma(5, λ=0.5) → scale=2.0
Mean: 10.00 h (E[T]=k/λ)
Variance: 20.00 (Var=k/λ²) SD=4.4721
Skewness: 0.8944 (= 2/√k)
P(T ≤ 10) = 0.5595
P(T > 12) = 0.2851
Median: 9.67 h
95th pct: 16.81 h
f(10) = 0.0878
f(15) = 0.0366
Gamma(1,2) CDF at x=4: 0.8647
Expon(2) CDF at x=4: 0.8647 (should match)
Poisson conjugacy:
Prior: Gamma(3,1) → mean rate = 3.0 errors/h
After 15 errors in 4 hours:
Posterior: Gamma(18,5) → mean rate = 3.60 errors/h
Frequentist MLE: 3.75 errors/h
ML Applications
1 — Training pipeline scheduling. When k pipeline stages (data preprocessing, training, evaluation, upload) each take Exponential time, the total job duration is Gamma distributed. This lets you compute P(pipeline finishes within SLA) and set timeout policies.
2 — Bayesian Poisson rate estimation. Error rates, click rates, and API call rates are Poisson processes. The Gamma prior updates analytically as new observations arrive — useful in online monitoring systems that must update beliefs without reprocessing all historical data.
3 — Erlang queuing models. In M/Erlang queues, service times follow Erlang(k, λ) = Gamma(k, λ) with integer k. This models services that require k sequential phases, each Exponential — more realistic than a single Exponential when there are multiple processing stages (e.g., inference with preprocessing + model forward pass + post-processing).
4 — Gamma priors in Bayesian regression. In Bayesian linear regression, placing a Gamma prior on the precision (inverse variance) τ = 1/σ² yields a Gamma-Normal conjugate model. The posterior for τ is also Gamma, giving closed-form expressions for the posterior predictive distribution.
Related Concepts
- Exponential distribution: Gamma(1, λ) — the single-event special case
- Chi-squared distribution: Gamma(k/2, 1/2) — used for sum of squared standard Normal variables
- Beta distribution: like the Beta, the Gamma appears as a conjugate prior in Bayesian inference, but for rate rather than proportion parameters
- Poisson distribution: the Gamma is the conjugate prior for the Poisson rate parameter
Limitations
- Sum-of-Exponentials interpretation requires i.i.d.: the Gamma(k, λ) decomposition assumes all k stages have the same rate λ. If job times have different rates (pre-processing is faster than training), the sum is a hypo-exponential distribution, not Gamma.
- Integer k restriction for Erlang: for queuing applications, only integer k makes sense physically. Non-integer α shapes are mathematical but have no physical sum-of-Exponentials interpretation.
- scipy parameterization trap:
scipy.stats.gamma(a=5, scale=2)uses scale (not rate). Always verify:dist.mean()should equal k × scale = 5 × 2 = 10. If it doesn't, the parameters are swapped.
Test Your Understanding
-
Five data processing jobs have Exponential(λ=0.5) durations. What is the probability that all five finish within 8 hours? Use the CDF formula and state what scipy call you would use.
-
Compare Gamma(5, λ=0.5) with Gamma(20, λ=0.5). Both have the same λ but different k. Compute the mean, variance, and skewness for each. How does doubling k from 10 to 20 affect the coefficient of variation (CV = SD/mean)?
-
Prove that Gamma(1, λ) = Exponential(λ) by substituting k=1 into the Gamma PDF formula and simplifying. What values do Γ(1) and x^{k−1} take?
-
The Poisson conjugacy example used a Gamma(3, 1) prior and observed 15 errors in 4 hours. What would the posterior mean be if you used an uninformative prior Gamma(0.001, 0.001) instead? As n → ∞, does the prior choice matter?
-
You are told a service's end-to-end latency is Gamma(k=3, λ=2) milliseconds (three sequential Exponential stages, each with mean 0.5 ms). Your SLA requires P(latency > 3 ms) < 0.05. Does this service meet the SLA? Compute the answer using the survival function.