Beta Distribution

Apr 13, 2026 · 13 min read · By Mohammed Vasim
Statistics · Math · Data Science

Every distribution we have seen so far models data: counts, durations, measurements. The Beta distribution models something different — it models a probability itself. When you want to express uncertainty about a model's accuracy, a click-through rate, or a conversion rate, you need a distribution whose support is [0, 1]. That distribution is Beta.

Why We Need a Distribution for Probabilities

Three concrete motivations:

  1. Bayesian inference: when observations are Binomial(n, p), you want a prior over p, that is, a probability distribution over a probability. The Beta distribution is the conjugate prior for the Binomial parameter p, which means the posterior is also a Beta. The math stays closed-form.

  2. Proportion modeling: accuracy, precision, recall, click-through rate — all are proportions bounded in [0, 1]. After observing finite test data, you have a point estimate but also uncertainty around it. The Beta captures that uncertainty.

  3. Beta-Binomial model: when the proportion p itself varies across instances (different test sets, different users), you model p ~ Beta(α, β) and each instance generates Binomial data. This extends the plain Binomial model to heterogeneous populations, where counts are more variable than a single shared p would allow.
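The extra variability in motivation 3 can be made concrete with a small numeric comparison. This is only a sketch: the values n = 50 and Beta(8.5, 1.5) are hypothetical, chosen so that the average p is 0.85, and `scipy.stats.betabinom` is assumed to be available (SciPy 1.4 or later).

```python
from scipy import stats

n = 50                 # trials per instance (hypothetical)
a, b = 8.5, 1.5        # Beta(a, b) over p, with mean a / (a + b) = 0.85 (hypothetical)

# Counts when p varies across instances vs. counts with one shared p.
bb = stats.betabinom(n, a, b)
plain = stats.binom(n, a / (a + b))

print(f"mean: beta-binomial={bb.mean():.2f}, binomial={plain.mean():.2f}")
print(f"var : beta-binomial={bb.var():.2f}, binomial={plain.var():.2f}")
```

Both models have the same mean count, but the Beta-Binomial variance is larger because p is allowed to differ from instance to instance.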

The DS/ML Anchor

A classifier is evaluated on 100 test documents. It gets 85 correct.

k = 85 (correct predictions)
n = 100 (total test cases)
p̂ = k/n = 0.85 (MLE of accuracy)
Prior: Beta(1, 1) (uniform, no prior knowledge)
Posterior: Beta(α=86, β=16) (after observing 85 correct, 15 incorrect)

The posterior encodes: "given this data, here is the probability distribution over the model's true accuracy p."

PDF

f(p; α, β) = p^{α−1} × (1−p)^{β−1} / B(α, β) for p ∈ [0, 1]

where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the beta function — the normalization constant ensuring ∫₀¹ f(p)dp = 1.
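As a quick sanity check, the density can be evaluated straight from the formula and compared with SciPy. This is a minimal sketch: the helper name `beta_pdf` is hypothetical, and the log-gamma route is used only to avoid overflow when α and β are large.

```python
import math
from scipy import stats

def beta_pdf(p, a, b):
    """Beta(a, b) density at p in (0, 1), computed from the formula above."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_f = (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_B
    return math.exp(log_f)

manual = beta_pdf(0.85, 86, 16)
library = stats.beta.pdf(0.85, 86, 16)
print(f"manual  = {manual:.6f}")
print(f"library = {library:.6f}")
```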

Why this shape? The numerator p^{α−1} × (1−p)^{β−1} is exactly proportional to the Binomial likelihood of observing α−1 successes and β−1 failures:

L(p | k=α−1 successes, n−k=β−1 failures) ∝ p^{α−1}(1−p)^{β−1}

The Beta distribution is built to look like a Binomial likelihood. This is precisely why it is conjugate to the Binomial: multiplying two functions of the same form gives another function of the same form.

How α and β control shape:

| α, β | Shape | Meaning |
|---|---|---|
| α = β = 1 | Flat (uniform) | Complete ignorance about p |
| α = β > 1 | Symmetric, peak at 0.5 | Believe p ≈ 0.5, with confidence growing as α and β increase |
| α > β (both > 1) | Left-skewed (longer tail toward 0), peak > 0.5 | More successes than failures observed |
| α < β (both > 1) | Right-skewed (longer tail toward 1), peak < 0.5 | More failures than successes observed |
| α, β < 1 | U-shaped (bimodal) | Believe p is near 0 or near 1, not in the middle |
| α = 86, β = 16 | Sharp peak near 0.85 | Strong evidence from 100 observations |
[Figure: Beta PDF shapes, showing how α and β control the distribution. Panels: α=1, β=1 (uniform); α=2, β=2 (symmetric); α=5, β=2 (peak at p=0.8); α=2, β=5 (peak at p=0.2); α=0.5, β=0.5 (U-shaped, minimum at p=0.5); α=86, β=16 (anchor, mode=0.85).]
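The direction of skew is easy to get backwards, so it helps to check the sign numerically. The sketch below uses `scipy.stats.beta.stats` on the same (α, β) pairs as the panels above. Negative skewness means the longer tail points toward 0, which is what happens when the peak sits above 0.5.

```python
from scipy import stats

shapes = [(1, 1), (2, 2), (5, 2), (2, 5), (0.5, 0.5), (86, 16)]
skews = {}
for a, b in shapes:
    # moments="mvs" requests mean, variance, and skewness in that order
    mean, var, skew = stats.beta.stats(a, b, moments="mvs")
    skews[(a, b)] = float(skew)
    print(f"Beta({a}, {b}): mean={float(mean):.3f}, var={float(var):.4f}, skew={float(skew):+.3f}")
```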

Mean, Mode, and Variance

All three computed on the anchor Beta(86, 16):

Mean: E[p] = α/(α+β) = 86/102 ≈ 0.843

This is the fraction of total pseudo-observations that are successes. With a uniform prior, it is the Bayesian estimate that shrinks slightly toward 0.5 compared to the MLE.

Mode: (α−1)/(α+β−2) = 85/100 = 0.850

The mode of the posterior equals k/n = 0.85 — exactly the MLE. This is a general result: under a uniform prior, the MAP estimate equals the MLE. The prior Beta(1,1) contributes 0 pseudo-observations to the mode calculation.

Variance: αβ/[(α+β)²(α+β+1)] = 86×16/(102²×103) = 1376/1071612 ≈ 0.001284

Standard deviation ≈ 0.0358. As α+β = n+2 grows, variance shrinks — more data → tighter posterior.
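The closed-form mean and variance can be cross-checked by drawing samples from Beta(86, 16). This is a sketch under arbitrary choices (the seed and the sample size are not from the article). The sampled values should agree with the formulas to about three significant figures.

```python
import numpy as np

a, b = 86, 16
rng = np.random.default_rng(0)            # fixed seed, arbitrary choice
samples = rng.beta(a, b, size=200_000)

mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))
mode_formula = (a - 1) / (a + b - 2)      # valid because a > 1 and b > 1

print(f"mean: formula={mean_formula:.4f}, sampled={samples.mean():.4f}")
print(f"var : formula={var_formula:.6f}, sampled={samples.var():.6f}")
print(f"mode: formula={mode_formula:.4f}")
```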

[Figure: Beta(86, 16), the posterior for model accuracy, with the mean (0.843), the mode (0.850), and the 95% credible interval marked.]

The Conjugate Prior and Bayesian Update

  • Prior: Beta(α₀, β₀) encodes prior belief about p.
  • Likelihood: Binomial, with k successes observed in n trials.
  • Posterior: Beta(α₀+k, β₀+n−k), computed analytically with no numerical integration.

Derivation from Bayes' theorem:

posterior(p) ∝ likelihood × prior

= p^k(1−p)^{n−k} × p^{α₀−1}(1−p)^{β₀−1}

= p^{(α₀+k)−1} × (1−p)^{(β₀+n−k)−1}

This is the kernel of Beta(α₀+k, β₀+n−k). The update rule is just addition:

α_posterior = α_prior + k (add observed successes)

β_posterior = β_prior + (n−k) (add observed failures)
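One way to convince yourself that the addition rule really is Bayes' theorem is to do the update by brute force on a grid and compare. A minimal sketch, assuming a midpoint grid of 2000 points (an arbitrary resolution); the helper name `update` is hypothetical.

```python
import numpy as np
from scipy import stats

def update(alpha0, beta0, k, n):
    """Conjugate update: add successes to alpha, failures to beta."""
    return alpha0 + k, beta0 + (n - k)

alpha0, beta0, k, n = 1, 1, 85, 100
alpha1, beta1 = update(alpha0, beta0, k, n)

# Brute-force Bayes: prior density times Binomial likelihood, renormalised on a grid.
p = (np.arange(2000) + 0.5) / 2000        # midpoints of 2000 equal cells on (0, 1)
dp = 1 / 2000
unnorm = stats.beta.pdf(p, alpha0, beta0) * stats.binom.pmf(k, n, p)
grid_posterior = unnorm / (unnorm.sum() * dp)

closed_form = stats.beta.pdf(p, alpha1, beta1)
max_gap = float(np.max(np.abs(grid_posterior - closed_form)))
print(f"Posterior: Beta({alpha1}, {beta1}); largest grid vs closed-form gap = {max_gap:.2e}")
```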

Sequential Updating (three steps)

  1. Prior: Beta(1, 1) — uniform. E[p] = 0.5. SD = 0.289. "I know nothing."
  2. After 40 correct out of 50: Beta(41, 11). E[p] = 41/52 ≈ 0.788. SD = 0.056.
  3. After 85 correct out of 100 total (45 of the next 50 correct): Beta(86, 16). E[p] = 86/102 ≈ 0.843. SD = 0.036.

Each batch of data sharpens and shifts the posterior.
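The three steps can be replayed in a few lines. The sketch assumes the second batch is 45 correct out of 50, which is what the totals imply. The helper names `update` and `sd` are hypothetical.

```python
import math

def update(a, b, successes, failures):
    """Conjugate update: add counts to the Beta parameters."""
    return a + successes, b + failures

def sd(a, b):
    """Standard deviation of Beta(a, b)."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

state = (1, 1)                        # uniform prior
batches = [(40, 10), (45, 5)]         # (correct, incorrect) per batch of 50
history = [state]
for s, f in batches:
    state = update(*state, s, f)
    history.append(state)

for a, b in history:
    print(f"Beta({a}, {b}): mean={a / (a + b):.3f}, SD={sd(a, b):.3f}")

# A single update with all 100 observations lands on the same posterior.
assert state == update(1, 1, 85, 15)
```

The order and grouping of the batches do not matter: only the running totals of successes and failures enter the posterior.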

[Figure: Sequential Bayesian updating. Curves: prior Beta(1, 1); Beta(41, 11) after 40/50 correct; Beta(86, 16) after 85/100 correct. As data accumulates, the posterior narrows and shifts toward 0.85.]

α and β as Pseudo-Counts

The most practical way to understand the Beta distribution:

  • Beta(α, β) is equivalent to having observed α−1 successes and β−1 failures before seeing any data.
  • Beta(1, 1): 0 successes + 0 failures observed — complete ignorance. This is the uniform prior.
  • Beta(10, 3): equivalent to having seen 9 successes and 2 failures — you expect p to be high.
  • Beta(86, 16): equivalent to having seen 85 successes and 15 failures — the posterior after our experiment.

When you choose a prior Beta(α₀, β₀), you are saying "I am as confident in my prior belief as I would be if I had already observed α₀−1 successes and β₀−1 failures." This makes priors concrete and interpretable.

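Laplace (add-one) smoothing in Naive Bayes corresponds to the MAP estimate under a Beta(2, 2) prior: adding 1 pseudo-success and 1 pseudo-failure gives (k+1)/(n+2), which avoids zero probabilities for unseen words. The same number is the posterior mean under Beta(1, 1). With more than two categories, the matching prior is a Dirichlet.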
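The link between add-one smoothing and Beta pseudo-counts can be checked with exact fractions. This is a small sketch for the two-outcome case only. The function names are hypothetical, and the (k, n) pairs are arbitrary examples.

```python
from fractions import Fraction

def laplace(k, n):
    """Add-one smoothing for a binary outcome: (k + 1) / (n + 2)."""
    return Fraction(k + 1, n + 2)

def posterior_mean(k, n, a0, b0):
    return Fraction(a0 + k, a0 + b0 + n)

def posterior_mode(k, n, a0, b0):
    # Valid when both posterior parameters exceed 1.
    return Fraction(a0 + k - 1, a0 + b0 + n - 2)

for k, n in [(0, 10), (3, 10), (85, 100)]:
    assert laplace(k, n) == posterior_mode(k, n, 2, 2)   # MAP under Beta(2, 2)
    assert laplace(k, n) == posterior_mean(k, n, 1, 1)   # mean under Beta(1, 1)
    print(f"k={k}, n={n}: smoothed estimate = {laplace(k, n)} ≈ {float(laplace(k, n)):.3f}")
```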

Bayesian Credible Interval vs Frequentist CI

Both intervals attempt to bound the true accuracy p, given the estimate p̂ = 0.85 from 100 observations.

Frequentist 95% CI (Wilson score method):

center = (p̂ + z²/(2n)) / (1 + z²/n) ≈ 0.837

margin = z × √[p̂(1−p̂)/n + z²/(4n²)] / (1 + z²/n) ≈ 0.070

Wilson 95% CI: [0.767, 0.907]

Bayesian 95% credible interval from Beta(86, 16):

approximately [0.77, 0.91] (from dist.ppf(0.025) and dist.ppf(0.975))

Both are similar for large n. The interpretations are fundamentally different:

| | Frequentist CI (Wilson) | Bayesian credible interval |
|---|---|---|
| Meaning | 95% of such intervals contain the true p (over repeated experiments) | P(p ∈ interval, given the data) = 0.95 |
| p is | Fixed, unknown constant | Random variable with a distribution |
| Natural for | Communicating to frequentist colleagues | Decision-making ("what is the probability accuracy exceeds 0.80?") |

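The Bayesian posterior directly answers: "How probable is it that accuracy exceeds 0.80?" Answer: P(p > 0.80) = dist.sf(0.80) ≈ 0.881. The posterior mean sits only about 1.2 standard deviations above 0.80, so a noticeable share of the probability mass still lies below it.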
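To see how closely the two intervals track each other as n changes, the comparison can be repeated at several sample sizes with the observed proportion held at 0.85. A minimal sketch: the helper names `wilson` and `credible` are hypothetical, the sample sizes are arbitrary, and a uniform Beta(1, 1) prior is assumed throughout.

```python
import math
from scipy import stats

def wilson(k, n, z=1.96):
    """Wilson score interval for k successes in n trials."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

def credible(k, n, a0=1, b0=1, level=0.95):
    """Equal-tailed credible interval from the Beta(a0 + k, b0 + n - k) posterior."""
    tail = (1 - level) / 2
    post = stats.beta(a0 + k, b0 + n - k)
    return float(post.ppf(tail)), float(post.ppf(1 - tail))

gaps = {}
for n in (20, 100, 1000):
    k = round(0.85 * n)
    w, c = wilson(k, n), credible(k, n)
    # Largest endpoint difference between the two intervals at this n.
    gaps[n] = max(abs(w[0] - c[0]), abs(w[1] - c[1]))
    print(f"n={n:4d}: Wilson=[{w[0]:.3f}, {w[1]:.3f}]  "
          f"credible=[{c[0]:.3f}, {c[1]:.3f}]  max gap={gaps[n]:.4f}")
```

The numbers agree closely, but the interpretation of each interval stays different no matter how large n gets.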

Code

```python
from scipy import stats
import numpy as np

# Anchor: 85 correct out of 100
k, n = 85, 100
alpha_prior, beta_prior = 1, 1           # uniform prior

# Posterior update
alpha_post = alpha_prior + k             # = 86
beta_post  = beta_prior + (n - k)        # = 16

dist = stats.beta(alpha_post, beta_post)

# Mean, mode, variance
mean = alpha_post / (alpha_post + beta_post)
mode = (alpha_post - 1) / (alpha_post + beta_post - 2)
var  = (alpha_post * beta_post) / ((alpha_post + beta_post)**2 * (alpha_post + beta_post + 1))

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Mean:      {mean:.4f}")
print(f"Mode:      {mode:.4f}  (= k/n = MLE)")
print(f"Variance:  {var:.6f}  (SD = {var**0.5:.4f})")

# CDF queries
print(f"\nP(p <= 0.90) = {dist.cdf(0.90):.4f}")
print(f"P(p > 0.80)  = {dist.sf(0.80):.4f}")
print(f"P(p > 0.90)  = {dist.sf(0.90):.4f}")

# 95% credible interval
lo, hi = dist.ppf(0.025), dist.ppf(0.975)
print(f"95% credible interval: [{lo:.4f}, {hi:.4f}]")

# Sequential updates
print("\nSequential updates:")
for k_obs, n_obs in [(0, 0), (40, 50), (85, 100)]:
    a = alpha_prior + k_obs
    b = beta_prior + (n_obs - k_obs)
    d = stats.beta(a, b)
    print(f"  Beta({a:2d},{b:2d}): mean={a/(a+b):.3f}, 95% CI=[{d.ppf(0.025):.3f}, {d.ppf(0.975):.3f}]")

# Comparison with frequentist Wilson CI
z = 1.96
n_f = n
p_hat = k / n                            # = 0.85
denom = 1 + z**2 / n_f
center = (p_hat + z**2/(2*n_f)) / denom
margin = z * np.sqrt(p_hat*(1-p_hat)/n_f + z**2/(4*n_f**2)) / denom
print(f"\nFrequentist Wilson 95% CI: [{center-margin:.4f}, {center+margin:.4f}]")
print(f"Bayesian 95% credible int: [{lo:.4f}, {hi:.4f}]")
```
Approximate output (Beta tail probabilities and quantiles are rounded here, so the script prints more digits than shown):

```text
Posterior: Beta(86, 16)
Mean:      0.8431
Mode:      0.8500  (= k/n = MLE)
Variance:  0.001284  (SD = 0.0358)

P(p <= 0.90) ≈ 0.957
P(p > 0.80)  ≈ 0.881
P(p > 0.90)  ≈ 0.043
95% credible interval: ≈ [0.77, 0.91]

Sequential updates:
  Beta( 1, 1): mean=0.500, 95% CI=[0.025, 0.975]
  Beta(41,11): mean=0.788, 95% CI ≈ [0.67, 0.89]
  Beta(86,16): mean=0.843, 95% CI ≈ [0.77, 0.91]

Frequentist Wilson 95% CI: [0.7672, 0.9069]
Bayesian 95% credible int: ≈ [0.77, 0.91]
```

ML Applications

1 — A/B Testing with Thompson Sampling. Two model variants each maintain a Beta posterior over their true accuracy. At each step, sample p_A ~ Beta(α_A, β_A) and p_B ~ Beta(α_B, β_B). Show the variant with the higher sample to the next user. As data accumulates, the winning variant's posterior concentrates and it gets shown more. This naturally balances exploration and exploitation without a fixed exploration rate.

2 — Model Calibration. A classifier reports 85% accuracy on a test set. But the true accuracy has uncertainty. The posterior Beta(86, 16) quantifies this: there is roughly a 12% probability that the true accuracy is below 0.80. This matters for deciding when to deploy, because a point estimate alone cannot answer "how confident am I that accuracy exceeds the threshold?"

3 — Beta-Binomial Regression. When each instance has its own probability pᵢ (heterogeneity across users, documents, or tasks), model pᵢ ~ Beta(α, β) and each instance generates Binomial(nᵢ, pᵢ) data. The Beta-Binomial marginal distribution (integrating out p) allows fitting such models by maximum likelihood.

4 — Dirichlet Distribution. Beta(α, β) = Dirichlet(α, β) for k=2 categories. The Dirichlet is the conjugate prior for Multinomial parameters (softmax outputs, topic proportions in LDA, class priors in Naive Bayes). Wherever the Beta is used for binary proportions, the Dirichlet generalizes it to k-category proportions.
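Returning to application 1, the Thompson Sampling loop is short enough to simulate. This is a sketch under stated assumptions: the true success rates (0.85 and 0.70), the number of rounds, and the seed are all hypothetical, and both variants start from a uniform Beta(1, 1) prior.

```python
import numpy as np

rng = np.random.default_rng(42)                    # fixed seed, arbitrary
true_rates = {"A": 0.85, "B": 0.70}                # hypothetical, unknown to the algorithm
posterior = {arm: [1, 1] for arm in true_rates}    # [alpha, beta] per variant
pulls = {arm: 0 for arm in true_rates}

for _ in range(3000):
    # Draw one plausible success rate per variant from its current posterior.
    draws = {arm: rng.beta(a, b) for arm, (a, b) in posterior.items()}
    chosen = max(draws, key=draws.get)
    # Simulate the outcome, then apply the conjugate update to the chosen variant.
    success = rng.random() < true_rates[chosen]
    posterior[chosen][0] += int(success)
    posterior[chosen][1] += int(not success)
    pulls[chosen] += 1

print(pulls)
print({arm: tuple(ab) for arm, ab in posterior.items()})
```

Early on both posteriors are wide, so either variant can win a draw. As evidence accumulates, the weaker variant's draws rarely come out on top and it is shown less often.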

Related Distributions

  • Binomial distribution: the Beta is the conjugate prior for Binomial p. Beta and Binomial form a conjugate pair.
  • Dirichlet distribution: generalization of Beta from 2 to k categories.
  • Gamma distribution: the Beta function B(α, β) = Γ(α)Γ(β)/Γ(α+β) is defined in terms of the Gamma function.
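The Beta-as-two-category-Dirichlet relationship can be checked pointwise. A minimal sketch, assuming `scipy.stats.dirichlet` accepts the pair (p, 1 − p) as a point on the simplex. The evaluation points are arbitrary and chosen to be exactly representable in floating point.

```python
import numpy as np
from scipy import stats

a, b = 86, 16
pairs = []
for p in (0.75, 0.8125, 0.875):
    beta_density = float(stats.beta.pdf(p, a, b))
    # A two-category Dirichlet evaluated at (p, 1 - p) with parameters (a, b).
    dirichlet_density = float(stats.dirichlet.pdf([p, 1 - p], [a, b]))
    pairs.append((beta_density, dirichlet_density))
    print(f"p={p}: beta={beta_density:.6f}, dirichlet={dirichlet_density:.6f}")
```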

Limitations

  • Unimodality assumption: Beta(α, β) with α, β > 1 is unimodal. Real accuracy distributions can have multiple peaks (different test conditions, distribution shift). A mixture of Betas is more flexible but less tractable.
  • Independence assumption: the conjugate update assumes the trials are i.i.d. Bernoulli with a common p. If test examples are correlated (same domain cluster), the posterior is overconfident because it treats dependent data as if it were independent.
  • Sensitivity to prior for small n: with n=5 instead of n=100, the choice between Beta(1,1) and Beta(2,2) matters substantially. Always verify with a prior sensitivity analysis.
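A prior sensitivity check can be as simple as recomputing the posterior mean under each candidate prior. In this sketch the small-sample case of 4 correct out of 5 is hypothetical (the text above only fixes n = 5), and `posterior_mean` is an illustrative helper name.

```python
def posterior_mean(k, n, a0, b0):
    """Mean of the Beta(a0 + k, b0 + n - k) posterior."""
    return (a0 + k) / (a0 + b0 + n)

gaps = {}
for k, n in [(4, 5), (85, 100)]:
    flat = posterior_mean(k, n, 1, 1)      # Beta(1, 1) prior
    mild = posterior_mean(k, n, 2, 2)      # Beta(2, 2) prior
    gaps[n] = abs(flat - mild)
    print(f"k={k}, n={n}: Beta(1,1) -> {flat:.3f}, Beta(2,2) -> {mild:.3f}, gap = {gaps[n]:.3f}")
```

With five observations the two priors disagree by several percentage points. With a hundred, the disagreement falls below one point.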

Test Your Understanding

  1. A classifier gets 34 correct out of 40 trials. Starting from Beta(1, 1), compute the posterior Beta(α, β), its mean, mode, and 95% credible interval. How do the mean and mode differ, and why does this gap shrink as n grows?

  2. You choose a prior Beta(5, 5) for model accuracy. Interpret this prior in terms of pseudo-counts. After 85 correct out of 100, write the posterior. How does the posterior mean compare to the case with a Beta(1,1) prior?

  3. Prove algebraically that the mode of the posterior under a uniform prior Beta(1,1) equals the MLE k/n. What would the mode equal if you used Beta(2, 2) as the prior instead?

  4. A/B test: variant A has Beta(50, 10) posterior; variant B has Beta(30, 20). Without code, which variant has the higher mean? Higher variance? Which would Thompson Sampling favor more on the next draw, and why might that be better than always showing the one with the higher mean?

  5. You are told: "The Bayesian 95% credible interval [0.765, 0.912] means that if you ran this experiment 100 times, approximately 95 of the resulting intervals would contain the true accuracy." Is this correct? If not, what is the correct interpretation, and what statement does the frequentist CI correctly make?
