Bernoulli Distribution

Apr 11, 2026 · 6 min read · By mohammed.vasim
Tags: Statistics, Math, Data Science

Binary decisions are everywhere in machine learning. Does this transaction pass the fraud filter? Does the user click the recommendation? Does the deployment smoke test pass? Each of these is a single trial with two outcomes — and the Bernoulli distribution is the mathematical tool that lets you reason about them precisely. It looks almost too simple to be worth naming, but it turns out to be the atomic unit of discrete probability: almost everything else in the discrete family — Binomial, Geometric, Negative Binomial — builds directly on it.

The DS/ML anchor

Throughout this post we'll work with a single concrete setting: a recommendation system that shows users a product. Each impression is a Bernoulli trial — the user either clicks (success, X = 1) or doesn't (failure, X = 0). Historical click-through rate (CTR) is 0.12, so p = 0.12 for each impression.

The Setup

A Bernoulli random variable X takes value 1 with probability p and value 0 with probability 1 − p. The PMF is:

P(X = x) = p^x · (1−p)^(1−x), for x ∈ {0, 1}

With p = 0.12:

When x = 1: P(X = 1) = p^1 · (1−p)^0 = 0.12
When x = 0: P(X = 0) = p^0 · (1−p)^1 = 0.88

Notation: X ~ Bernoulli(p).
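The PMF translates almost directly into code. The following is a minimal sketch in plain Python; the function name `bernoulli_pmf` is an illustrative choice, not part of any library.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for X ~ Bernoulli(p); zero outside the support {0, 1}."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if x not in (0, 1):
        return 0.0
    # Same form as the PMF above: p^x * (1 - p)^(1 - x)
    return p**x * (1 - p) ** (1 - x)


p_ctr = 0.12  # historical click-through rate from the running example
pmf_click = bernoulli_pmf(1, p_ctr)
pmf_no_click = bernoulli_pmf(0, p_ctr)
print(f"P(click) = {pmf_click:.2f}, P(no click) = {pmf_no_click:.2f}")
```

The two probabilities sum to 1, which is the only constraint a two-point PMF has to satisfy.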

PMF

The PMF has two bars — nothing more. For our recommendation system with p = 0.12:

[Figure: Bernoulli(p=0.12) PMF, click vs no-click per impression. Two bars: 0.88 at x = 0 (no click) and 0.12 at x = 1 (click).]

CDF

The CDF for a Bernoulli variable with p = 0.12 is a step function with two jumps:

[Figure: CDF of Bernoulli(0.12). F(x) = 0 for x < 0, steps up to F(0) = 0.88 at x = 0, then reaches 1.0 at x = 1.]
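The step function can be written as a three-branch piecewise rule. This is a small sketch with an illustrative helper name (`bernoulli_cdf`), evaluated at a few points on either side of the two jumps.

```python
def bernoulli_cdf(x: float, p: float) -> float:
    """F(x) = P(X <= x) for X ~ Bernoulli(p)."""
    if x < 0:
        return 0.0      # no mass below 0
    if x < 1:
        return 1.0 - p  # only the mass at 0 has accumulated
    return 1.0          # both outcomes are included


p_ctr = 0.12
cdf_values = {x: bernoulli_cdf(x, p_ctr) for x in (-0.5, 0, 0.5, 1, 2)}
for x, value in cdf_values.items():
    print(f"F({x}) = {value:.2f}")
```

The CDF is flat between the jumps because no probability mass sits strictly between 0 and 1.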

Expected Value and Variance

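For X ~ Bernoulli(p), E[X] = p and Var(X) = p(1 − p). With p = 0.12, that gives E[X] = 0.12 and Var(X) = 0.12 × 0.88 = 0.1056. If the system shows 1,000 impressions, the expected number of clicks is 1,000 × 0.12 = 120.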

Variance peaks at p = 0.5, where p(1 − p) = 0.25. At p = 0.12 the variance is 0.1056, well below that maximum: the outcome is fairly predictable (mostly no-click). At p = 0.5 (the coin-flip scenario), uncertainty is maximal.

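Derivation directly from the definition:

E[X] = 0 · (1−p) + 1 · p = p
E[X²] = 0² · (1−p) + 1² · p = p
Var(X) = E[X²] − (E[X])² = p − p² = p(1−p)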
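Both moments can also be checked numerically by summing over the two-point support, and a grid scan shows where the variance is largest. This is a sketch; `bernoulli_moments` and the 0.01 grid spacing are illustrative choices.

```python
def bernoulli_moments(p: float):
    """Return (E[X], Var(X)) by summing over the support {0, 1}."""
    pmf = {0: 1 - p, 1: p}
    mean = sum(x * prob for x, prob in pmf.items())
    second_moment = sum(x**2 * prob for x, prob in pmf.items())
    return mean, second_moment - mean**2


mean, var = bernoulli_moments(0.12)
print(f"E[X] = {mean:.4f}, Var(X) = {var:.4f}")

# Scan p over 0.00, 0.01, ..., 1.00 to find where the variance peaks.
grid = [i / 100 for i in range(101)]
p_at_max = max(grid, key=lambda p: bernoulli_moments(p)[1])
print(f"Variance is largest at p = {p_at_max}")
```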

Trace Table: CTR Calculation

Working through specific calculations for our recommendation system (p = 0.12):

| Quantity | Formula | Values | Result |
| --- | --- | --- | --- |
| P(click) | p^1 · (1−p)^0 | 0.12^1 · 0.88^0 | 0.12 |
| P(no click) | p^0 · (1−p)^1 | 0.12^0 · 0.88^1 | 0.88 |
| E[X] | p | 0.12 | 0.12 |
| Var(X) | p(1−p) | 0.12 × 0.88 | 0.1056 |

Entropy

The entropy of a Bernoulli variable measures uncertainty, in bits when the logarithm is base 2:

H(X) = −p × log₂(p) − (1−p) × log₂(1−p)

For our CTR of p = 0.12:

H(X) = −0.12 × log₂(0.12) − 0.88 × log₂(0.88) ≈ 0.529 bits

Compare to p = 0.5: H(X) = 1.0 bit (maximum uncertainty). The recommendation system with 12% CTR has less uncertainty — you can mostly predict "no click" and be right 88% of the time.
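The two entropy values can be reproduced with the standard library. This is a minimal sketch; the helper name `bernoulli_entropy_bits` is illustrative, and it uses the usual convention that a zero-probability outcome contributes nothing to the sum.

```python
import math


def bernoulli_entropy_bits(p: float) -> float:
    """Shannon entropy of Bernoulli(p) in bits; outcomes with zero mass are skipped."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


h_ctr = bernoulli_entropy_bits(0.12)
h_coin = bernoulli_entropy_bits(0.5)
print(f"H at p = 0.12: {h_ctr:.4f} bits")
print(f"H at p = 0.50: {h_coin:.4f} bits")
```

Using `math.log2` keeps the result in bits; a natural logarithm would give the same quantity in nats, smaller by a factor of ln 2.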

Connection to Binomial, Geometric, and Negative Binomial

Bernoulli is the atomic unit that generates three important distributions:

The Binomial distribution is the sum of n independent Bernoulli trials. If you show 200 impressions each with p = 0.12, the total click count follows Binomial(n=200, p=0.12). You're aggregating individual Bernoulli outcomes into a total.
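As a sketch of that aggregation, the Binomial PMF can be built from the Bernoulli pieces using `math.comb` to count the arrangements of k clicks among n impressions. The function name is illustrative.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 200, 0.12
expected_clicks = n * p            # sum of n Bernoulli means
variance_clicks = n * p * (1 - p)  # sum of n independent Bernoulli variances
total_mass = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"Expected clicks = {expected_clicks:.2f}")
print(f"Variance        = {variance_clicks:.2f}")
print(f"PMF sums to     = {total_mass:.6f}")
```

The mean and variance are n times the Bernoulli values because the trials are independent and share the same p.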

The Geometric distribution counts how many impressions you need until the first click. Each impression is a Bernoulli trial; the Geometric distribution gives the probability that the first success occurs on exactly the k-th trial. P(first click on impression k) = (1−p)^(k−1) · p.
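A short sketch of that formula, with an illustrative function name, evaluated for the first few impressions:

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(first success occurs on trial k): k - 1 failures followed by one success."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p


p_ctr = 0.12
first_three = [geometric_pmf(k, p_ctr) for k in (1, 2, 3)]
mean_wait = 1 / p_ctr  # expected number of impressions until the first click

for k, prob in zip((1, 2, 3), first_three):
    print(f"P(first click on impression {k}) = {prob:.6f}")
print(f"Expected impressions until first click = {mean_wait:.2f}")
```

Each extra impression multiplies the probability by 0.88, so the PMF decays geometrically, which is where the name comes from.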

The Negative Binomial distribution generalizes Geometric: it counts how many impressions you need until r clicks occur. If you're trying to collect 10 clicks for a significance test, Negative Binomial tells you the probability of needing exactly k impressions.
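In the trials-counting form described here, the r-th click lands on impression k when the first k − 1 impressions contain exactly r − 1 clicks and impression k is itself a click. The sketch below implements that form with an illustrative function name.

```python
import math


def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    """P(the r-th success occurs on trial k), counting total trials."""
    if k < r:
        return 0.0  # r successes need at least r trials
    return math.comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


r, p = 10, 0.12
expected_impressions = r / p  # r back-to-back geometric waiting times
total_mass = sum(neg_binomial_pmf(k, r, p) for k in range(r, 1001))

print(f"Expected impressions for {r} clicks = {expected_impressions:.2f}")
print(f"PMF sums to = {total_mass:.6f}")
```

Setting r = 1 recovers the Geometric PMF. Note that `scipy.stats.nbinom` is parameterized by the number of failures before the r-th success rather than the total number of trials, so its argument corresponds to k − r here.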

These three are not separate ideas — they're the same Bernoulli trials viewed from different questions: how many successes in n trials, when does the first success occur, and how long until r successes.

Python Implementation

```python
from scipy import stats
import numpy as np

p_ctr = 0.12                 # historical click-through rate
rv = stats.bernoulli(p_ctr)  # frozen Bernoulli(p) distribution

print(f"P(click)    = {rv.pmf(1):.4f}")
print(f"P(no click) = {rv.pmf(0):.4f}")
print(f"E[X]        = {rv.mean():.4f}")
print(f"Var(X)      = {rv.var():.4f}")
# scipy reports entropy in nats; divide by ln(2) to convert to bits.
print(f"Entropy     = {rv.entropy() / np.log(2):.4f} bits")

# Simulate impressions. No seed is set, so these numbers change from run to run.
n_impressions = 1000
clicks = rv.rvs(size=n_impressions)
print(f"\nSimulated {n_impressions} impressions:")
print(f"Observed CTR  = {clicks.mean():.4f}  (expected: {p_ctr})")
print(f"Total clicks  = {clicks.sum()}")
```

Output (the simulated lines come from one run and will vary):

```
P(click)    = 0.1200
P(no click) = 0.8800
E[X]        = 0.1200
Var(X)      = 0.1056
Entropy     = 0.5294 bits

Simulated 1000 impressions:
Observed CTR  = 0.1190  (expected: 0.12)
Total clicks  = 119
```

Bernoulli builds on the PMF and CDF concepts from the first post in this series: the two-bar PMF is the simplest non-trivial PMF you can write down. Understanding Bernoulli deeply is the prerequisite for the Binomial post that follows, because a Binomial variable is the sum of n independent Bernoulli trials with the same p. Further downstream, Bernoulli is the foundation for logistic regression (which models the Bernoulli parameter p as a function of features), for the beta-binomial model in Bayesian A/B testing (where p itself gets a prior), and for information-theoretic concepts like binary cross-entropy loss, which is the negative log-likelihood of a Bernoulli model.
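The link between cross-entropy and the Bernoulli likelihood can be checked in a few lines. This is a sketch: the labels and predicted probabilities are made-up values for illustration, and both function names are hypothetical.

```python
import math


def bernoulli_log_likelihood(y: int, p: float) -> float:
    """log P(Y = y) for Y ~ Bernoulli(p), using the PMF p^y * (1 - p)^(1 - y)."""
    return math.log(p**y * (1 - p) ** (1 - y))


def binary_cross_entropy(y: int, p: float) -> float:
    """Per-example binary cross-entropy in nats."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical click labels and predicted click probabilities for four impressions.
labels = [1, 0, 0, 1]
predicted = [0.9, 0.2, 0.05, 0.6]

nll = -sum(bernoulli_log_likelihood(y, p) for y, p in zip(labels, predicted))
bce = sum(binary_cross_entropy(y, p) for y, p in zip(labels, predicted))
print(f"Negative log-likelihood = {nll:.6f}")
print(f"Summed cross-entropy    = {bce:.6f}")
```

The two totals agree because taking the log of p^y · (1−p)^(1−y) gives y·log(p) + (1−y)·log(1−p), which is the cross-entropy expression with the sign flipped.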

Honest Limitations

Bernoulli assumes a fixed, constant p. Real click-through rates vary by user, time of day, device type, and content context, and treating p = 0.12 as a fixed constant ignores this heterogeneity. A single impression is still a Bernoulli trial, but when p varies across users (for example, following a Beta distribution), per-user click counts follow a Beta-Binomial distribution and are more spread out than a Binomial with one shared p would predict.
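To get a feel for how much heterogeneity matters, the sketch below compares the variance of click counts under a fixed p with the variance when each user's p is drawn from a Beta distribution. The Beta parameters (1.2, 8.8) are an assumption chosen only so that the mean stays at 0.12; they are not estimated from any data. The variance comes from the law of total variance.

```python
def binomial_variance(n: int, p: float) -> float:
    """Variance of the click count when every impression shares the same p."""
    return n * p * (1 - p)


def beta_binomial_variance(n: int, alpha: float, beta: float) -> float:
    """Variance of the click count when p ~ Beta(alpha, beta) for each user.

    Law of total variance: n * mu * (1 - mu) + n * (n - 1) * Var(p).
    """
    mu = alpha / (alpha + beta)
    var_p = mu * (1 - mu) / (alpha + beta + 1)
    return n * mu * (1 - mu) + n * (n - 1) * var_p


n = 200
alpha, beta = 1.2, 8.8  # assumed values; mean = 1.2 / 10 = 0.12
fixed_var = binomial_variance(n, 0.12)
mixed_var = beta_binomial_variance(n, alpha, beta)

print(f"Fixed p variance:   {fixed_var:.2f}")
print(f"Varying p variance: {mixed_var:.2f}")
print(f"Overdispersion:     {mixed_var / fixed_var:.1f}x")
```

Both models have the same expected click count, so the mean alone cannot distinguish them; the extra spread is what reveals that p is not constant.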

Also, Bernoulli handles only two outcomes. Multi-class problems — user clicks one of five recommendation slots — require the Categorical distribution, which is the multi-outcome generalization.
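A Categorical draw can be sketched with the standard library. The slot probabilities below are hypothetical, and the extra "no_click" outcome is an assumption added so that the probabilities sum to 1 while keeping the overall click rate at 0.12.

```python
import random

# Hypothetical outcome probabilities: five slots plus an assumed "no_click" outcome.
outcomes = ["slot_1", "slot_2", "slot_3", "slot_4", "slot_5", "no_click"]
probs = [0.05, 0.03, 0.02, 0.01, 0.01, 0.88]

rng = random.Random(0)  # seeded so the simulation is repeatable
draws = rng.choices(outcomes, weights=probs, k=10_000)

share_no_click = draws.count("no_click") / len(draws)
share_any_click = 1 - share_no_click
print(f"Share of no-click impressions:  {share_no_click:.3f}")
print(f"Share of impressions with click: {share_any_click:.3f}")
```

Collapsing the five slots into a single "click" outcome brings the model back to a Bernoulli trial with p = 0.12.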

Test Your Understanding

  1. A spam filter classifies each email as spam (1) or not spam (0). The historical spam rate is 23%. Write the full PMF for a single email's outcome, including the formula with values substituted.

  2. For a Bernoulli(p = 0.23) random variable representing the spam filter, calculate E[X], Var(X), and entropy. At what value of p is Var(X) maximized, and why does that make intuitive sense?

  3. You run 500 independent Bernoulli trials with p = 0.12. What is the expected total number of successes? Which distribution models the total count — and with what parameters?

  4. The Geometric distribution models "trials until first success" where each trial is Bernoulli(p). If p = 0.12, what is the probability that the first click occurs on impression 5? On impression 10?

  5. A colleague claims that if you observe 140 clicks out of 1000 impressions, the true CTR is probably not 0.12. What statistical concept would you use to evaluate this claim, and what role does the Bernoulli distribution play?
