
Bernoulli Distribution

Apr 11, 2026 • 8 min read • By Mohammed Vasim

Statistics • Math • Data Science

Binary decisions are everywhere in machine learning. Does this transaction pass the fraud filter? Does the user click the recommendation? Does the deployment smoke test pass? Each of these is a single trial with two outcomes — and the Bernoulli distribution is the mathematical tool that lets you reason about them precisely. It looks almost too simple to be worth naming, but it turns out to be the atomic unit of discrete probability: almost everything else in the discrete family — Binomial, Geometric, Negative Binomial — builds directly on it.

The DS/ML anchor

Throughout this post we'll work with a single concrete setting: a recommendation system that shows users a product. Each impression is a Bernoulli trial — the user either clicks (success, X = 1) or doesn't (failure, X = 0). Historical click-through rate (CTR) is 0.12, so p = 0.12 for each impression.

The Setup

A Bernoulli random variable X takes value 1 with probability p and value 0 with probability 1 − p. The PMF is:

$P(X = x) = p^{x} (1 - p)^{1 - x} \quad \text{for } x \in \{0, 1\}$

When x = 1: P(X = 1) = p^1 · (1−p)^0 = p = 0.12

When x = 0: P(X = 0) = p^0 · (1−p)^1 = 1 − p = 0.88

Notation: X ~ Bernoulli(p).
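The PMF formula translates almost symbol for symbol into code. The following is a minimal sketch using only the standard library; the function name `bernoulli_pmf` is an illustrative choice, not something defined elsewhere in this post.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Return P(X = x) = p**x * (1 - p)**(1 - x) for X ~ Bernoulli(p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if x not in (0, 1):
        return 0.0  # no probability mass outside {0, 1}
    return p**x * (1 - p) ** (1 - x)


p_ctr = 0.12  # click-through rate from the recommendation anchor
print(f"P(X = 1) = {bernoulli_pmf(1, p_ctr):.2f}")
print(f"P(X = 0) = {bernoulli_pmf(0, p_ctr):.2f}")
```

The two probabilities always sum to 1, which is a quick sanity check for any value of p you plug in.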

PMF

The PMF has two bars and nothing more. For our recommendation system with p = 0.12, the bar at x = 0 has height 0.88 and the bar at x = 1 has height 0.12.

CDF

The CDF for a Bernoulli variable with p = 0.12 is a step function with two jumps:

$F(x) = \begin{cases} 0 & x < 0 \\ 1 - p = 0.88 & 0 \leq x < 1 \\ 1 & x \geq 1 \end{cases}$
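The step function has three branches, one per region of x. A small sketch, with `bernoulli_cdf` as an illustrative name, evaluates it on either side of each jump:

```python
def bernoulli_cdf(x: float, p: float) -> float:
    """Return F(x) = P(X <= x) for X ~ Bernoulli(p)."""
    if x < 0:
        return 0.0      # below both outcomes
    if x < 1:
        return 1.0 - p  # only the outcome X = 0 has been reached
    return 1.0          # both outcomes are at or below x


p_ctr = 0.12
for x in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"F({x:4.1f}) = {bernoulli_cdf(x, p_ctr):.2f}")
```

The jump at x = 0 has size 1 − p and the jump at x = 1 has size p, so the jump heights are exactly the two PMF bars.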

Expected Value and Variance

$E [X] = p = 0.12$

If the system shows 1,000 impressions, the expected number of clicks is 1,000 × 0.12 = 120.

$Var (X) = p (1 - p) = 0.12 \times 0.88 = 0.1056$

Variance peaks at p = 0.5. At p = 0.12, variance is relatively low — the outcome is predictable (mostly no-click). At p = 0.5 (coin-flip scenario), uncertainty is maximal.

Derivation directly from the definition:

$E[X] = 0 \times (1 - p) + 1 \times p = p$

$E[X^{2}] = 0^{2} \times (1 - p) + 1^{2} \times p = p$

$Var(X) = E[X^{2}] - (E[X])^{2} = p - p^{2} = p (1 - p)$

Because p(1 − p) is a downward-opening parabola in p, the variance is zero at p = 0 or p = 1 (completely predictable) and reaches its maximum of 0.25 at p = 0.5 (maximum uncertainty).
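To see the shape of the variance curve numerically, one option is to evaluate p(1 − p) on a grid and locate its maximum. This is a sketch; the 0.01 grid spacing is an arbitrary choice.

```python
def bernoulli_variance(p: float) -> float:
    """Var(X) = p * (1 - p) for X ~ Bernoulli(p)."""
    return p * (1 - p)


# Evaluate the variance on a grid of p values from 0.00 to 1.00.
grid = [i / 100 for i in range(101)]
p_star = max(grid, key=bernoulli_variance)

print(f"Var at p = 0.12: {bernoulli_variance(0.12):.4f}")
print(f"Variance is largest at p = {p_star:.2f}, "
      f"where it equals {bernoulli_variance(p_star):.4f}")
```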

Skewness: (1−2p) / √(p(1−p))

For the anchor (p=0.12):

```text
skewness = (1 − 2×0.12) / √(0.12×0.88)
         = 0.76 / √0.1056
         = 0.76 / 0.3250
         = 2.34
```

Positive skewness confirms the distribution is right-skewed: the dominant outcome is X=0 (no click), with X=1 being the rare event.

Kurtosis (excess): (1−6p(1−p)) / (p(1−p))

For p=0.12: (1 − 6×0.1056) / 0.1056 = (1 − 0.6336) / 0.1056 = 0.3664 / 0.1056 = 3.47

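Positive excess kurtosis here does not mean heavy tails in the usual sense, since a Bernoulli variable has bounded support. It reflects how lopsided the distribution is: at p = 0.12 the rare outcome X = 1 lies about 2.7 standard deviations from the mean, and that occasional large deviation dominates the fourth moment. At p = 0.5 the same formula gives −2, the lowest excess kurtosis any distribution can have.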
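The closed-form skewness and excess kurtosis can be cross-checked against central moments computed directly from the two-point support. The sketch below does that with the standard library only; `bernoulli_shape` is an illustrative helper name.

```python
import math


def bernoulli_shape(p: float) -> tuple[float, float]:
    """Skewness and excess kurtosis of Bernoulli(p) from central moments."""
    mean = p
    var = p * (1 - p)
    # Central moments: weight each outcome (0 and 1) by its probability.
    m3 = (1 - p) * (0 - mean) ** 3 + p * (1 - mean) ** 3
    m4 = (1 - p) * (0 - mean) ** 4 + p * (1 - mean) ** 4
    skewness = m3 / var**1.5
    excess_kurtosis = m4 / var**2 - 3
    return skewness, excess_kurtosis


p = 0.12
skew, kurt = bernoulli_shape(p)
q = 1 - p
print(f"skewness        : {skew:.2f}  (closed form {(1 - 2 * p) / math.sqrt(p * q):.2f})")
print(f"excess kurtosis : {kurt:.2f}  (closed form {(1 - 6 * p * q) / (p * q):.2f})")
```

Both routes should agree up to floating-point error, which is a useful check when you are unsure whether a formula refers to kurtosis or excess kurtosis.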

MLE: Estimating p from Data

The parameter p is estimated by maximum likelihood. Given n observations x₁, x₂, ..., xₙ (each 0 or 1), the likelihood is:

```text
L(p) = Πᵢ [p^(xᵢ) × (1−p)^(1−xᵢ)]
```

Taking the log-likelihood:

```text
ℓ(p) = Σ [xᵢ log p + (1−xᵢ) log(1−p)]
     = (Σ xᵢ) log p + (n − Σ xᵢ) log(1−p)
```

Differentiating and setting to zero:

```text
dℓ/dp = (Σ xᵢ)/p − (n − Σ xᵢ)/(1−p) = 0
→ (Σ xᵢ)(1−p) = (n − Σ xᵢ)p
→ Σ xᵢ = n p
→ p̂ = (1/n) Σ xᵢ = x̄
```

The MLE for p is simply the sample proportion, the fraction of successes. For example, 119 clicks out of 1,000 impressions gives p̂ = 119/1000 = 0.119, close to the anchor's p = 0.12.
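The closed-form result can be checked numerically by evaluating the log-likelihood on a grid of candidate p values and picking the largest. This is a sketch using the 119-out-of-1000 example; the grid spacing of 0.001 is an arbitrary choice.

```python
import math


def log_likelihood(p: float, successes: int, n: int) -> float:
    """Bernoulli log-likelihood: (Σx) log p + (n − Σx) log(1 − p)."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)


successes, n = 119, 1000

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: log_likelihood(p, successes, n))
p_hat_closed = successes / n

print(f"grid-search maximizer : {p_hat_grid:.3f}")
print(f"closed-form MLE (x̄)   : {p_hat_closed:.3f}")
```

A grid search is only for illustration here. Because the closed form exists, nobody would optimize this likelihood numerically in practice.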

Trace Table: CTR Calculation

Working through specific calculations for our recommendation system (p = 0.12):

| Quantity | Formula | Values | Result |
| --- | --- | --- | --- |
| P(click) | p^1 · (1−p)^0 | 0.12^1 · 0.88^0 | 0.12 |
| P(no click) | p^0 · (1−p)^1 | 0.12^0 · 0.88^1 | 0.88 |
| E[X] | p | 0.12 | 0.12 |
| Var(X) | p(1−p) | 0.12 × 0.88 | 0.1056 |

Entropy

The entropy of a Bernoulli variable measures uncertainty:

$H(X) = -p \log_{2}(p) - (1 - p) \log_{2}(1 - p)$

For our CTR of p = 0.12:

H(X) = −0.12 × log₂(0.12) − 0.88 × log₂(0.88) ≈ 0.529 bits

Compare to p = 0.5: H(X) = 1.0 bit (maximum uncertainty). The recommendation system with 12% CTR has less uncertainty — you can mostly predict "no click" and be right 88% of the time.
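The entropy formula is short enough to compute by hand in code. The sketch below uses base-2 logarithms so the result is in bits, and treats p = 0 and p = 1 as zero entropy (the usual convention that 0 · log 0 = 0); `bernoulli_entropy_bits` is an illustrative name.

```python
import math


def bernoulli_entropy_bits(p: float) -> float:
    """Entropy of Bernoulli(p) in bits: −p log2 p − (1−p) log2(1−p)."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


for p in (0.12, 0.5, 0.99):
    print(f"p = {p:.2f} -> H(X) = {bernoulli_entropy_bits(p):.3f} bits")
```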

Connection to Binomial, Geometric, and Negative Binomial

Bernoulli is the atomic unit that generates three important distributions:

The Binomial distribution is the sum of n independent Bernoulli trials. If you show 200 impressions each with p = 0.12, the total click count follows Binomial(n=200, p=0.12). You're aggregating individual Bernoulli outcomes into a total.

The Geometric distribution counts how many impressions you need until the first click. Each impression is a Bernoulli trial; the Geometric distribution gives the probability that the first success occurs on exactly the k-th trial. P(first click on impression k) = (1−p)^(k−1) · p.

The Negative Binomial distribution generalizes Geometric: it counts how many impressions you need until r clicks occur. If you're trying to collect 10 clicks for a significance test, Negative Binomial tells you the probability of needing exactly k impressions.

These three are not separate ideas — they're the same Bernoulli trials viewed from different questions: how many successes in n trials, when does the first success occur, and how long until r successes.
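The three questions map onto three short PMF functions built from the same p. This is a sketch with illustrative function names; the Negative Binomial here counts total impressions until the r-th click, matching the description above (some libraries count failures instead).

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k: int, p: float) -> float:
    """P(first success occurs on trial k), k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


def negative_binomial_pmf(k: int, r: int, p: float) -> float:
    """P(the r-th success occurs on trial k), k = r, r + 1, ..."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


p = 0.12
print(f"P(24 clicks in 200 impressions)    = {binomial_pmf(24, 200, p):.4f}")
print(f"P(first click on impression 3)     = {geometric_pmf(3, p):.4f}")
print(f"P(10th click on impression 80)     = {negative_binomial_pmf(80, 10, p):.4f}")
```

Setting n = 1 in the Binomial recovers the Bernoulli PMF, and setting r = 1 in the Negative Binomial recovers the Geometric, which is the sense in which all three are the same trials viewed differently.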

Python Implementation

```python
from scipy import stats
import numpy as np

p_ctr = 0.12
rv = stats.bernoulli(p_ctr)  # frozen Bernoulli(p) distribution

print(f"P(click)    = {rv.pmf(1):.4f}")
print(f"P(no click) = {rv.pmf(0):.4f}")
print(f"E[X]        = {rv.mean():.4f}")
print(f"Var(X)      = {rv.var():.4f}")
# SciPy returns entropy in nats (natural log); divide by ln 2 to get bits.
print(f"Entropy     = {rv.entropy() / np.log(2):.4f} bits")

# Simulate impressions. No seed is set, so the observed CTR and the
# click count vary from run to run.
n_impressions = 1000
clicks = rv.rvs(size=n_impressions)
print(f"\nSimulated {n_impressions} impressions:")
print(f"Observed CTR  = {clicks.mean():.4f}  (expected: {p_ctr})")
print(f"Total clicks  = {clicks.sum()}")
```

Sample output from one run (the simulated values change between runs):

```text
P(click)    = 0.1200
P(no click) = 0.8800
E[X]        = 0.1200
Var(X)      = 0.1056
Entropy     = 0.5294 bits

Simulated 1000 impressions:
Observed CTR  = 0.1190  (expected: 0.12)
Total clicks  = 119
```

Related Concepts

Bernoulli builds on the PMF and CDF concepts from the first post in this series: the two-bar PMF is the simplest possible PMF you can write down. Understanding Bernoulli deeply is the prerequisite for the Binomial post that follows, because Binomial is nothing more than summing n independent Bernoulli trials. Further downstream, Bernoulli is the foundation for logistic regression (which models the Bernoulli parameter p as a function of features), for the beta-binomial model in Bayesian A/B testing (where p itself gets a prior), and for information-theoretic concepts like cross-entropy loss, which is the negative log-likelihood of a Bernoulli model.

Honest Limitations

Bernoulli assumes a fixed, constant p. Real click-through rates vary by user, time of day, device type, and content context, and treating p = 0.12 as a constant ignores this heterogeneity. When p varies across the population, click counts aggregated per user or per segment are more spread out than a Binomial with a single p predicts; if p is modeled as Beta-distributed, those counts follow a Beta-Binomial distribution rather than a plain Binomial.
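A small simulation makes the effect of heterogeneity visible. The sketch below compares per-user click counts when every user shares p = 0.12 against counts when each user draws their own p from a Beta distribution with the same mean. The Beta(1.2, 8.8) parameters, the 2,000 users, and the 50 impressions per user are all hypothetical values chosen for illustration, not figures from a real system.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so repeated runs give the same numbers

N_USERS = 2000
N_IMPRESSIONS = 50
ALPHA, BETA = 1.2, 8.8  # hypothetical; mean = 1.2 / (1.2 + 8.8) = 0.12


def clicks_for_user(p: float, n: int) -> int:
    """Sum of n independent Bernoulli(p) trials for one user."""
    return sum(random.random() < p for _ in range(n))


# Every user has the same click probability.
fixed = [clicks_for_user(0.12, N_IMPRESSIONS) for _ in range(N_USERS)]

# Each user draws their own click probability from Beta(ALPHA, BETA).
varying = [
    clicks_for_user(random.betavariate(ALPHA, BETA), N_IMPRESSIONS)
    for _ in range(N_USERS)
]

print(f"fixed p   : mean = {mean(fixed):.2f}, variance = {pvariance(fixed):.2f}")
print(f"varying p : mean = {mean(varying):.2f}, variance = {pvariance(varying):.2f}")
```

Both populations average about 6 clicks per user, but the variance is several times larger when p varies. That extra spread is what a single fixed p cannot capture.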

Also, Bernoulli handles only two outcomes. Multi-class problems — user clicks one of five recommendation slots — require the Categorical distribution, which is the multi-outcome generalization.

Test Your Understanding

1. A spam filter classifies each email as spam (1) or not spam (0). The historical spam rate is 23%. Write the full PMF for a single email's outcome, including the formula with values substituted.
2. For a Bernoulli(p = 0.23) random variable representing the spam filter, calculate E[X], Var(X), and entropy. At what value of p is Var(X) maximized, and why does that make intuitive sense?
3. You run 500 independent Bernoulli trials with p = 0.12. What is the expected total number of successes? Which distribution models the total count, and with what parameters?
4. The Geometric distribution models "trials until first success" where each trial is Bernoulli(p). If p = 0.12, what is the probability that the first click occurs on impression 5? On impression 10?
5. A colleague claims that if you observe 140 clicks out of 1000 impressions, the true CTR is probably not 0.12. What statistical concept would you use to evaluate this claim, and what role does the Bernoulli distribution play?
