
Bayes' Theorem

Apr 11, 2026 • 11 min read • By Mohammed Vasim

Statistics • Math • Data Science

Your fraud detection model has 95% recall. A transaction gets flagged. What is the probability it is actually fraudulent?

Most people answer "around 95%, because the model said so." With a 2% fraud rate and a 5% false positive rate, the correct answer is 27.9%. The gap between those two numbers comes from base rate neglect, one of the most important concepts in probabilistic reasoning. Bayes' Theorem is the formula that computes the right number. Everything else in this post is unpacking why the gap is so large, and what to do about it.

The Anchor

python

# Prior probabilities (historical data)
P_fraud = 0.02          # 2% of transactions are fraudulent
P_legit = 1 - P_fraud   # 98% are legitimate

# Model performance
P_flagged_given_fraud = 0.95   # sensitivity / recall
P_flagged_given_legit = 0.05   # false positive rate

# Question: a transaction was flagged — P(fraud | flagged) = ?

Derivation from First Principles

Bayes' Theorem is not an axiom — it follows from the definition of conditional probability.

Step 1. Conditional probability definition: P(A|B) = P(A ∩ B) / P(B) ... (i)

Step 2. Apply the same definition symmetrically: P(B|A) = P(A ∩ B) / P(A) → P(A ∩ B) = P(B|A) × P(A) ... (ii)

Step 3. Substitute (ii) into (i): P(A|B) = [P(B|A) × P(A)] / P(B)

This is Bayes' Theorem. No new assumptions — only the multiplication rule applied twice.
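The three steps can be checked numerically. The following is a minimal sketch, assuming a joint distribution whose four cells are the anchor values multiplied out (the `joint` dictionary and variable names are illustrative, not from the original post). It computes P(A|B) once from the definition and once through Bayes' Theorem.

```python
import math

# Illustrative joint distribution over (A, B), with A = fraud and B = flagged.
# The four cells are the anchor values multiplied out; they sum to 1.
joint = {
    ("fraud", "flagged"): 0.019,
    ("fraud", "not_flagged"): 0.001,
    ("legit", "flagged"): 0.049,
    ("legit", "not_flagged"): 0.931,
}

p_a = sum(p for (a, _), p in joint.items() if a == "fraud")    # P(A)
p_b = sum(p for (_, b), p in joint.items() if b == "flagged")  # P(B)
p_a_and_b = joint[("fraud", "flagged")]                        # P(A ∩ B)

# Equation (i): the definition of conditional probability.
direct = p_a_and_b / p_b

# Equation (ii) substituted into (i): Bayes' Theorem.
p_b_given_a = p_a_and_b / p_a
via_bayes = p_b_given_a * p_a / p_b

print(f"{direct:.3f} {via_bayes:.3f}")
```

Both routes give the same number because the second is only an algebraic rearrangement of the first.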

The Four Terms

Define these BEFORE substituting any numbers:

Prior P(A): what you believed before seeing the evidence. P(fraud) = 0.02 — this comes from historical transaction data, not from the model. You knew fraud rate was 2% before any model ran.

Likelihood P(B|A): how probable is the evidence B, given A is true? P(flagged | fraud) = 0.95 — the model's sensitivity. This is the number everyone focuses on.

Posterior P(A|B): what you believe after seeing the evidence. P(fraud | flagged) = ? — the number you actually need.

Evidence P(B): the normalizing constant. P(flagged) — the total rate at which transactions get flagged, regardless of whether they are fraudulent.

Mnemonic: posterior ∝ prior × likelihood. The evidence just ensures the posterior sums to 1.
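The mnemonic translates almost directly into code. Here is a small sketch using the anchor values, with dictionary names chosen only for illustration: multiply prior by likelihood for each hypothesis, then divide by the sum so the result is a proper distribution.

```python
# Hypotheses and their priors (anchor values).
prior = {"fraud": 0.02, "legit": 0.98}
# Likelihood of the evidence "flagged" under each hypothesis.
likelihood = {"fraud": 0.95, "legit": 0.05}

# Unnormalized posterior: prior times likelihood, one entry per hypothesis.
unnormalized = {h: prior[h] * likelihood[h] for h in prior}

# The evidence P(flagged) is the sum of those products.
evidence = sum(unnormalized.values())

# Dividing by the evidence makes the posterior sum to 1.
posterior = {h: unnormalized[h] / evidence for h in unnormalized}

for h, p in posterior.items():
    print(f"P({h} | flagged) = {p:.3f}")
```

The evidence never changes the ranking of hypotheses; it only rescales the products.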

Law of Total Probability for the Denominator

P(flagged) is rarely directly available. Compute it by summing over all ways a transaction can be flagged:

P(flagged) = P(flagged | fraud) × P(fraud) + P(flagged | legit) × P(legit)

Substituting anchor values: P(flagged) = 0.95 × 0.02 + 0.05 × 0.98 = 0.019 + 0.049 = 0.068

6.8% of all transactions get flagged. The 0.049 term (legitimate transactions flagged incorrectly) is about 2.6 times larger than the 0.019 term (fraudulent transactions flagged correctly). That is why the posterior is so much lower than 95%.

Full Posterior Calculation

P(fraud | flagged) = P(flagged | fraud) × P(fraud) / P(flagged) = 0.95 × 0.02 / 0.068 = 0.019 / 0.068 = 0.279

Only 27.9% of flagged transactions are actually fraudulent. The low prior (2% fraud rate) overwhelms the model's high recall.
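The calculation fits in one small function. The sketch below uses a hypothetical helper name, `posterior_given_flag`, and Python's `fractions` module so the arithmetic is exact rather than rounded.

```python
from fractions import Fraction

def posterior_given_flag(prior, recall, false_positive_rate):
    """Return P(fraud | flagged) via Bayes' Theorem.

    The denominator uses the law of total probability over fraud and legit.
    """
    true_flags = recall * prior
    false_flags = false_positive_rate * (1 - prior)
    return true_flags / (true_flags + false_flags)

# Fractions avoid floating-point rounding, so the result is exact.
exact = posterior_given_flag(Fraction("0.02"), Fraction("0.95"), Fraction("0.05"))
print(exact, float(exact))
```

The exact posterior is 19/68, which rounds to the 0.279 shown above. The function also accepts plain floats.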

Natural Frequencies (10,000 Transactions)

Ratios are abstract. Counts make the mechanism visible:

	Flagged	Not Flagged	Total
Fraud	190	10	200
Legit	490	9,310	9,800
Total	680	9,320	10,000

200 fraudulent transactions (2%), model flags 95% = 190. Misses 10.
9,800 legitimate transactions (98%), model wrongly flags 5% = 490.
Total flagged = 190 + 490 = 680.
Among flagged: 190/680 = 27.9% fraud, 490/680 = 72.1% false alarms.
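The same table can be rebuilt from the three anchor percentages. This is a minimal sketch with illustrative variable names; it uses integer arithmetic so each cell is an exact count.

```python
total = 10_000

# Integer arithmetic keeps the counts exact (2%, 95%, and 5% of whole numbers).
fraud = total * 2 // 100               # 200 fraudulent transactions
legit = total - fraud                  # 9,800 legitimate transactions

fraud_flagged = fraud * 95 // 100      # true positives
fraud_missed = fraud - fraud_flagged   # false negatives
legit_flagged = legit * 5 // 100       # false positives
legit_cleared = legit - legit_flagged  # true negatives

flagged = fraud_flagged + legit_flagged
print(f"flagged: {flagged}, of which fraud: {fraud_flagged}")
print(f"share of flags that are fraud: {fraud_flagged / flagged:.1%}")
```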

Bayesian Tree Diagram

[Figure: Bayesian tree diagram]

Base Rate Neglect: The Prosecutor's Fallacy

The error: P(flagged | fraud) = 0.95. "So if flagged, it's 95% likely to be fraud."

This confuses P(flagged | fraud) with P(fraud | flagged). They are not the same. In ML:

Precision = P(truth | predicted positive) = P(A|B). This is the posterior.
Recall = P(predicted positive | truth) = P(B|A). This is the likelihood.

Reporting recall as precision is the base rate fallacy. Without knowing the prior (2% fraud rate) and the false positive rate (5%), you cannot compute precision from recall alone.

In legal settings, prosecutors sometimes argue: "the probability of a chance DNA match is 1 in a million, so the defendant is almost certainly guilty." This confuses P(match | innocent) with P(innocent | match). In a city of several million people, a few innocent people would be expected to match by chance, so without other evidence the match alone does not make guilt near certain. It is the same structure as reporting recall as precision.

Sensitivity Analysis: How the Prior Changes the Answer

Same model likelihoods (0.95 recall, 0.05 FPR), different fraud prevalence:

Prior P(fraud)	P(flagged)	P(fraud | flagged)
0.001 (0.1%)	0.05090	0.019 (1.9%)
0.01 (1%)	0.05900	0.161 (16.1%)
0.02 (2%)	0.06800	0.279 (27.9%)
0.10 (10%)	0.14000	0.679 (67.9%)
0.50 (50%)	0.50000	0.950 (95.0%)

When fraud is very rare (0.1%), even a 95% sensitive model produces 98.1% false alarms among its flags. The posterior is dominated by the prior when the prior is extreme.
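A short loop reproduces the sensitivity table. This is a sketch that holds recall and false positive rate fixed and varies only the prior; the function name `flag_rate_and_posterior` is illustrative.

```python
RECALL = 0.95               # P(flagged | fraud), held fixed
FALSE_POSITIVE_RATE = 0.05  # P(flagged | legit), held fixed

def flag_rate_and_posterior(prior):
    """Return (P(flagged), P(fraud | flagged)) for a given fraud prevalence."""
    p_flagged = RECALL * prior + FALSE_POSITIVE_RATE * (1 - prior)
    return p_flagged, RECALL * prior / p_flagged

rows = [(prior, *flag_rate_and_posterior(prior))
        for prior in (0.001, 0.01, 0.02, 0.10, 0.50)]

for prior, p_flagged, posterior in rows:
    print(f"{prior:<6} {p_flagged:.5f}  {posterior:.3f}")
```

Only at a 50% prior does the posterior equal the recall, because the model's two error rates are symmetric (5% each) and the prior no longer favors either class.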

Bayesian Updating: Sequential Evidence

One of the most important Bayesian concepts: the posterior from one observation becomes the prior for the next. Each piece of evidence refines the belief.

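Scenario: two consecutive flagged transactions from the same account. The calculation treats both flags as evidence about the same underlying hypothesis (the account is being used fraudulently) and assumes the flags are conditionally independent given that status. Positively correlated flags would carry less information than this calculation suggests.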

Update 1 (first flag): Prior = 0.02 → Posterior₁ = 0.279 (computed above)

Update 2 (second flag, same model): New prior = Posterior₁ = 0.279

P(flagged) = 0.95 × 0.279 + 0.05 × 0.721 = 0.265 + 0.036 = 0.301

Posterior₂ = (0.95 × 0.279) / 0.301 = 0.265 / 0.301 = 0.880

After two consecutive flags: belief rises from 2% → 27.9% → 88%.

Sequential Update Table

Observation	Prior	Likelihood	Evidence	Posterior
Before any flag	0.020	—	—	0.020
After 1st flag	0.020	0.95	0.068	0.279
After 2nd flag	0.279	0.95	0.301	0.880
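The update rule is the same function applied repeatedly. The sketch below uses a hypothetical helper, `update_on_flag`, and also compares the result with a single batch update; the two agree only under the assumption that the flags are conditionally independent given the true status.

```python
RECALL = 0.95               # P(flagged | fraud)
FALSE_POSITIVE_RATE = 0.05  # P(flagged | legit)

def update_on_flag(prior):
    """One Bayesian update after observing a flag; returns the posterior."""
    evidence = RECALL * prior + FALSE_POSITIVE_RATE * (1 - prior)
    return RECALL * prior / evidence

belief = 0.02  # prior before any flag
history = [belief]
for _ in range(2):  # two consecutive flags
    belief = update_on_flag(belief)
    history.append(belief)

# Batch check: treat both flags as one observation with squared likelihoods.
# This matches the sequential result only if the flags are conditionally
# independent given the true status, which is an assumption here.
batch = (RECALL**2 * 0.02) / (RECALL**2 * 0.02 + FALSE_POSITIVE_RATE**2 * 0.98)

print([round(b, 3) for b in history], round(batch, 3))
```

Updating one observation at a time and updating on all observations at once give the same posterior, which is why the order of independent evidence does not matter.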

Applications in ML

1. Bayesian hyperparameter optimization (for example TPE, the Tree-structured Parzen Estimator): After each trial, the belief about which regions of hyperparameter space perform well is updated based on observed performance. The acquisition function (expected improvement) is a posterior-based decision rule: it picks the next configuration with the highest expected improvement over the current best, given what has been observed.

2. Naive Bayes classifier: Computes P(class | features) = P(features | class) × P(class) / P(features). The "naive" assumption is that features are independent given the class. This makes the likelihood P(features | class) tractable as a product of individual feature likelihoods. The assumption rarely holds literally, but the classifier performs surprisingly well.

3. Bayesian A/B testing: Instead of a binary reject/fail-to-reject decision, compute the full posterior over the effect size δ. Report P(δ > 0 | data), the probability the treatment is better, directly. This is the quantity practitioners actually want. Comparing many variants still calls for care, which in a Bayesian analysis is usually handled through the prior (for example, a hierarchical model) rather than a separate multiple comparison correction.
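As an illustration of the A/B testing item, here is a minimal sketch that estimates P(δ > 0 | data) for two conversion rates. It assumes a uniform Beta(1, 1) prior on each rate, and the conversion counts are invented for the example. The function name `prob_treatment_better` is hypothetical.

```python
import random

def prob_treatment_better(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_b > rate_a | data) by sampling Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta prior + binomial data gives a Beta posterior for each rate.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Invented counts: 120/1000 conversions for control, 150/1000 for treatment.
p = prob_treatment_better(120, 1000, 150, 1000)
print(f"P(delta > 0 | data) ~ {p:.3f}")
```

The result is a Monte Carlo estimate, so it varies slightly with `draws` and `seed`. A different prior would shift it, most noticeably when the samples are small.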

Code