Bayes' Theorem
Your spam classifier flags an email as spam. The classifier has 95% recall on spam and 90% specificity. What is the actual probability this specific email is spam? Your instinct says "probably high." The math may surprise you, and understanding exactly why is the clearest entry point to Bayes' Theorem.
Bayes' Theorem tells you how to update beliefs rationally when you see new evidence. It is the mathematical foundation of Bayesian classifiers, probabilistic inference in ML, and any reasoning that involves combining prior knowledge with observed data.
The Setup: Conditional Probability
$P(A \mid B)$ is read "the probability of A given B": the probability of A, given that B is already known to have occurred.
For the spam classifier, $P(\text{spam} \mid \text{flagged})$ is the probability an email is actually spam, given the classifier flagged it. This is what you care about when deciding whether to move an email to the junk folder.
The fundamental definition:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
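As a rough illustration of the definition of conditional probability, $P(A \mid B) = P(A \cap B) / P(B)$, the sketch below computes it from joint counts. The counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical counts for 1,000 emails (illustrative, not real data).
counts = {
    ("spam", "flagged"): 285,
    ("spam", "not flagged"): 15,
    ("not spam", "flagged"): 70,
    ("not spam", "not flagged"): 630,
}
total = sum(counts.values())

# P(A and B): fraction of all emails that are spam AND flagged.
p_spam_and_flagged = counts[("spam", "flagged")] / total

# P(B): fraction of all emails that are flagged, whether spam or not.
p_flagged = sum(n for (_, flag), n in counts.items() if flag == "flagged") / total

# Definition of conditional probability: P(A | B) = P(A and B) / P(B).
p_spam_given_flagged = p_spam_and_flagged / p_flagged
print(f"P(spam | flagged) = {p_spam_given_flagged:.3f}")
```

Dividing by $P(B)$ restricts attention to the flagged emails only, which is all "given B" means.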
The Theorem
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Or in ML terms:
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
Posterior $P(A \mid B)$: Your updated belief about A after seeing evidence B. The probability an email is spam, given it was flagged.
Likelihood $P(B \mid A)$: How probable would this evidence be if A were true? The probability of flagging, given the email is spam. This is the classifier's recall.
Prior $P(A)$: Your belief about A before seeing this evidence. The base rate of spam in your inbox.
Evidence $P(B)$: Total probability of observing this evidence across all scenarios, that is, the probability of being flagged by the classifier whether or not the email is truly spam.
The Spam Classifier Worked Example
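Suppose 30% of emails are spam ($P(\text{spam}) = 0.30$). The classifier has:
- Recall (sensitivity): $P(\text{flagged} \mid \text{spam}) = 0.95$
- Specificity: $P(\text{not flagged} \mid \text{not spam}) = 0.90$, so $P(\text{flagged} \mid \text{not spam}) = 0.10$
What is $P(\text{spam} \mid \text{flagged})$?
Step 1: Prior: $P(\text{spam}) = 0.30$
Step 2: Likelihood: $P(\text{flagged} \mid \text{spam}) = 0.95$
Step 3: Calculate the evidence, the total probability of being flagged:
$$P(\text{flagged}) = P(\text{flagged} \mid \text{spam})\,P(\text{spam}) + P(\text{flagged} \mid \text{not spam})\,P(\text{not spam}) = 0.95 \times 0.30 + 0.10 \times 0.70 = 0.355$$
Step 4: Apply Bayes' Theorem:
$$P(\text{spam} \mid \text{flagged}) = \frac{P(\text{flagged} \mid \text{spam})\,P(\text{spam})}{P(\text{flagged})} = \frac{0.285}{0.355} \approx 0.803$$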
Given a flag, there is an 80.3% probability the email is actually spam. This is the posterior — the precision of the classifier at this base rate.
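| Phase | Formula | Values | Result |
|---|---|---|---|
| Prior | $P(\text{spam})$ | 30% spam base rate | $0.30$ |
| Likelihood | $P(\text{flagged} \mid \text{spam})$ | classifier recall | $0.95$ |
| False positive rate | $P(\text{flagged} \mid \text{not spam})$ | $1 - 0.90$ | $0.10$ |
| Evidence | $P(\text{flagged})$, marginalizing over both cases | $0.95 \times 0.30 + 0.10 \times 0.70$ | $0.355$ |
| Posterior | $P(\text{spam} \mid \text{flagged})$, Bayes numerator / denominator | $0.285 / 0.355$ | $0.803$ |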
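The four steps of the worked example can be sketched as a small function. The name `posterior_given_flag` and its parameter names are illustrative choices, not part of any library.

```python
def posterior_given_flag(prior, recall, specificity):
    """Return P(spam | flagged) via Bayes' Theorem.

    prior:       P(spam), the base rate
    recall:      P(flagged | spam)
    specificity: P(not flagged | not spam)
    """
    false_positive_rate = 1.0 - specificity  # P(flagged | not spam)
    # Evidence: total probability of a flag, summed over spam and not spam.
    evidence = recall * prior + false_positive_rate * (1.0 - prior)
    return recall * prior / evidence


result = posterior_given_flag(prior=0.30, recall=0.95, specificity=0.90)
print(f"P(spam | flagged) = {result:.3f}")  # 0.803
```

Keeping the evidence as an explicit intermediate mirrors Step 3 and makes it easy to see that the denominator is just the two ways an email can end up flagged.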
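Now consider what happens when spam is rarer, say only 5% of emails are spam ($P(\text{spam}) = 0.05$):
$$P(\text{spam} \mid \text{flagged}) = \frac{0.95 \times 0.05}{0.95 \times 0.05 + 0.10 \times 0.95} = \frac{0.0475}{0.1425} \approx 0.333$$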
Only a 33% probability of spam despite a flag. This is the base rate fallacy: ignoring the prior turns a classifier that seems accurate into one that is mostly wrong in low-prevalence settings.
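To see how strongly the base rate drives the result, one option is to hold recall and specificity fixed and sweep the prior. This sketch reuses the example's 95% recall and 90% specificity; the function name and the list of base rates are arbitrary choices for illustration.

```python
RECALL = 0.95        # P(flagged | spam), from the worked example
SPECIFICITY = 0.90   # P(not flagged | not spam), from the worked example


def precision_at_base_rate(prior, recall=RECALL, specificity=SPECIFICITY):
    """P(spam | flagged) for a given base rate P(spam)."""
    true_positives = recall * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Base rates chosen for illustration only.
for base_rate in (0.50, 0.30, 0.05, 0.01):
    print(f"base rate {base_rate:.0%}: P(spam | flagged) = "
          f"{precision_at_base_rate(base_rate):.1%}")
```

The classifier itself never changes across the loop; only the prior does, which is the whole point of the base rate fallacy.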
How Belief Updating Works
The Bayesian process iterates: today's posterior becomes tomorrow's prior.
In production, this applies to your estimates of the classifier's error rates rather than to any single email. Each piece of user feedback is new evidence: an email the classifier flagged, but the user moved back to the inbox, updates your estimate of the false positive rate. That updated estimate is then the prior you carry into the next round of feedback.
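One way to make this updating concrete is to track the false positive rate with a Beta distribution, where each piece of user feedback on a non-spam email adds one count. This is a minimal sketch under assumptions: the `BetaBelief` class, the uniform starting prior, and the feedback sequence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about a rate, e.g. the false positive rate."""
    alpha: float = 1.0  # pseudo-count of "flagged" outcomes (uniform prior)
    beta: float = 1.0   # pseudo-count of "not flagged" outcomes

    def update(self, flagged: bool) -> "BetaBelief":
        # The posterior after one observation becomes the prior for the next.
        if flagged:
            return BetaBelief(self.alpha + 1, self.beta)
        return BetaBelief(self.alpha, self.beta + 1)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


# Hypothetical feedback on emails the user confirmed as NOT spam:
# True = the classifier had flagged it (a false positive), False = it had not.
feedback = [False, False, True, False, False, False, False, True, False, False]

belief = BetaBelief()
for was_flagged in feedback:
    belief = belief.update(was_flagged)

print(f"Posterior: Beta({belief.alpha:.0f}, {belief.beta:.0f}), "
      f"mean false positive rate = {belief.mean:.3f}")
```

Returning a new `BetaBelief` from `update` keeps each intermediate posterior available, which is convenient if you want to plot how the estimate moves as feedback arrives.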
Naive Bayes Classifier
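For text classification, Bayes' Theorem becomes:
$$P(C \mid w_1, \dots, w_n) \propto P(C) \prod_{i=1}^{n} P(w_i \mid C)$$
where $C$ is the class (spam or not spam) and $w_1, \dots, w_n$ are the words in the email.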
The "naive" assumption: features (words) are conditionally independent given the class. This rarely holds literally, but the classifier works surprisingly well despite the violation.
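```python
import numpy as np


def bayesian_spam_posterior(word_probs_given_spam, word_probs_given_not_spam, spam_prior=0.30):
    """
    word_probs_given_spam: dict of {word: P(word in email | email is spam)}
    word_probs_given_not_spam: dict of {word: P(word in email | email is not spam)}
    Returns posterior P(spam | these words present), assuming the words are
    conditionally independent given the class.
    """
    words = list(word_probs_given_spam)
    # Naive Bayes likelihoods: product of per-word probabilities within each class.
    # P(word | not spam) is a separate quantity; it is not 1 - P(word | spam).
    p_words_given_spam = np.prod([word_probs_given_spam[w] for w in words])
    p_words_given_not_spam = np.prod([word_probs_given_not_spam[w] for w in words])

    p_spam = spam_prior
    p_not_spam = 1 - spam_prior

    numerator = p_words_given_spam * p_spam
    denominator = numerator + p_words_given_not_spam * p_not_spam
    return numerator / denominator


# An email containing "free", "win", "prize"
word_spam_probs = {"free": 0.80, "win": 0.75, "prize": 0.85}
# Illustrative word frequencies in non-spam email (assumed values for this example)
word_ham_probs = {"free": 0.20, "win": 0.10, "prize": 0.05}

posterior = bayesian_spam_posterior(word_spam_probs, word_ham_probs, spam_prior=0.30)
print(f"P(spam | 'free', 'win', 'prize'): {posterior:.4f}")

# Same email at lower base rate
posterior_low_base = bayesian_spam_posterior(word_spam_probs, word_ham_probs, spam_prior=0.05)
print(f"P(spam | same words, 5% base rate): {posterior_low_base:.4f}")
```

```
P(spam | 'free', 'win', 'prize'): 0.9954
P(spam | same words, 5% base rate): 0.9641
```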
Bayesian vs Frequentist for ML Evaluation
In frequentist hypothesis testing (posts 3-9), parameters are fixed unknowns and probability statements only describe long-run frequencies of data. In Bayesian inference, you assign probability distributions to parameters themselves, expressing genuine uncertainty:
$$P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta)\,P(\theta)}{P(\text{data})}$$
The posterior distribution for model accuracy given validation data is the Bayesian answer to "what is this model's true accuracy?" Unlike a frequentist confidence interval, a 95% Bayesian credible interval has the direct interpretation: "there is a 95% probability the true accuracy is in this range" — conditional on the prior being correct.
For A/B testing at scale, Bayesian approaches let you compute $P(\text{B is better than A} \mid \text{data})$ directly. This is the quantity you actually care about, and frequentist tests cannot provide it.
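As a sketch of how that probability might be estimated, the snippet below draws from Beta posteriors for two conversion rates and counts how often variant B beats variant A. The conversion counts are made up for illustration, a uniform Beta(1, 1) prior is assumed for both variants, and the function name `prob_b_beats_a` is hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data).

    Assumes a uniform Beta(1, 1) prior on each conversion rate, so each
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical experiment: 120/1000 conversions for A, 150/1000 for B.
estimate = prob_b_beats_a(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"P(B > A | data) is approximately {estimate:.3f}")
```

Sampling is used here only because it needs nothing beyond the standard library; the estimate carries Monte Carlo noise that shrinks as `draws` grows.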
Conjugate priors keep computation tractable. For classifier accuracy:
- Beta prior $\times$ Bernoulli likelihood $\rightarrow$ Beta posterior
- If you have run 100 validation examples with 87 correct, and your prior is Beta(1,1) (uniform), the posterior is Beta(88, 14)
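The conjugate update in this example amounts to adding counts. The sketch below performs the update and approximates a 95% credible interval by sampling with the standard library's `random.betavariate`. The helper name `beta_posterior` and the sample count are illustrative choices, and a library quantile function would normally replace the sampling step.

```python
import random


def beta_posterior(prior_alpha, prior_beta, successes, failures):
    """Beta prior + Bernoulli data -> Beta posterior (add the counts)."""
    return prior_alpha + successes, prior_beta + failures


# 87 correct out of 100 validation examples, uniform Beta(1, 1) prior.
alpha, beta = beta_posterior(1, 1, successes=87, failures=13)
posterior_mean = alpha / (alpha + beta)

# Approximate the central 95% credible interval by Monte Carlo sampling.
rng = random.Random(42)
samples = sorted(rng.betavariate(alpha, beta) for _ in range(50_000))
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]

print(f"Posterior: Beta({alpha}, {beta}), mean = {posterior_mean:.3f}")
print(f"Approximate 95% credible interval: ({lower:.3f}, {upper:.3f})")
```

The posterior mean, 88/102, sits slightly below the raw 87% accuracy because the uniform prior contributes one pseudo-success and one pseudo-failure.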
The Prosecutor's Fallacy in Model Evaluation
A common error in data science: "The probability of getting this accuracy by chance is 0.01%, so the model is definitely good." This confuses $P(\text{data} \mid \text{chance})$ with $P(\text{chance} \mid \text{data})$. A highly unlikely result under random guessing does not mean the model is reliable: it might have overfit, or the test set might be unrepresentative.
Related Concepts
Bayes' Theorem connects backward to conditional probability and the definition of likelihood, both introduced here. It connects forward to Bayesian model comparison, where you compute the probability that one model specification is better than another given the data, not just whether their means differ. The base rate fallacy it reveals is exactly the prior-neglect problem in frequentist testing: p-values compute $P(\text{data at least this extreme} \mid H_0)$ and ignore the prior $P(H_0)$. Understanding both perspectives makes you a more effective practitioner than knowing only one.
Honest Limitations
Bayesian inference requires specifying a prior. In practice, priors for ML model parameters are often diffuse (uninformative) or chosen for computational convenience rather than genuine prior knowledge. The resulting posteriors can be sensitive to prior choice, especially with small datasets. When priors are genuinely uninformative and samples are large, Bayesian and frequentist conclusions converge. The choice of framework matters most in small-data regimes and when you have real domain knowledge to encode.
Test Your Understanding
- Your classifier has 99% recall and 80% specificity. At a base rate of 1% (rare fraud detection), compute $P(\text{fraud} \mid \text{flagged})$. Why is the answer lower than 50% despite 99% recall?
- An ML team reports that their model correctly classifies 92% of positive examples. A business analyst concludes: "So 92% of what the model flags as positive are actually positive." What mistake is being made, and what additional information do you need?
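- For the spam example with 30% base rate, if the first email is flagged (posterior = 80.3%), you use this as the prior for the next email. The next email is not flagged (with $P(\text{not flagged} \mid \text{spam}) = 0.05$ and $P(\text{not flagged} \mid \text{not spam}) = 0.90$). What is the new posterior?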
- Explain why the Naive Bayes classifier's "naive" assumption (feature independence) is violated for text data, and why the classifier often works well despite this violation.
- A frequentist A/B test gives a p-value for a feature improvement. A Bayesian analysis with a uniform prior gives a posterior probability $P(\text{B is better than A} \mid \text{data})$. Are these the same result expressed differently, or fundamentally different quantities? What would make them diverge?