Bayes Theorem
Most classifiers learn a direct mapping from features to class labels. Naive Bayes takes a different route: it models how each class generates features, then uses probability theory to invert the question — given these features, how likely is each class? The inversion is Bayes theorem, and understanding it precisely is worth the time before touching any classifier.
Anchor problem: A medical test for a rare disease. Given a positive test result, what is the actual probability of having the disease?
```python
# Disease prevalence (prior): 1% of population
# Test sensitivity: 95%. If you have the disease, the test is positive 95% of the time
# Test specificity: 90%. If you don't have the disease, the test is negative 90% of the time
# Test false positive rate: 1 - specificity = 10%
P_disease = 0.01        # P(D)
P_no_disease = 0.99     # P(¬D)
P_pos_given_D = 0.95    # P(+|D), sensitivity
P_pos_given_noD = 0.10  # P(+|¬D), false positive rate
```

Bayes Theorem: Statement

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$
In English: the probability of A given B equals the probability of B given A, times our prior belief in A, divided by the overall probability of B.
For the disease problem:
- A = "has disease" (D)
- B = "positive test result" (+)
Each term has a name:
| Term | Notation | Meaning in This Problem |
|---|---|---|
| Posterior | $P(D \mid +)$ | Probability of disease given a positive test: what we want |
| Likelihood | $P(+ \mid D)$ | Probability of a positive test if you have the disease |
| Prior | $P(D)$ | Base rate of disease in the population, before seeing the test |
| Evidence | $P(+)$ | Overall probability of testing positive (from all sources) |
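As a minimal sketch, the four named terms map onto a single helper function. The name `bayes_posterior` and its argument names are illustrative choices, not something from a library; the evidence term is computed here with the law of total probability from the numbers in the anchor problem.

```python
def bayes_posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("evidence P(B) must be positive")
    return likelihood * prior / evidence


# Numbers from the disease problem
prior = 0.01        # P(D)
likelihood = 0.95   # P(+|D)
# Evidence: P(+) = P(+|D) P(D) + P(+|¬D) P(¬D)
evidence = 0.95 * 0.01 + 0.10 * 0.99

posterior = bayes_posterior(prior, likelihood, evidence)
print(f"P(D|+) = {posterior:.4f}")
```

With the anchor problem's numbers this prints a posterior of about 0.0876.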
Step 1: The Prior
$P(D) = 0.01$: only 1 in 100 people in the population has this disease.
This number does enormous work. Before you walk into the testing clinic, there is a 99% chance you don't have the disease. A test result, no matter how accurate, can only shift you away from that 99% starting point; it does not begin from "probably sick." Many people, clinical practitioners included, intuitively skip this step and assume a positive test implies a high likelihood of disease.
Step 2: The Likelihood
$P(+ \mid D) = 0.95$: the test catches 95% of sick patients. This is the test's sensitivity.
$P(+ \mid \neg D) = 0.10$: 10% of healthy people also test positive. This is the false positive rate (1 − specificity).
Step 3: The Marginal (Total Probability of Testing Positive)
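By the law of total probability, positive tests come from two sources: sick people who test positive, and healthy people who test positive:

$$
P(+) = P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.0095 + 0.0990 = 0.1085
$$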
Only 10.85% of all people tested will test positive. The denominator is dominated by the second term: healthy people who falsely test positive.
Step 4: The Posterior
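$$
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)} = \frac{0.95 \times 0.01}{0.1085} = \frac{0.0095}{0.1085} \approx 0.0876
$$

Even with a positive test, there is only an 8.76% chance of actually having the disease.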
The positive test raised the probability from 1% (prior) to 8.76% (posterior) — an 8.76× update. But 91.24% of positive tests are still false positives. This is not a flaw in the test; it is the mathematical consequence of a low base rate.
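```python
# Inputs from the problem statement
P_D = 0.01        # prior P(D)
P_nD = 0.99       # P(¬D)
P_pos_D = 0.95    # sensitivity P(+|D)
P_pos_nD = 0.10   # false positive rate P(+|¬D)

# Step 3: evidence via the law of total probability
P_pos = P_pos_D * P_D + P_pos_nD * P_nD
# Step 4: posterior via Bayes theorem
P_D_given_pos = (P_pos_D * P_D) / P_pos

print(f"P(+) = {P_pos:.4f}")
print(f"P(D|+) = {P_D_given_pos:.4f} = {P_D_given_pos*100:.2f}%")
print(f"P(¬D|+) = {1-P_D_given_pos:.4f} = {(1-P_D_given_pos)*100:.2f}%")
```

```
P(+) = 0.1085
P(D|+) = 0.0876 = 8.76%
P(¬D|+) = 0.9124 = 91.24%
```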
Population Counting — Making This Concrete
Instead of fractions, imagine 10,000 people screened:
| Group | Count | Test Result |
|---|---|---|
| Actually have disease | 100 | 95 test positive (true positives) |
| | | 5 test negative (false negatives) |
| Actually healthy | 9,900 | 990 test positive (false positives) |
| | | 8,910 test negative (true negatives) |
| Total positive tests | 1,085 | 95 + 990 = 1085 |
$P(D \mid +) = 95 / 1085 \approx 8.76\%$: same answer, now visually clear.
[Figure: tree diagram of the 10,000 screened people. 100 have the disease (95 true positives, 5 false negatives) and 9,900 are healthy (990 false positives, 8,910 true negatives). A dashed blue box marks all positives: 95 + 990 = 1085, so P(D|+) = 95/1085 = 8.76%.]
The blue box encloses all 1,085 positive tests. Only 95 of them (the blue TP box, top-left) are true positives. The red FP box contributes 990 — more than 10× the true positives.
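The same counting argument can be sketched in a few lines of Python. The variable names are illustrative, and `round` is used only to turn expected counts into whole people, an assumption that works cleanly for these particular numbers.

```python
population = 10_000
prevalence = 0.01
sensitivity = 0.95
false_positive_rate = 0.10

# Split the population by true disease status
sick = round(population * prevalence)              # 100
healthy = population - sick                        # 9,900

# Split each group by test result
true_pos = round(sick * sensitivity)               # 95
false_neg = sick - true_pos                        # 5
false_pos = round(healthy * false_positive_rate)   # 990
true_neg = healthy - false_pos                     # 8,910

# Among everyone who tested positive, what fraction is actually sick?
all_pos = true_pos + false_pos
print(f"positives: {all_pos}, P(D|+) = {true_pos}/{all_pos} = {true_pos / all_pos:.4f}")
```

Working with whole counts avoids conditional-probability notation entirely: the posterior is just the share of true positives among all positives.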
Effect of Changing the Prior
The posterior depends heavily on the prior. A test with the same sensitivity (95%) and specificity (90%) gives dramatically different posteriors depending on disease prevalence:
```python
# Sweep the prior while holding the test fixed (sensitivity 95%, false positive rate 10%)
priors = [0.001, 0.01, 0.05, 0.10, 0.50]
print(f"{'Prior P(D)':>12} {'P(+)':>8} {'Posterior P(D|+)':>18}")
for P_D in priors:
    P_pos = 0.95 * P_D + 0.10 * (1 - P_D)   # law of total probability
    posterior = (0.95 * P_D) / P_pos        # Bayes theorem
    print(f"{P_D:>12.3f} {P_pos:>8.4f} {posterior:>18.4f} ({posterior*100:.1f}%)")
```

```
  Prior P(D)     P(+)   Posterior P(D|+)
       0.001   0.1009             0.0094 (0.9%)
       0.010   0.1085             0.0876 (8.8%)
       0.050   0.1425             0.3333 (33.3%)
       0.100   0.1850             0.5135 (51.4%)
       0.500   0.5250             0.9048 (90.5%)
```
| Prior $P(D)$ | Posterior $P(D \mid +)$ |
|---|---|
| 0.1% | 0.9% |
| 1% | 8.8% |
| 5% | 33.3% |
| 10% | 51.4% |
| 50% | 90.5% |
[Figure: posterior P(D|+) plotted against prior P(D) for priors of 0.1%, 1%, 5%, 10%, and 50%. The 1% → 8.8% point is highlighted as "our example".]
At prior = 0.1%, a positive test barely moves the needle (0.9% posterior). At prior = 50%, a positive test is almost definitive (90.5%). The test sensitivity and specificity are fixed — the prior does the heavy lifting.
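One way to see why the prior does the heavy lifting is the odds form of Bayes theorem, an equivalent rearrangement that this post does not otherwise use: posterior odds = prior odds × likelihood ratio. The sketch below assumes that form, and `to_odds` and `to_prob` are hypothetical helper names. For this test the likelihood ratio is fixed at 0.95 / 0.10 = 9.5, so every positive result multiplies the odds by the same factor regardless of where you started.

```python
def to_odds(p: float) -> float:
    """Convert a probability (strictly below 1) to odds."""
    return p / (1.0 - p)


def to_prob(odds: float) -> float:
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


sensitivity = 0.95
false_positive_rate = 0.10
# Likelihood ratio of a positive result: a property of the test alone
likelihood_ratio = sensitivity / false_positive_rate

for prior in [0.001, 0.01, 0.05, 0.10, 0.50]:
    posterior = to_prob(to_odds(prior) * likelihood_ratio)
    print(f"prior {prior:.3f} -> posterior {posterior:.4f}")
```

The multiplier never changes; only the starting odds do, which is why a rare disease stays unlikely after one positive test while a 50% prior becomes near-certain.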
Connecting to ML: Generative vs Discriminative
Bayes theorem is the foundation of generative classifiers: models that explicitly learn $P(x \mid y)$ (how features $x$ are distributed within each class $y$) and $P(y)$ (the class prior), then infer the class label via:

$$
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}
$$
Naive Bayes, Linear Discriminant Analysis, and Hidden Markov Models are all generative.
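To make the generative recipe concrete, here is a minimal sketch of a one-feature classifier that estimates a class prior and a per-class Gaussian, then applies Bayes theorem. The training values, the class names, and the choice of a single Gaussian feature are all assumptions made for illustration; this is not a full Naive Bayes implementation.

```python
import math
from statistics import mean, pstdev

# Hypothetical 1-D training data: feature values grouped by class label
train = {
    "healthy": [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0],
    "sick": [6.9, 7.2, 7.0, 6.8],
}


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# "Training": estimate the class prior and the parameters of the
# class-conditional feature distribution for each class
n_total = sum(len(xs) for xs in train.values())
model = {
    label: {"prior": len(xs) / n_total, "mu": mean(xs), "sigma": pstdev(xs)}
    for label, xs in train.items()
}


def posterior(x: float) -> dict:
    """Return the posterior probability of every class given feature x."""
    # Numerator of Bayes theorem for each class: likelihood * prior
    joint = {
        label: gaussian_pdf(x, m["mu"], m["sigma"]) * m["prior"]
        for label, m in model.items()
    }
    # Denominator: total probability of x, summed over classes
    evidence = sum(joint.values())
    return {label: j / evidence for label, j in joint.items()}


print(posterior(6.2))
```

The structure mirrors the disease example: the class frequencies play the role of prevalence, the Gaussian density plays the role of sensitivity and false positive rate, and the sum over classes is the evidence term. For inputs very far from every class mean the densities can underflow to zero, so a practical version would work in log space.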
Discriminative classifiers (logistic regression, SVM, random forests) skip the generative model and directly learn $P(y \mid x)$, the probability of class $y$ given features $x$, from training data. They don't need to know how features were generated; they only need to learn the decision boundary.
| Approach | What it models | Examples |
|---|---|---|
| Generative | $P(x \mid y)$ and $P(y)$ → derives $P(y \mid x)$ | Naive Bayes, LDA, HMM |
| Discriminative | $P(y \mid x)$ directly | Logistic Regression, SVM, Neural Nets |
Generative models make stronger assumptions about the data, which often lets them reach good accuracy with less training data when those assumptions roughly hold. Discriminative models are more flexible but typically need more samples to estimate the decision boundary well.
Test Your Understanding
- The posterior after one positive test is $P(D \mid +) = 8.76\%$. After a second independent positive test, what is the new posterior? Use the first posterior (8.76%) as the new prior for the second test and compute $P(D \mid +, +)$.
- Test specificity increases from 90% to 99% (false positive rate drops from 10% to 1%). Recompute $P(D \mid +)$ for the original disease prevalence of 1%. How does this compare to 8.76%?
- In the population of 10,000: 95 true positives and 990 false positives. If you screen only a high-risk sub-population where prevalence is 10% (instead of 1%), how many true and false positives would you expect? What is $P(D \mid +)$ in this sub-population?
- The formula $P(y \mid x) \propto P(x \mid y)\, P(y)$ gives an un-normalized posterior for each class, computed independently of the others. If we have 3 classes and compute the un-normalized posterior for each, how do we normalize to get probabilities summing to 1?
- Generative classifiers model $P(x \mid y)$, which requires specifying how features are distributed. Naive Bayes assumes Gaussian or Multinomial distributions. What goes wrong if the actual feature distribution is heavily skewed and you use a Gaussian assumption?