Bayes Theorem

Jun 26, 2026 · 7 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Most classifiers learn a direct mapping from features to class labels. Naive Bayes takes a different route: it models how each class generates features, then uses probability theory to invert the question — given these features, how likely is each class? The inversion is Bayes theorem, and understanding it precisely is worth the time before touching any classifier.

Anchor problem: A medical test for a rare disease. Given a positive test result, what is the actual probability of having the disease?

python
# Disease prevalence (prior): 1% of population
# Test sensitivity: 95% — if you have disease, test is positive 95% of the time
# Test specificity: 90% — if you don't have disease, test is negative 90% of the time
# Test false positive rate: 1 - specificity = 10%

P_disease        = 0.01   # P(D)
P_no_disease     = 0.99   # P(¬D)
P_pos_given_D    = 0.95   # P(+|D)  — sensitivity
P_pos_given_noD  = 0.10   # P(+|¬D) — false positive rate

Bayes Theorem — Statement

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In English: the probability of A given B equals the probability of B given A, times our prior belief in A, divided by the overall probability of B.
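The statement follows in one step from the definition of conditional probability, assuming $P(A) > 0$ and $P(B) > 0$. A short derivation, writing the joint probability $P(A \cap B)$ in two ways:

```latex
\begin{aligned}
P(A \mid B)\,P(B) &= P(A \cap B) = P(B \mid A)\,P(A) \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{aligned}
```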

For the disease problem:

  • A = "has disease" (D)
  • B = "positive test result" (+)

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)}$$

Each term has a name:

| Term | Notation | Meaning in This Problem |
|---|---|---|
| Posterior | $P(D \mid +)$ | Probability of disease given positive test: what we want |
| Likelihood | $P(+ \mid D)$ | Probability of positive test if you have the disease |
| Prior | $P(D)$ | Base rate of disease in the population, before seeing the test |
| Evidence | $P(+)$ | Overall probability of testing positive (from all sources) |

Step 1: The Prior

$P(D) = 0.01$: only 1 in 100 people in the population has this disease.

This number does enormous work. Before you walk into the testing clinic, there is a 99% chance you don't have the disease. A test result, however accurate, can only shift that starting belief; it does not by itself make you "probably sick." Many people, clinicians included, intuitively skip this step and assume a positive test implies a high probability of disease.

Step 2: The Likelihood

$P(+ \mid D) = 0.95$: the test catches 95% of sick patients. This is the test's sensitivity.

$P(+ \mid \neg D) = 0.10$: 10% of healthy people also test positive. This is the false positive rate (1 − specificity).

Step 3: The Marginal (Total Probability of Testing Positive)

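By the law of total probability, positive tests come from two sources: sick people who test positive, and healthy people who test positive:

$$P(+) = P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D) = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.0095 + 0.0990 = 0.1085$$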

Only 10.85% of all people tested will test positive. The denominator is dominated by the second term: healthy people who falsely test positive.
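A minimal sketch of that split, using the numbers from the anchor problem (the variable names are illustrative), makes the imbalance between the two sources explicit:

```python
# Values from the anchor problem; variable names are illustrative.
p_d = 0.01              # P(D), prior
p_pos_given_d = 0.95    # P(+|D), sensitivity
p_pos_given_nd = 0.10   # P(+|¬D), false positive rate

# Law of total probability: P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
from_sick = p_pos_given_d * p_d              # positives contributed by sick people
from_healthy = p_pos_given_nd * (1 - p_d)    # positives contributed by healthy people
p_pos = from_sick + from_healthy

print(f"sick term    = {from_sick:.4f}")
print(f"healthy term = {from_healthy:.4f}")
print(f"P(+)         = {p_pos:.4f}")
print(f"share of positives from healthy people = {from_healthy / p_pos:.1%}")
```

The healthy term is roughly ten times the sick term, simply because there are 99 healthy people for every sick one.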

Step 4: The Posterior

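$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)} = \frac{0.95 \times 0.01}{0.1085} = \frac{0.0095}{0.1085} \approx 0.0876$$

Even with a positive test, there is only an 8.76% chance of actually having the disease.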

The positive test raised the probability from 1% (prior) to 8.76% (posterior) — an 8.76× update. But 91.24% of positive tests are still false positives. This is not a flaw in the test; it is the mathematical consequence of a low base rate.

python
P_D  = 0.01
P_nD = 0.99
P_pos_D  = 0.95
P_pos_nD = 0.10

P_pos = P_pos_D * P_D + P_pos_nD * P_nD
P_D_given_pos = (P_pos_D * P_D) / P_pos

print(f"P(+)    = {P_pos:.4f}")
print(f"P(D|+)  = {P_D_given_pos:.4f} = {P_D_given_pos*100:.2f}%")
print(f"P(¬D|+) = {1-P_D_given_pos:.4f} = {(1-P_D_given_pos)*100:.2f}%")
P(+)    = 0.1085
P(D|+)  = 0.0876 = 8.76%
P(¬D|+) = 0.9124 = 91.24%
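The same posterior can be reached through the odds form of Bayes' theorem: posterior odds equal prior odds times the likelihood ratio P(+|D) / P(+|¬D). The sketch below is only a restatement of the calculation above, not a different method; the helper names `to_odds` and `to_prob` are illustrative.

```python
def to_odds(p: float) -> float:
    """Convert a probability to odds in favour."""
    return p / (1 - p)

def to_prob(odds: float) -> float:
    """Convert odds in favour back to a probability."""
    return odds / (1 + odds)

prior = 0.01
likelihood_ratio = 0.95 / 0.10   # P(+|D) / P(+|¬D)

# Posterior odds = prior odds * likelihood ratio
posterior_odds = to_odds(prior) * likelihood_ratio
posterior = to_prob(posterior_odds)

print(f"likelihood ratio = {likelihood_ratio:.1f}")
print(f"posterior P(D|+) = {posterior:.4f}")
```

A positive result multiplies the odds by the same factor whatever the prevalence; starting from odds of about 1 to 99, that is still far from even odds.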

Population Counting — Making This Concrete

Instead of fractions, imagine 10,000 people screened:

| Group | Count | Test Result |
|---|---|---|
| Actually have disease | 100 | 95 test positive (true positives) |
| | | 5 test negative (false negatives) |
| Actually healthy | 9,900 | 990 test positive (false positives) |
| | | 8,910 test negative (true negatives) |
| Total positive tests | | 95 + 990 = 1,085 |

$$P(D \mid +) = \frac{95}{95 + 990} = \frac{95}{1085} \approx 0.0876$$

Same answer, now visually clear.

[Figure: tree diagram. 10,000 people split into 100 who have the disease (95 TP, 5 FN) and 9,900 healthy (990 FP, 8,910 TN). All positives: 95 + 990 = 1,085, so P(D|+) = 95/1085 = 8.76%.]

The blue box encloses all 1,085 positive tests. Only 95 of them (the blue TP box, top-left) are true positives. The red FP box contributes 990 — more than 10× the true positives.
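The counting argument can be sketched as a small function that fills in the four cells for any population size. The function name `screen` and its parameters are illustrative, and the counts are expected values rather than a simulation.

```python
def screen(population: int, prevalence: float,
           sensitivity: float, specificity: float) -> dict:
    """Expected count in each cell of the screening table."""
    sick = population * prevalence
    healthy = population - sick
    tp = sick * sensitivity        # sick and test positive
    fn = sick - tp                 # sick but missed
    tn = healthy * specificity     # healthy and test negative
    fp = healthy - tn              # healthy but flagged
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

counts = screen(10_000, prevalence=0.01, sensitivity=0.95, specificity=0.90)
# Posterior as a ratio of counts: true positives over all positives
p_d_given_pos = counts["TP"] / (counts["TP"] + counts["FP"])

print({name: round(value) for name, value in counts.items()})
print(f"P(D|+) = {p_d_given_pos:.4f}")
```

Changing `prevalence` or `specificity` in the call shows how each cell moves without redoing the algebra.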

Effect of Changing the Prior

The posterior depends heavily on the prior. A test with the same sensitivity (95%) and specificity (90%) gives dramatically different posteriors depending on disease prevalence:

python
priors = [0.001, 0.01, 0.05, 0.10, 0.50]
print(f"{'Prior P(D)':>12} {'P(+)':>8} {'Posterior P(D|+)':>18}")
for P_D in priors:
    P_pos = 0.95*P_D + 0.10*(1-P_D)
    posterior = (0.95 * P_D) / P_pos
    print(f"{P_D:>12.3f} {P_pos:>8.4f} {posterior:>18.4f} ({posterior*100:.1f}%)")
  Prior P(D)     P(+)   Posterior P(D|+)
       0.001   0.1009             0.0094 (0.9%)
       0.010   0.1085             0.0876 (8.8%)
       0.050   0.1425             0.3333 (33.3%)
       0.100   0.1850             0.5135 (51.4%)
       0.500   0.5250             0.9048 (90.5%)
| Prior P(D) | Posterior P(D\|+) |
|---|---|
| 0.1% | 0.9% |
| 1% | 8.8% |
| 5% | 33.3% |
| 10% | 51.4% |
| 50% | 90.5% |

[Figure: line plot of posterior P(D|+) against prior P(D) for the five priors above; our example (1% → 8.8%) is highlighted.]

At prior = 0.1%, a positive test barely moves the needle (0.9% posterior). At prior = 50%, a positive test is almost definitive (90.5%). The test sensitivity and specificity are fixed — the prior does the heavy lifting.
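A natural follow-up is how high the prior must be before a positive test makes disease more likely than not. Setting P(D|+) = 0.5 in Bayes' theorem gives sensitivity × P(D) = FPR × (1 − P(D)), so P(D) = FPR / (sensitivity + FPR). The sketch below checks that algebra numerically; the function name `posterior` is illustrative.

```python
def posterior(prior: float, sensitivity: float = 0.95, fpr: float = 0.10) -> float:
    """P(D|+) from Bayes' theorem for a binary test."""
    p_pos = sensitivity * prior + fpr * (1 - prior)
    return sensitivity * prior / p_pos

# Prior at which a positive test leaves exactly even odds:
# sensitivity * p = fpr * (1 - p)  =>  p = fpr / (sensitivity + fpr)
break_even = 0.10 / (0.95 + 0.10)

print(f"break-even prior = {break_even:.4f}")
print(f"posterior there  = {posterior(break_even):.4f}")
```

For this test the break-even prevalence is a little under 10%, which matches the table: a 10% prior already gives a posterior just above one half.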

Connecting to ML: Generative vs Discriminative

Bayes theorem is the foundation of generative classifiers: models that explicitly learn $P(x \mid y)$ (how features are distributed within each class) and $P(y)$ (class prior), then infer the class label via:

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} \propto P(x \mid y)\,P(y)$$

Naive Bayes, Linear Discriminant Analysis, and Hidden Markov Models are all generative.
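To make the generative recipe concrete, here is a minimal sketch of a one-feature classifier built only from the standard library. It assumes each class's feature values are Gaussian, and the training numbers are made up for illustration; it is not how any particular library implements Naive Bayes or LDA.

```python
from statistics import NormalDist, mean, stdev

# Toy 1-D training data (made up for illustration): one feature, two classes.
data = {
    "healthy": [4.8, 5.1, 5.0, 5.3, 4.9, 5.2],
    "sick":    [6.9, 7.2, 7.0, 6.8],
}

n_total = sum(len(xs) for xs in data.values())
# Generative step: estimate the class prior P(y) and a Gaussian model of P(x|y).
priors = {c: len(xs) / n_total for c, xs in data.items()}
likelihoods = {c: NormalDist(mean(xs), stdev(xs)) for c, xs in data.items()}

def classify(x: float) -> dict:
    """Posterior P(y|x) for every class, via Bayes' theorem."""
    unnormalized = {c: likelihoods[c].pdf(x) * priors[c] for c in data}
    evidence = sum(unnormalized.values())   # P(x), by the law of total probability
    return {c: score / evidence for c, score in unnormalized.items()}

post = classify(6.5)
print({c: round(p, 3) for c, p in post.items()})
```

The structure mirrors the disease example: the class prior plays the role of prevalence and the fitted Gaussian plays the role of the test's sensitivity.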

Discriminative classifiers (logistic regression, SVM, random forests) skip the generative model and learn $P(y \mid x)$, or just the decision boundary, directly from training data. They don't need to know how features were generated; they only need to separate the classes.

| Approach | What it models | Examples |
|---|---|---|
| Generative | $P(x \mid y)$ and $P(y)$ → derives $P(y \mid x)$ | Naive Bayes, LDA, HMM |
| Discriminative | $P(y \mid x)$ directly | Logistic Regression, SVM, Neural Nets |

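Generative models make stronger assumptions about how features are distributed. When those assumptions roughly hold, they often reach reasonable accuracy with less training data. Discriminative models make fewer assumptions and are more flexible, but typically need more samples to estimate the decision boundary well.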

Test Your Understanding

  1. The posterior is $P(D \mid +) = 0.0876$. After a second independent positive test, what is the new posterior? Use the first posterior (8.76%) as the new prior for the second test and compute $P(D \mid +, +)$.

  2. Test specificity increases from 90% to 99% (false positive rate drops from 10% to 1%). Recompute $P(D \mid +)$ for the original disease prevalence of 1%. How does this compare to 8.76%?

  3. In the population of 10,000: 95 true positives and 990 false positives. If you screen only a high-risk sub-population where prevalence is 10% (instead of 1%), how many true and false positives would you expect? What is $P(D \mid +)$ in this sub-population?

  4. The formula $P(y \mid x) \propto P(x \mid y)\,P(y)$ gives an un-normalized score for each class, computed independently of the others. If we have 3 classes and compute the un-normalized posterior for each, how do we normalize to get probabilities summing to 1?

  5. Generative classifiers model $P(x \mid y)$, which requires specifying how features are distributed. Common Naive Bayes variants assume Gaussian or Multinomial distributions. What goes wrong if the actual feature distribution is heavily skewed and you use a Gaussian assumption?
