Conditional Probability

Jun 21, 2026 · 12 min read · By Mohammed Vasim
Statistics · Math · Data Science

A fraud detection model evaluated on 1000 test transactions makes 135 positive predictions, and 85 of those predictions are correct. So precision, the fraction of positive predictions that are actually fraud, is 85/135 ≈ 63%. The test set contains 100 actual fraud cases, so the model's recall, the fraction of actual fraud cases it catches, is 85/100 = 85%. These look like very different numbers, and they are: they answer different questions. Both are conditional probabilities, and computing one from the other requires understanding exactly what conditioning does.

Anchor: Confusion matrix from a binary fraud classifier on 1000 test transactions.

| | Predicted Positive | Predicted Negative | Total |
|---|---|---|---|
| Actual Positive (Fraud) | 85 | 15 | 100 |
| Actual Negative (Legit) | 50 | 850 | 900 |
| Total | 135 | 865 | 1000 |

Every calculation in this post uses these 4 cells. No other dataset.


Restricting the Sample Space

Before any formula: what conditioning means geometrically.

Unconditional: P(fraud) = 100/1000 = 0.10. The denominator is all 1000 test cases.

Conditional: "Given the model predicted positive, what is the probability the transaction is actually fraud?" Now the denominator is not 1000 — it's only the 135 predicted-positive cases. The sample space shrank. Inside those 135 cases, 85 are actual fraud.

P(fraud | predicted positive) = 85/135 ≈ 0.630

The operation "given B" means: discard everything outside B and recount. The probability inside the shrunken space is what you're measuring.

| Question | Universe | Count | Probability |
|---|---|---|---|
| P(fraud) | All 1000 transactions | 100 fraud | 0.100 |
| P(fraud \| predicted positive) | Only 135 predicted positive | 85 fraud among them | 0.630 |
| P(predicted positive \| fraud) | Only 100 actual fraud | 85 predicted positive among them | 0.850 |

Same numerator (85 true positives). Different denominator. Different answer. This is the core asymmetry.
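The "discard everything outside B and recount" reading can be sketched directly in code. The snippet below rebuilds the 1000 transactions from the four cells and computes each probability by filtering and counting. The tuple layout and the `cond_prob` helper are illustrative names chosen for this sketch, not part of any library.

```python
from fractions import Fraction

# Rebuild the 1000 transactions as (is_fraud, predicted_positive) pairs
# from the four cells of the anchor confusion matrix.
transactions = (
    [(True, True)] * 85        # true positives
    + [(True, False)] * 15     # false negatives
    + [(False, True)] * 50     # false positives
    + [(False, False)] * 850   # true negatives
)

def cond_prob(event, given, cases):
    """P(event | given): keep only the cases where `given` holds, then recount."""
    restricted = [c for c in cases if given(c)]
    if not restricted:
        raise ValueError("conditioning event never occurs")
    return Fraction(sum(1 for c in restricted if event(c)), len(restricted))

def is_fraud(case):
    return case[0]

def is_pp(case):
    return case[1]

def anything(case):
    return True

p_fraud = cond_prob(is_fraud, anything, transactions)        # universe: all 1000
p_fraud_given_pp = cond_prob(is_fraud, is_pp, transactions)  # universe: 135
p_pp_given_fraud = cond_prob(is_pp, is_fraud, transactions)  # universe: 100

print(p_fraud, p_fraud_given_pp, p_pp_given_fraud)
```

This prints `1/10 17/27 17/20`, the exact forms of 0.100, 0.630 and 0.850. Only the filter changes between the three calls, which is the whole point: conditioning changes the denominator.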


Formal Definition

From the shrinking sample space argument: in the restricted universe of B, the fraction that is also A equals (cases that are both A and B) / (cases that are B). Dividing numerator and denominator by total n:

P(A|B) = P(A ∩ B) / P(B)

This is not an arbitrary formula — it is the mathematical statement of "look only inside B, then ask what fraction is A."

[Figure: Venn diagram inside Ω (1000 transactions). A = fraud, B = predicted positive (shaded), A∩B = 85 true positives; the B-only region is the false positives. Conditioning on B shrinks Ω to 135 cases, so P(A|B) = 85/135.]

Four Standard Queries

Every confusion matrix metric is a conditional probability. Each query: what question is being asked → restrict the sample space → compute.

Query 1: P(fraud | predicted positive) — Precision

Restrict to predicted-positive cases (135 total). What fraction are actual fraud?

P(fraud | predicted positive) = P(fraud ∩ PP) / P(PP) = (85/1000) / (135/1000) = 85/135 ≈ 0.630

When the model says "fraud," it's right 63% of the time.

Query 2: P(predicted positive | fraud) — Recall (Sensitivity)

Restrict to actual fraud cases (100 total). What fraction did the model flag?

P(PP | fraud) = P(fraud ∩ PP) / P(fraud) = (85/1000) / (100/1000) = 85/100 = 0.850

The model catches 85% of actual fraud.

Query 3: P(fraud | predicted negative) — False Omission Rate

Restrict to predicted-negative cases (865 total). What fraction are actual fraud?

P(fraud | PN) = P(fraud ∩ PN) / P(PN) = (15/1000) / (865/1000) = 15/865 ≈ 0.017

When the model says "not fraud," only 1.7% of those are actually fraud — the model's misses are rare.

Query 4: P(predicted negative | legitimate) — Specificity

Restrict to actual legitimate cases (900 total). What fraction did the model correctly label negative?

P(PN | legit) = P(legit ∩ PN) / P(legit) = (850/1000) / (900/1000) = 850/900 ≈ 0.944

The model correctly clears 94.4% of legitimate transactions.

```python
# The four cells of the anchor confusion matrix
TP, FP, FN, TN = 85, 50, 15, 850
n = TP + FP + FN + TN

precision   = TP / (TP + FP)   # P(fraud | predicted positive)
recall      = TP / (TP + FN)   # P(predicted positive | fraud)
for_rate    = FN / (FN + TN)   # P(fraud | predicted negative), false omission rate
specificity = TN / (TN + FP)   # P(predicted negative | legit)

print(f"Precision   P(fraud|PP)  = {precision:.3f}")
print(f"Recall      P(PP|fraud)  = {recall:.3f}")
print(f"FOR         P(fraud|PN)  = {for_rate:.3f}")
print(f"Specificity P(PN|legit)  = {specificity:.3f}")
```

```
Precision   P(fraud|PP)  = 0.630
Recall      P(PP|fraud)  = 0.850
FOR         P(fraud|PN)  = 0.017
Specificity P(PN|legit)  = 0.944
```

P(A|B) ≠ P(B|A): The Asymmetry

This is the most consequential fact in conditional probability.

From the four queries: precision = 0.630 and recall = 0.850. They share the same numerator (85 true positives) but have different denominators (135 vs 100). They answer different questions and produce different numbers.

The asymmetry matters in two directions:

Precision << Recall: A model can have recall=0.99 and precision=0.05 simultaneously. This happens when fraud is very rare and the model flags many legitimate transactions. The model catches almost everything (high recall) but most of what it flags is not actually fraud (low precision).

Precision >> Recall: A conservative model that only flags transactions with overwhelming evidence. It's right almost every time it flags (high precision) but misses a lot of actual fraud (low recall).

[Figure: two Venn diagrams of A (fraud) and B (predicted positive) with the same overlap A∩B. Left panel: denominator = B, so P(A|B) = 0.630. Right panel: denominator = A, so P(B|A) = 0.850.]

The prosecutor's fallacy: P(DNA match | innocent) might be 0.0001. A prosecutor argues "the probability of innocence is 0.0001." Wrong: that claim is about P(innocent | DNA match), a different quantity, and computing it requires the base rate of innocence in the suspect population. Confusing the two can send innocent people to jail.

ML evaluation version: P(predicted positive | fraud) = recall = 0.85. This does not mean 85% of positive predictions are fraud. That's precision = 0.63. Always name which direction you're conditioning.
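To see how recall = 0.99 and precision = 0.05 can coexist, here is a small sketch with made-up counts. These numbers are hypothetical and are not the anchor matrix: they assume 100 fraud cases among 100,000 transactions and a model that flags aggressively.

```python
# Hypothetical counts for illustration only (NOT the anchor confusion matrix):
# 100 fraud cases among 100,000 transactions.
TP, FN = 99, 1           # the model catches 99 of the 100 fraud cases
FP, TN = 1881, 98019     # but it also flags 1,881 legitimate transactions

n = TP + FN + FP + TN
base_rate = (TP + FN) / n        # P(fraud)
recall = TP / (TP + FN)          # P(predicted positive | fraud)
precision = TP / (TP + FP)       # P(fraud | predicted positive)

print(f"base rate = {base_rate:.4f}")
print(f"recall    = {recall:.2f}")
print(f"precision = {precision:.2f}")
```

The numerator is 99 in both ratios. Recall divides it by the 100 actual fraud cases, while precision divides it by the 1,980 flagged transactions, most of which are legitimate because fraud is so rare.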


Chain Rule

The definition P(A|B) = P(A∩B)/P(B) rearranges to:

P(A ∩ B) = P(A|B) × P(B)

This is the chain rule for two events. For three events, apply iteratively:

P(A ∩ B ∩ C) = P(A|B∩C) × P(B|C) × P(C)

Derive step by step:

  1. P(A∩B∩C) = P(A | B∩C) × P(B∩C) [from definition, conditioning on B∩C]
  2. P(B∩C) = P(B|C) × P(C) [from definition, conditioning on C]
  3. Substitute: P(A∩B∩C) = P(A|B∩C) × P(B|C) × P(C)

Verification on anchor:

P(fraud ∩ predicted positive) = P(PP | fraud) × P(fraud) = 0.85 × 0.10 = 0.085

Check: 85/1000 = 0.085 ✓

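[Figure: probability tree for the chain rule. The root splits into fraud (P = 0.10) and legit (P = 0.90); each class then splits into PP and PN.] One root-to-leaf path per line:

fraud∩PP: P(PP|fraud) × P(fraud) = 0.85 × 0.10 = 0.085

fraud∩PN: P(PN|fraud) × P(fraud) = 0.15 × 0.10 = 0.015

legit∩PP: P(PP|legit) × P(legit) = 0.056 × 0.90 ≈ 0.050

legit∩PN: P(PN|legit) × P(legit) = 0.944 × 0.90 ≈ 0.850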

Leaves sum to 1: 0.085+0.015+0.050+0.850 = 1.000 ✓
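The four leaf products can be recomputed from the confusion matrix. The sketch below uses exact fractions so that rounded values such as 0.056 do not introduce error. The dictionary names are illustrative choices for this example.

```python
from fractions import Fraction

TP, FP, FN, TN = 85, 50, 15, 850
n = TP + FP + FN + TN

# First-level branches: P(class)
p_class = {"fraud": Fraction(TP + FN, n), "legit": Fraction(FP + TN, n)}

# Second-level branches: P(prediction | class)
p_pred_given_class = {
    "fraud": {"PP": Fraction(TP, TP + FN), "PN": Fraction(FN, TP + FN)},
    "legit": {"PP": Fraction(FP, FP + TN), "PN": Fraction(TN, FP + TN)},
}

# Chain rule: each leaf is P(prediction | class) * P(class)
leaves = {
    (cls, pred): p_pred_given_class[cls][pred] * p_class[cls]
    for cls in p_class
    for pred in ("PP", "PN")
}

for (cls, pred), p in leaves.items():
    print(f"P({cls} ∩ {pred}) = {float(p):.3f}")
print("sum of leaves =", sum(leaves.values()))
```

Each leaf equals the corresponding cell divided by 1000, and the leaves sum to exactly 1 because the four paths partition the sample space.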


Independence: Formal Definition

Events A and B are independent if and only if:

P(A|B) = P(A) [conditioning on B gives no new information about A]

Equivalently: P(A ∩ B) = P(A) × P(B) — the joint probability factorizes.

Test on anchor: Is fraud independent of the model's prediction?

P(fraud) = 0.10

P(fraud | predicted positive) = 0.630

0.630 ≠ 0.10 → NOT independent

Correct — a useful model correlates its predictions with truth. A random classifier would have P(fraud | PP) = P(fraud) = 0.10 (knowing the prediction gives no information). That would be a useless, maximally uninformative model.
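The equivalent factorization test can be run on the same four cells. The sketch below compares the observed joint probability with the product of the marginals. The ratio of the two, often called lift, is a label added here for convenience and is not used elsewhere in this post.

```python
TP, FP, FN, TN = 85, 50, 15, 850
n = TP + FP + FN + TN

p_fraud = (TP + FN) / n    # marginal P(fraud)
p_pp = (TP + FP) / n       # marginal P(predicted positive)
p_joint = TP / n           # observed P(fraud ∩ predicted positive)

# Under independence the joint would factorize into the product of marginals.
p_joint_if_independent = p_fraud * p_pp

# Ratio > 1: a positive prediction raises the probability of fraud.
lift = p_joint / p_joint_if_independent

print(f"observed joint       = {p_joint:.4f}")
print(f"product of marginals = {p_joint_if_independent:.4f}")
print(f"ratio (lift)         = {lift:.2f}")
```

The observed joint (0.085) is roughly six times what independence would predict (0.0135), which is another way of saying the prediction carries information about the true class.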

Conditional independence: Two events can be dependent unconditionally but independent given a third event. Two model features might be correlated unconditionally, but when you condition on the true class label, the correlation vanishes — the features share no information beyond what the class already explains. This is the assumption Naive Bayes exploits:

P(x₁, x₂, ..., xₙ | y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)

Each feature's likelihood is estimated separately under the class label. The "naive" label reflects that this conditional independence assumption is usually wrong — but the posterior ranking is often still correct.
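A minimal sketch of that factorization, with invented numbers: the two binary features and their likelihoods below are hypothetical (only the 0.10 prior comes from the anchor), and the helper name is illustrative.

```python
import math

# Prior P(class): the fraud rate from the anchor. Everything else is hypothetical.
prior = {"fraud": 0.10, "legit": 0.90}

# Hypothetical P(feature_i = 1 | class) for two binary features.
likelihood = {
    "fraud": [0.60, 0.70],
    "legit": [0.10, 0.20],
}

def unnormalized_posterior(x, cls):
    """P(class) * product over i of P(x_i | class), the naive factorization."""
    terms = [p if xi == 1 else 1 - p for xi, p in zip(x, likelihood[cls])]
    return prior[cls] * math.prod(terms)

x = [1, 1]  # both features present
scores = {cls: unnormalized_posterior(x, cls) for cls in prior}
total = sum(scores.values())  # P(x), via the law of total probability
posterior = {cls: s / total for cls, s in scores.items()}

print(posterior)
```

With these made-up inputs the fraud score is 0.10 × 0.60 × 0.70 = 0.042 and the legit score is 0.90 × 0.10 × 0.20 = 0.018, so the normalized posterior for fraud is 0.042 / 0.060 = 0.70. Each likelihood was specified per feature; no joint table over both features was needed.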


Law of Total Probability

When you cannot compute P(B) directly, decompose it through a partition of the sample space. If {A₁, A₂, ..., Aₖ} partitions Ω (mutually exclusive, exhaustive):

P(B) = Σᵢ P(B|Aᵢ) × P(Aᵢ)

Derivation on anchor: Compute P(predicted positive) without using the column total directly.

The partition: fraud (A₁) and legitimate (A₂).

[Figure: the sample space split into fraud (10%) and legitimate (90%). The predicted-positive region covers PP∩fraud = 0.085 and PP∩legit = 0.050, so P(PP) = 0.085 + 0.050 = 0.135.]

P(PP) = P(PP | fraud) × P(fraud) + P(PP | legit) × P(legit) = 0.85 × 0.10 + 0.056 × 0.90 ≈ 0.085 + 0.050 = 0.135

Check: 135/1000 = 0.135 ✓

This is how you compute a marginal probability when you only know the conditional probabilities and the partition weights.


Connection to Bayes' Theorem

Bayes' theorem follows directly from the definition of conditional probability applied in both directions:

P(A|B) = P(A ∩ B) / P(B) = [P(B|A) × P(A)] / P(B)

Applied once on the anchor using the law of total probability for the denominator:

P(fraud | PP) = P(PP | fraud) × P(fraud) / P(PP) = 0.85 × 0.10 / 0.135 = 0.085 / 0.135 = 0.630

Matches the direct calculation ✓. The full treatment — prior updating, sequential Bayesian inference, base rate neglect — is in the Bayes' theorem post. Here the key point is that Bayes' theorem is a consequence of the conditional probability definition, not a new axiom.
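The same calculation as a small function, offered as a sketch: `bayes_posterior` and its parameter names are illustrative, and the function assumes a two-way partition (A and not-A) for the denominator.

```python
def bayes_posterior(likelihood, prior, likelihood_complement):
    """Return P(A|B) given P(B|A), P(A) and P(B|not A).

    The denominator P(B) is expanded with the law of total probability
    over the partition {A, not A}.
    """
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) = 0: the conditional probability is undefined")
    return likelihood * prior / evidence

# Anchor values: P(PP|fraud) = 85/100, P(fraud) = 100/1000, P(PP|legit) = 50/900
posterior = bayes_posterior(
    likelihood=85 / 100,
    prior=100 / 1000,
    likelihood_complement=50 / 900,
)
print(f"P(fraud | PP) = {posterior:.3f}")
```

Using 50/900 rather than the rounded 0.056 keeps the result equal to 85/135 up to floating-point error.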


ML Applications

Precision and recall as conditional probabilities:

Every confusion matrix metric has a conditional probability interpretation. When you report model performance, you're reporting conditional probabilities — always name which direction you're conditioning. Reporting recall when your stakeholder needs precision is a technically correct but operationally misleading answer.

Naive Bayes and class-conditional likelihoods:

A Naive Bayes classifier estimates P(features | class) × P(class) for each class, then takes the argmax. Each term P(xᵢ | y) is a conditional probability estimated from training data: "given the class is y, what is the distribution of feature xᵢ?" The classifier exploits these conditional distributions plus the chain rule to assign class probabilities without ever needing to estimate the joint distribution of all features simultaneously.

Model calibration:

A well-calibrated model satisfies P(y=1 | score=s) ≈ s for all score values s. If the model outputs score=0.9 but P(y=1 | score=0.9) = 0.40 in reality, the model is overconfident — it says "90% confident" but only 40% of those predictions are correct. Calibration is precisely a conditional probability statement about the model's score output.
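A calibration check is the same counting exercise, conditioned on score bins. The sketch below uses ten hypothetical (score, label) pairs invented for illustration. The `calibration_table` helper and its equal-width binning are assumptions of this example, not a standard API.

```python
from collections import defaultdict

# Hypothetical (score, true label) pairs, for illustration only.
predictions = [
    (0.92, 1), (0.94, 0), (0.91, 0), (0.95, 1), (0.93, 0),
    (0.12, 0), (0.14, 0), (0.15, 1), (0.11, 0), (0.13, 0),
]

def calibration_table(pairs, n_bins=10):
    """Estimate P(y=1 | score in bin) by counting within equal-width bins."""
    bins = defaultdict(list)
    for score, label in pairs:
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((score, label))
    table = {}
    for idx, members in sorted(bins.items()):
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table[idx] = (mean_score, observed, len(members))
    return table

for idx, (mean_score, observed, count) in calibration_table(predictions).items():
    print(f"bin {idx}: mean score = {mean_score:.2f}, "
          f"observed P(y=1) = {observed:.2f}, n = {count}")
```

In this toy data the top bin has a mean score of 0.93 but only 2 of its 5 cases are positive (0.40), so the model would be overconfident there. With only 5 cases per bin the estimate is itself noisy.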

```python
# The four cells of the anchor confusion matrix
TP, FP, FN, TN = 85, 50, 15, 850
n = TP + FP + FN + TN

precision   = TP / (TP + FP)   # P(fraud | predicted positive)
recall      = TP / (TP + FN)   # P(predicted positive | fraud)
for_rate    = FN / (FN + TN)   # P(fraud | predicted negative), false omission rate
specificity = TN / (TN + FP)   # P(predicted negative | legit)

print(f"Precision    P(fraud|PP)   = {precision:.3f}")
print(f"Recall       P(PP|fraud)   = {recall:.3f}")
print(f"Asymmetry:   {precision:.3f} ≠ {recall:.3f}")
print()
print(f"FOR          P(fraud|PN)   = {for_rate:.3f}")
print(f"Specificity  P(PN|legit)   = {specificity:.3f}")
print()
# Law of total probability: P(PP) = P(PP|fraud) P(fraud) + P(PP|legit) P(legit)
p_fraud = (TP + FN) / n
p_legit = (FP + TN) / n
p_pp_given_fraud = TP / (TP + FN)
p_pp_given_legit = FP / (FP + TN)
p_pp = p_pp_given_fraud * p_fraud + p_pp_given_legit * p_legit
print(f"P(PP) via total probability = {p_pp:.3f}  [direct: {(TP+FP)/n:.3f}]")
```

```
Precision    P(fraud|PP)   = 0.630
Recall       P(PP|fraud)   = 0.850
Asymmetry:   0.630 ≠ 0.850

FOR          P(fraud|PN)   = 0.017
Specificity  P(PN|legit)   = 0.944

P(PP) via total probability = 0.135  [direct: 0.135]
```

Conditional probability is the foundation for Bayes' theorem (updating beliefs with evidence), the chain rule of probability (decomposing joint distributions), and Markov chains (where state transitions are conditional on the current state). In ML, it underlies precision, recall, Naive Bayes, model calibration, and every use of posterior probabilities. Independence testing — checking whether P(A|B) = P(A) — connects to feature selection and mutual information.

Honest Limitations

The conditional probability formula P(A|B) = P(A∩B)/P(B) requires P(B) > 0. Conditioning on a zero-probability event is undefined. In practice this means: if the conditioning event is very rare (B has very few samples), the estimate of P(A|B) becomes high-variance and unreliable. With 5 instances of a rare event type in your dataset, P(A | rare event) computed from those 5 cases is a noisy estimate. This is the small denominator problem in conditional probability estimation, and it requires either smoothing, Bayesian priors, or more data.
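One way to tame the small-denominator problem is to shrink the raw ratio toward a prior guess. The sketch below does this with additive smoothing. The helper name, the prior strength of 10 and the example counts are all assumptions chosen for illustration.

```python
def smoothed_cond_prob(n_ab, n_b, prior_mean, prior_strength):
    """Estimate P(A|B) from counts, shrunk toward prior_mean.

    This is the posterior mean under a Beta prior with
    alpha = prior_mean * prior_strength and
    beta = (1 - prior_mean) * prior_strength.
    """
    return (n_ab + prior_mean * prior_strength) / (n_b + prior_strength)

# Hypothetical rare event: only 5 cases observed, 3 of them fraud.
raw_small = 3 / 5
smoothed_small = smoothed_cond_prob(3, 5, prior_mean=0.10, prior_strength=10)

# Same raw ratio with 100x more data: the prior barely matters.
raw_large = 300 / 500
smoothed_large = smoothed_cond_prob(300, 500, prior_mean=0.10, prior_strength=10)

print(f"n=5:   raw = {raw_small:.3f}, smoothed = {smoothed_small:.3f}")
print(f"n=500: raw = {raw_large:.3f}, smoothed = {smoothed_large:.3f}")
```

With 5 cases the estimate is pulled from 0.600 toward the 0.10 base rate; with 500 cases it stays near 0.600. When the conditioning event has no samples at all, the function simply returns the prior mean instead of dividing by zero.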

Test Your Understanding

  1. A spam filter achieves: TP=180, FP=20, FN=30, TN=770. Compute P(spam | predicted spam) and P(predicted spam | spam). Explain in one sentence why these are different.

  2. Using the conditional probability formula, show algebraically why P(A|B) ≠ P(B|A) in general. Under what condition would they be equal?

  3. The chain rule gives P(A∩B∩C) = P(A|B∩C) × P(B|C) × P(C). Use this to compute the probability that three independent sensor alarms all fire simultaneously, where each fires with probability 0.05 independently.

  4. From the anchor confusion matrix, compute P(predicted positive) two ways: (a) directly from column totals, and (b) using the law of total probability. Verify they match.

  5. A model for rare disease detection (prevalence = 1%) has sensitivity (recall) = 0.99 and specificity = 0.95. Compute the precision (positive predictive value) using Bayes' theorem. Why is the result surprising given the high sensitivity and specificity?
