Multiplication Rule

Apr 11, 2026•9 min read•By Mohammed Vasim

Statistics • Math • Data Science

A binary classifier produces two outputs: a predicted label and a confidence score. In practice, a prediction is only useful if two things are true simultaneously — the model is confident and the prediction is correct. The addition rule handles "at least one of these," but "both of these" requires a different tool: the multiplication rule. How you multiply depends on whether the two events influence each other, and that distinction turns out to matter enormously in ML applications.

The Anchor: A Defect Detector on a Sensor Pipeline

A factory quality control system uses a sensor to detect product defects. We have 10,000 inspected units:

600 units have actual defects: P(defect) = 0.06
The sensor fires on 550 units: P(sensor fires) = 0.055
480 units have a defect and the sensor fires: P(defect ∩ sensor fires) = 0.048

This dataset appears in every calculation below.
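As a quick check, the three probabilities follow from dividing each count by the total. The sketch below is only an illustration; the variable names are chosen here and are not part of any library.

```python
from fractions import Fraction

# Counts from the anchor dataset
total_units = 10_000
defect_units = 600        # units with an actual defect
fired_units = 550         # units on which the sensor fired
defect_and_fired = 480    # units with a defect AND a sensor firing

# Exact probabilities as fractions, converted to floats only for display
p_defect = Fraction(defect_units, total_units)
p_fires = Fraction(fired_units, total_units)
p_defect_and_fires = Fraction(defect_and_fired, total_units)

print(f"P(defect)                = {float(p_defect):.3f}")
print(f"P(sensor fires)          = {float(p_fires):.3f}")
print(f"P(defect ∩ sensor fires) = {float(p_defect_and_fires):.3f}")
```

Using `Fraction` keeps the arithmetic exact, which avoids floating-point surprises when these values are reused in later calculations.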

Independent Events: No Influence Between Them

Two events are independent if knowing the outcome of one tells you nothing about the other. The probability of both happening is just the product of their individual probabilities:

$P(A \cap B) = P(A) \times P(B)$

In our factory context, suppose we run the same inspection process on two separate production lines and the lines have no shared components or operators. A defect on line 1 tells you nothing about defects on line 2. What is the probability that both lines produce a defective unit?

$P(\text{line 1 defect}) = 0.06, \quad P(\text{line 2 defect}) = 0.06$

$P(\text{both defective}) = 0.06 \times 0.06 = 0.0036$

About 0.36%. Because the lines are independent, the probability of any pair of outcomes (one from each line) is the product of the individual probabilities, so we multiply directly.
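One way to sanity-check the product is a small simulation that draws each line's outcome separately. This is a rough sketch: the function name, trial count, and seed are arbitrary choices made for illustration.

```python
import random

def simulate_both_defective(p_line1, p_line2, trials, seed=0):
    """Estimate P(both lines defective) by drawing each line independently."""
    rng = random.Random(seed)
    both = 0
    for _ in range(trials):
        line1_defect = rng.random() < p_line1   # draw for line 1
        line2_defect = rng.random() < p_line2   # separate draw for line 2
        if line1_defect and line2_defect:
            both += 1
    return both / trials

exact = 0.06 * 0.06
estimate = simulate_both_defective(0.06, 0.06, trials=200_000)
print(f"exact    = {exact:.4f}")
print(f"estimate = {estimate:.4f}")
```

The estimate should land close to 0.0036. The two draws never look at each other, which is exactly what independence means here.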

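The sensor and the defect, by contrast, are not independent: P(defect) × P(sensor fires) = 0.06 × 0.055 = 0.0033, far below the observed joint probability of 0.048. For dependent events the general multiplication rule is P(A ∩ B) = P(A) × P(B|A), which rearranges to the conditional probability P(B|A) = P(A ∩ B) / P(A). Two queries against the anchor dataset show it in action.

Query 1 — sensor fires given defect (sensitivity):

From our anchor: 600 defective units, and the sensor fires on 480 of them.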

text

P(sensor fires | defect) = P(sensor fires ∩ defect) / P(defect)
                         = 0.048 / 0.06 = 0.80

Knowing the unit is defective, the sensor fires 80% of the time. Conditioning on "defect" increases the probability of "sensor fires" (compared to the baseline 0.055).

Query 2 — defect given sensor fires (precision-like):

text

P(defect | sensor fires) = P(defect ∩ sensor fires) / P(sensor fires)
                         = 0.048 / 0.055 = 0.873

Given the sensor fired, 87.3% of those units are actually defective. Conditioning on "sensor fires" dramatically increases the probability of "defect" versus the unconditional baseline of 0.06.

These two conditional probabilities, P(sensor fires | defect) and P(defect | sensor fires), are numerically different — 0.80 vs 0.87. They answer different questions. Confusing them is a common error in production ML.
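Both queries use the same division, only with a different event in the denominator. A minimal sketch, assuming a hypothetical helper named `conditional`:

```python
def conditional(p_joint, p_condition):
    """P(A | B) = P(A ∩ B) / P(B), where B is the conditioning event."""
    return p_joint / p_condition

# Probabilities from the anchor dataset
p_defect = 0.06
p_fires = 0.055
p_joint = 0.048   # P(defect ∩ sensor fires)

p_fires_given_defect = conditional(p_joint, p_defect)   # Query 1
p_defect_given_fires = conditional(p_joint, p_fires)    # Query 2

print(f"P(fires | defect) = {p_fires_given_defect:.3f}  (baseline P(fires)  = {p_fires})")
print(f"P(defect | fires) = {p_defect_given_fires:.3f}  (baseline P(defect) = {p_defect})")
```

The numerator is the same in both calls. Swapping the denominator is what changes the question being answered.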

Bayes' Theorem

Bayes' theorem is derived directly from the multiplication rule. Start from two ways to write the joint probability:

text

P(A ∩ B) = P(A) × P(B|A)   [multiply from A's side]
P(B ∩ A) = P(B) × P(A|B)   [multiply from B's side]

Since P(A ∩ B) = P(B ∩ A), set them equal and rearrange:

text

P(B) × P(A|B) = P(A) × P(B|A)
P(A|B) = P(A) × P(B|A) / P(B)

This is Bayes' theorem. It lets you update a prior probability P(A) using evidence B.
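The derivation can be checked numerically against the anchor counts, taking A = defect and B = sensor fires. This is a sketch for verification only, with the conditionals read directly off the counts:

```python
# Anchor counts: 10,000 units, 600 defects, 550 firings, 480 both
p_a = 600 / 10_000        # P(A): defect
p_b = 550 / 10_000        # P(B): sensor fires
p_b_given_a = 480 / 600   # P(B|A): fraction of defective units that fired
p_a_given_b = 480 / 550   # P(A|B): fraction of firings that were defective

# Multiplication rule from each side gives the same joint probability
joint_from_a = p_a * p_b_given_a
joint_from_b = p_b * p_a_given_b

# Bayes' theorem recovers P(A|B) from the other three quantities
bayes_a_given_b = p_a * p_b_given_a / p_b

print(f"P(A) x P(B|A)       = {joint_from_a:.4f}")
print(f"P(B) x P(A|B)       = {joint_from_b:.4f}")
print(f"P(A|B) via Bayes    = {bayes_a_given_b:.4f}")
print(f"P(A|B) from counts  = {p_a_given_b:.4f}")
```

Both factorizations give 0.048, and the Bayes result agrees with the value counted directly from the data.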

The denominator via total probability: P(B) is often unknown directly. Use the law of total probability to expand it:

text

P(B) = P(B|A) × P(A) + P(B|Aᶜ) × P(Aᶜ)

This sums over all ways B can happen — through A and through not-A.

Full posterior computation — fraud detection:

Using specific fraud domain values:

Prior: P(fraudulent) = 0.02
Likelihood: P(flagged | fraudulent) = 0.95 — the model catches 95% of real fraud
Likelihood: P(flagged | legitimate) = 0.05 — the model falsely flags 5% of legitimate transactions

Step 1 — law of total probability for the denominator:

text

P(flagged) = P(flagged | fraudulent) × P(fraudulent) + P(flagged | legitimate) × P(legitimate)
           = 0.95 × 0.02 + 0.05 × 0.98
           = 0.019 + 0.049
           = 0.068

Step 2 — posterior:

text

P(fraudulent | flagged) = P(flagged | fraudulent) × P(fraudulent) / P(flagged)
                        = 0.95 × 0.02 / 0.068
                        = 0.019 / 0.068
                        = 0.279

Even though the model catches 95% of fraud, only 27.9% of flagged transactions are actually fraudulent. Why? Because fraud is rare (P=0.02) and the 5% false positive rate, applied to the vast majority of legitimate transactions, generates many false alarms.

The confusion to avoid: P(flagged | fraudulent) = 0.95 is the recall of the model. P(fraudulent | flagged) = 0.279 is the precision. These are not the same number and they answer different questions. Treating recall as precision (assuming "our model catches 95% of fraud, so when it flags something there's a 95% chance it's fraud") is the prosecutor's fallacy applied to ML — it ignores the base rate.
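To see how strongly the base rate drives precision, the posterior can be recomputed for several priors while the two likelihoods stay fixed. This is an illustrative sketch: the function name is made up here, and every prior other than 0.02 is a hypothetical value chosen only to show the trend.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(fraud | flagged) via Bayes' theorem, with P(flagged) from total probability."""
    p_flagged = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_flagged

sensitivity = 0.95           # P(flagged | fraudulent), from the example
false_positive_rate = 0.05   # P(flagged | legitimate), from the example

# Priors other than 0.02 are hypothetical
for prior in (0.001, 0.02, 0.10, 0.50):
    p = posterior(prior, sensitivity, false_positive_rate)
    print(f"P(fraud) = {prior:<5} -> P(fraud | flagged) = {p:.3f}")
```

With these likelihoods, precision only reaches 0.95 when half of all transactions are fraudulent. At realistic base rates it is far lower, even though recall never changes.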

When Dependency Compounds

Extend the pipeline: a unit must pass two sensor stages in sequence. Stage 1 catches 80% of defects; stage 2, given stage 1 missed it, catches 60% of remaining defects. P(defect escapes both stages) = P(miss stage 1) × P(miss stage 2 | missed stage 1) = 0.20 × 0.40 = 0.08. The dependency compounds: stage 2 only sees what stage 1 missed, so its conditional probability is not the same as its marginal probability. Three-stage pipelines, ensemble model agreements, and multi-step authentication flows all follow the same chained multiplication logic.
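The chained multiplication generalizes to any number of stages, as long as each rate is conditional on every earlier stage having missed. A small sketch, where the function name and the third-stage rate of 0.50 are hypothetical additions for illustration:

```python
from math import prod

def escape_probability(conditional_catch_rates):
    """P(defect escapes every stage).

    Each rate is P(stage k catches | all earlier stages missed), so each
    factor (1 - rate) is already a conditional miss probability.
    """
    return prod(1 - rate for rate in conditional_catch_rates)

two_stage = escape_probability([0.80, 0.60])
# The third rate (0.50) is a hypothetical value, not from the text
three_stage = escape_probability([0.80, 0.60, 0.50])

print(f"two stages:   {two_stage:.2f}")
print(f"three stages: {three_stage:.2f}")
```

Plugging in a stage's marginal catch rate instead of its conditional rate would silently give the wrong answer, so the inputs need to be measured on the units that actually reach each stage.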

Probability in Machine Learning

Precision, Recall, and F1 as conditional probabilities:

Metric	Formula	Probability form
Precision	TP / (TP + FP)	P(truly positive | predicted positive)
Recall	TP / (TP + FN)	P(predicted positive | truly positive)
F1	2 × Precision × Recall / (Precision + Recall)	Harmonic mean of two conditional probabilities

Precision is "given the model said yes, how often is it right?" Recall is "given the truth is yes, how often does the model say so?" A model can have recall=0.95 and precision=0.05 simultaneously — this happens when the base rate is very low and the model flags too aggressively. F1 is zero whenever either component is zero (harmonic mean has this property), which is why it penalizes imbalanced precision/recall more severely than the arithmetic mean would.
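The same metrics can be computed for the factory sensor by treating "sensor fires" as the positive prediction. In this sketch the false positive and false negative counts are derived by subtraction from the anchor counts, and the lopsided case reuses the recall = 0.95, precision = 0.05 figures mentioned above.

```python
# Confusion counts implied by the anchor dataset
tp = 480          # defect and sensor fired
fp = 550 - 480    # sensor fired, no defect
fn = 600 - 480    # defect, sensor did not fire

precision = tp / (tp + fp)   # P(defect | sensor fires)
recall = tp / (tp + fn)      # P(sensor fires | defect)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")

# Lopsided case: harmonic mean (F1) versus arithmetic mean
lopsided_f1 = 2 * 0.05 * 0.95 / (0.05 + 0.95)
lopsided_mean = (0.05 + 0.95) / 2
print(f"lopsided F1 = {lopsided_f1:.3f}, arithmetic mean = {lopsided_mean:.3f}")
```

In the lopsided case the arithmetic mean sits at 0.5 while F1 drops below 0.1, which is the penalty the harmonic mean applies when one component is weak.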

Joint probability and feature independence:

Two features X and Y are independent if P(X ∩ Y) = P(X) × P(Y). Their joint distribution factorizes and knowing X gives no information about Y. Mutual information measures statistical dependence: it is exactly zero iff features are independent, and positive otherwise.
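For two binary events, mutual information can be computed from the 2×2 joint table. The sketch below is a hand-rolled illustration (the function is not from any library), applied to the anchor dataset and to a hypothetical independent version with the same marginals.

```python
from math import log2

def mutual_information(p_x, p_y, p_xy):
    """Mutual information in bits between two binary events.

    p_x = P(X), p_y = P(Y), p_xy = P(X ∩ Y).
    """
    # Joint probabilities of the four cells of the 2x2 table
    joint = {
        (1, 1): p_xy,
        (1, 0): p_x - p_xy,
        (0, 1): p_y - p_xy,
        (0, 0): 1 - p_x - p_y + p_xy,
    }
    marg_x = {1: p_x, 0: 1 - p_x}
    marg_y = {1: p_y, 0: 1 - p_y}
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # zero-probability cells contribute nothing
            total += p * log2(p / (marg_x[x] * marg_y[y]))
    return max(total, 0.0)  # clamp tiny negative rounding error

# Anchor dataset: defect and sensor firing are strongly dependent
mi_anchor = mutual_information(0.06, 0.055, 0.048)
# Hypothetical independent case: joint equals the product of the marginals
mi_independent = mutual_information(0.06, 0.055, 0.06 * 0.055)

print(f"MI (anchor)      = {mi_anchor:.4f} bits")
print(f"MI (independent) = {mi_independent:.4f} bits")
```

The anchor value is clearly positive, while the independent case is zero up to floating-point precision.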

Naive Bayes exploits the multiplication rule for independent events: it assumes all features are conditionally independent given the class label, so:

text

P(x₁, x₂, ..., xₙ | y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)

This is the multiplication rule for independent events applied n times. The "naive" part is that this independence assumption is usually violated in practice — yet the classifier often works well because the class posterior P(y | x₁,...,xₙ) can still be correctly ranked even when the probabilities themselves are miscalibrated.
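A toy calculation makes the factorization concrete. In this sketch the class prior reuses the 0.02 fraud rate from the earlier example, but the three binary features and their per-class likelihoods are entirely hypothetical, and the function is a bare illustration rather than a production classifier.

```python
from math import prod

priors = {"fraud": 0.02, "legit": 0.98}
# Hypothetical P(feature present | class) for three binary features
likelihoods = {
    "fraud": [0.8, 0.6, 0.1],
    "legit": [0.1, 0.3, 0.4],
}

def naive_bayes_posterior(observed):
    """Class posteriors for a list of 0/1 feature observations."""
    scores = {}
    for label, prior in priors.items():
        # Conditional independence: P(x1..xn | y) = product of P(xi | y)
        per_feature = [p if x else 1 - p
                       for p, x in zip(likelihoods[label], observed)]
        scores[label] = prior * prod(per_feature)
    evidence = sum(scores.values())  # total probability over the classes
    return {label: score / evidence for label, score in scores.items()}

post = naive_bayes_posterior([1, 1, 0])
for label, p in post.items():
    print(f"P({label} | features) = {p:.3f}")
```

Real implementations usually sum log-probabilities instead of multiplying, since a product over many features underflows quickly.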

Law of total probability and marginalization:

To compute P(model fails overall) when the data has two subgroups:

text

P(fails) = P(fails | easy examples) × P(easy) + P(fails | hard examples) × P(hard)

You marginalize over the subgroup variable by weighting each conditional probability by its prior. This is why aggregate accuracy metrics can be misleading when the subgroup mix differs between training and deployment: a failure rate computed with training proportions may not match the rate under the deployment distribution.
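The effect of a shifting subgroup mix is easy to demonstrate. Every number in this sketch is hypothetical, chosen only to show how the same per-subgroup failure rates produce different aggregate rates.

```python
def overall_failure_rate(p_fail_easy, p_fail_hard, p_easy):
    """P(fails), marginalizing over the easy/hard subgroup."""
    p_hard = 1 - p_easy
    return p_fail_easy * p_easy + p_fail_hard * p_hard

# Hypothetical per-subgroup failure rates (identical in both settings)
p_fail_easy = 0.02
p_fail_hard = 0.30

# Hypothetical subgroup mixes: 90% easy in training, 60% easy in deployment
train_rate = overall_failure_rate(p_fail_easy, p_fail_hard, p_easy=0.90)
deploy_rate = overall_failure_rate(p_fail_easy, p_fail_hard, p_easy=0.60)

print(f"failure rate with training mix:   {train_rate:.3f}")
print(f"failure rate with deployment mix: {deploy_rate:.3f}")
```

The model itself is unchanged between the two calls. Only the weights in the total-probability sum move, and the aggregate failure rate moves with them.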

python

# Full Bayes computation for the fraud detection example
p_fraud = 0.02                 # prior P(fraudulent)
p_flagged_given_fraud = 0.95   # likelihood P(flagged | fraudulent)
p_flagged_given_legit = 0.05   # false positive rate P(flagged | legitimate)
p_legit = 1 - p_fraud

# Law of total probability
p_flagged = p_flagged_given_fraud * p_fraud + p_flagged_given_legit * p_legit

# Posterior
p_fraud_given_flagged = (p_flagged_given_fraud * p_fraud) / p_flagged

print(f"P(flagged) = {p_flagged:.4f}")
print(f"P(fraud | flagged) = {p_fraud_given_flagged:.4f}")
print()
# Model evaluation: precision vs recall
recall = p_flagged_given_fraud
precision = p_fraud_given_flagged
print(f"Recall    = P(flagged|fraud) = {recall:.2f}")
print(f"Precision = P(fraud|flagged) = {precision:.4f}")
print(f"These are NOT the same: {recall:.2f} vs {precision:.4f}")

text

P(flagged) = 0.0680
P(fraud | flagged) = 0.2794

Recall    = P(flagged|fraud) = 0.95
Precision = P(fraud|flagged) = 0.2794
These are NOT the same: 0.95 vs 0.2794

Related Concepts

The addition rule is the prerequisite: you need to be comfortable with P(A ∪ B) before thinking about P(A ∩ B), because the two are tied together by P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The multiplication rule directly enables Bayes' theorem: P(defect | fires) = P(fires | defect) × P(defect) / P(fires), which is just the multiplication rule rearranged. It also underlies the chain rule of probability, used in language model probability calculations and graphical models.

Honest Limitations

Independence is an assumption, not a given. In this sensor example it was clear the events were dependent, but in high-dimensional ML data — where you might have dozens of correlated features — determining which pairs of events are independent is genuinely hard. Naive Bayes classifiers famously assume feature independence even when it is violated, and yet often work well in practice. The multiplication rule is precise; the independence assumption it sometimes relies on is where careful modeling judgment is required.

Test Your Understanding

A model achieves 80% precision (P(truly positive | predicted positive) = 0.80) and predicts positive on 15% of all inputs. What fraction of all inputs are both predicted positive and truly positive?
Two sensor modules on separate hardware share no components. P(module A fails) = 0.01, P(module B fails) = 0.02. What is P(both fail)? What assumption makes this calculation valid?
From the anchor dataset: what is P(sensor fires | no defect)? Show the calculation using only the counts given, and explain what this value represents in practice.
If P(defect) = 0.06 and P(sensor fires | defect) = 0.80, but P(sensor fires | no defect) = 0.01, compute P(sensor fires) with the law of total probability, then verify that independence fails by comparing P(defect) × P(sensor fires) with P(defect ∩ sensor fires).
A pipeline has three binary classifiers in sequence, each independent with accuracy 0.90. What is the probability all three correctly classify the same input? How does accuracy compound as you add stages?
