Multiplication Rule
A binary classifier produces two outputs: a predicted label and a confidence score. In practice, a prediction is only useful if two things are true simultaneously — the model is confident and the prediction is correct. The addition rule handles "at least one of these," but "both of these" requires a different tool: the multiplication rule. How you multiply depends on whether the two events influence each other, and that distinction turns out to matter enormously in ML applications.
The Anchor: A Defect Detector on a Sensor Pipeline
A factory quality control system uses a sensor to detect product defects. We have 10,000 inspected units:
- 600 units have actual defects: P(defect) = 0.06
- The sensor fires on 550 units: P(sensor fires) = 0.055
- 480 units have a defect and the sensor fires: P(defect ∩ sensor fires) = 0.048
These counts anchor the calculations below.
Independent Events: No Influence Between Them
Two events are independent if knowing the outcome of one tells you nothing about the other. The probability of both happening is just the product of their individual probabilities:
P(A ∩ B) = P(A) × P(B)
In our factory context, suppose we run the same inspection process on two separate production lines, each with the same 0.06 defect rate, and the lines have no shared components or operators. A defect on line 1 tells you nothing about defects on line 2. What is the probability that both lines produce a defective unit?
P(defect on line 1 ∩ defect on line 2) = 0.06 × 0.06 = 0.0036
About 0.36%. Because the lines are independent, no conditioning is needed and we multiply the two marginal probabilities directly.
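The two-line calculation can be sketched in a few lines of Python. The helper name `joint_probability_independent` is hypothetical, and the sketch assumes that both lines share the anchor defect rate of 0.06 and do not influence each other.

```python
import math


def joint_probability_independent(*probabilities: float) -> float:
    """Multiply marginal probabilities; valid only if the events are independent."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(probabilities)


# Assumption: each line has the anchor defect rate, 600 / 10,000 = 0.06.
p_defect_line_1 = 600 / 10_000
p_defect_line_2 = 600 / 10_000

p_both_defective = joint_probability_independent(p_defect_line_1, p_defect_line_2)
print(f"P(defect on both lines) = {p_both_defective:.4f}")  # 0.0036
```

The function accepts any number of events, but the result is only meaningful when every event is independent of the others.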
Dependent Events: When One Changes the Other
Now consider the sensor on a single production line. The sensor fires based on what it detects, and what it detects is correlated with whether a defect is present. Knowing "the sensor fired" gives you information about defect probability — these events are dependent.
For dependent events:
P(A ∩ B) = P(A) × P(B | A)
The notation P(B | A) is read "probability of B given A has already occurred." This is conditional probability: the probability of B recalculated after restricting attention only to outcomes where A happened.
Why does conditioning change the probability? Because it shrinks your sample space. When you condition on "sensor fires," you are no longer looking at all 10,000 units, only at the 550 units where the sensor fired. Among those 550, 480 have actual defects, so P(defect | sensor fires) = 480/550 ≈ 0.873, far above the unconditional P(defect) = 0.06. The sample space shrank, and the fraction of defects inside it changed.
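One way to see the shrinking sample space is to fill in the full 2×2 table from the anchor counts and compare the unconditional and conditional fractions. This is a minimal sketch; the variable names are illustrative, and the three derived cells follow by subtraction from the counts given above.

```python
# Counts from the anchor dataset
total_units = 10_000
defective = 600
fires = 550
defective_and_fires = 480

# Fill in the remaining cells of the 2x2 table by subtraction.
defective_no_fire = defective - defective_and_fires   # 120 missed defects
fires_no_defect = fires - defective_and_fires         # 70 false alarms
neither = total_units - defective - fires_no_defect   # 9,330 clean and quiet

# Unconditional: the sample space is all 10,000 units.
p_defect = defective / total_units

# Conditional: the sample space shrinks to the 550 units where the sensor fired.
p_defect_given_fires = defective_and_fires / fires

print(f"P(defect)                = {p_defect:.3f}")              # 0.060
print(f"P(defect | sensor fires) = {p_defect_given_fires:.3f}")  # 0.873
```

Only the denominator changes between the two fractions: 10,000 units in the first case, 550 in the second.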
Now we can compute: what is the probability that a unit has a defect and the sensor fires?
We need P(sensor fires | defect): of the 600 defective units, 480 triggered the sensor. So P(sensor fires | defect) = 480/600 = 0.80.
P(defect ∩ sensor fires) = P(defect) × P(sensor fires | defect) = 0.06 × 0.80 = 0.048
This matches the joint count: 480/10,000 = 0.048.
Easy mistake: Assuming independence when events are actually dependent. If we had ignored the dependency and multiplied P(defect) × P(sensor fires) = 0.06 × 0.055 = 0.0033, we'd get a drastically wrong answer — 0.0033 versus the true 0.048. The sensor fires because defects exist, so the two events are strongly positively dependent. Treating correlated signals as independent is one of the most common and consequential errors in applied probability.
The General Form and Its Symmetry
For any two events, you can multiply in either order:
P(A ∩ B) = P(A) × P(B | A) = P(B) × P(A | B)
Both expressions equal the same joint probability — they just condition in opposite directions. This symmetry is useful when one direction is easier to look up than the other. It is also the algebraic foundation of Bayes' theorem.
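The symmetry can be checked numerically on the anchor counts. The sketch below computes the joint probability in both directions and then rearranges one product to recover the other conditional, which is the Bayes' theorem step. Variable names are illustrative.

```python
import math

total_units = 10_000
defective = 600
fires = 550
defective_and_fires = 480

p_defect = defective / total_units
p_fires = fires / total_units
p_fires_given_defect = defective_and_fires / defective  # 0.80
p_defect_given_fires = defective_and_fires / fires      # about 0.873

# Same joint probability, conditioned in opposite directions.
joint_via_defect = p_defect * p_fires_given_defect
joint_via_fires = p_fires * p_defect_given_fires

# Dividing the first product by P(fires) recovers the other conditional.
p_defect_given_fires_bayes = p_fires_given_defect * p_defect / p_fires

print(f"{joint_via_defect:.4f} {joint_via_fires:.4f}")                 # 0.0480 0.0480
print(math.isclose(p_defect_given_fires_bayes, p_defect_given_fires))  # True
```

The comparison uses `math.isclose` rather than `==` because the two routes involve different floating-point operations.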
How to Check for Independence
Events A and B are independent if and only if:
P(A ∩ B) = P(A) × P(B)
Equivalently, when P(B) > 0: P(A | B) = P(A). Knowing B happened does not change A's probability.
For our sensor: P(defect) × P(sensor fires) = 0.06 × 0.055 = 0.0033, but P(defect ∩ sensor fires) = 0.048. They don't match, confirming the events are dependent. The independence test gives you a concrete check rather than a judgment call.
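The check can also be run directly on counts. This is a sketch with a hypothetical helper, `counts_are_independent`, which multiplies both sides of P(A ∩ B) = P(A) × P(B) by total² so that the comparison uses exact integer arithmetic. The second set of counts is made up to show a case that passes.

```python
def counts_are_independent(count_a: int, count_b: int, count_both: int, total: int) -> bool:
    """Exact check of P(A ∩ B) == P(A) × P(B) on raw counts.

    Multiplying both sides by total² gives count_both × total == count_a × count_b,
    which avoids floating-point rounding.
    """
    return count_both * total == count_a * count_b


# Anchor dataset: 480 × 10,000 = 4,800,000 vs 600 × 550 = 330,000
print(counts_are_independent(600, 550, 480, 10_000))  # False

# Hypothetical counts constructed to be independent: 30 × 10,000 == 600 × 500
print(counts_are_independent(600, 500, 30, 10_000))   # True
```

Exact equality is a property of the tabulated counts only. On sampled data the two sides almost never match exactly, so deciding independence there calls for a statistical test such as chi-squared rather than this equality check.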
Trace Table
| Phase | Formula | Values | Result |
|---|---|---|---|
| P(defect) | defective / total | 600 / 10,000 | 0.06 |
| P(sensor fires \| defect) | fires among defective / defective | 480 / 600 | 0.80 |
| P(defect ∩ sensor fires) | P(defect) × P(sensor fires \| defect) | 0.06 × 0.80 | 0.048 |
| Independence check | P(defect) × P(sensor fires) | 0.06 × 0.055 | 0.0033 ≠ 0.048 |
Python Implementation
```python
# Counts from the anchor dataset
total_units = 10_000
defective_count = 600
sensor_fires_count = 550
both_count = 480  # defect present AND sensor fired

# Multiplication rule for dependent events: P(A ∩ B) = P(A) × P(B | A)
prob_defect = defective_count / total_units
prob_fires_given_defect = both_count / defective_count
prob_joint_dependent = prob_defect * prob_fires_given_defect

# What we would get by wrongly assuming independence: P(A) × P(B)
prob_fires = sensor_fires_count / total_units
prob_joint_if_independent = prob_defect * prob_fires

print(f"P(defect and sensor fires) [dependent]: {prob_joint_dependent:.4f}")
print(f"P(defect and sensor fires) [if independent]: {prob_joint_if_independent:.4f}")
print(f"Independence assumption error: {abs(prob_joint_dependent - prob_joint_if_independent):.4f}")
```

Output:

```
P(defect and sensor fires) [dependent]: 0.0480
P(defect and sensor fires) [if independent]: 0.0033
Independence assumption error: 0.0447
```
When Dependency Compounds
Extend the pipeline: a unit must pass two sensor stages in sequence. Stage 1 catches 80% of defects; stage 2, given stage 1 missed it, catches 60% of remaining defects. P(defect escapes both stages) = P(miss stage 1) × P(miss stage 2 | missed stage 1) = 0.20 × 0.40 = 0.08. The dependency compounds: stage 2 only sees what stage 1 missed, so its conditional probability is not the same as its marginal probability. Three-stage pipelines, ensemble model agreements, and multi-step authentication flows all follow the same chained multiplication logic.
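The chained multiplication generalizes to any number of stages. The sketch below uses a hypothetical helper, `escape_probability`, and assumes each catch rate is already conditional on every earlier stage having missed the defect. The third stage and its 50% catch rate are invented for illustration.

```python
import math


def escape_probability(conditional_catch_rates: list[float]) -> float:
    """P(defect escapes every stage).

    Each entry is P(stage k catches the defect | all earlier stages missed it).
    """
    miss_rates = [1.0 - rate for rate in conditional_catch_rates]
    return math.prod(miss_rates)


two_stage = escape_probability([0.80, 0.60])
print(f"P(escapes both stages) = {two_stage:.2f}")     # 0.08

# Hypothetical third stage catching 50% of what the first two missed.
three_stage = escape_probability([0.80, 0.60, 0.50])
print(f"P(escapes three stages) = {three_stage:.2f}")  # 0.04
```

Passing marginal catch rates instead of conditional ones would silently assume the stages are independent, which is the mistake described earlier.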
Related Concepts
The addition rule is the prerequisite — you need to be comfortable with P(A ∪ B) before thinking about P(A ∩ B), because the two formulas are complementary views of combining events. The multiplication rule directly enables Bayes' theorem: P(defect | fires) = P(fires | defect) × P(defect) / P(fires), which is just the multiplication rule rearranged. It also underlies the chain rule of probability, used in language model probability calculations and graphical models.
Honest Limitations
Independence is an assumption, not a given. In this sensor example it was clear the events were dependent, but in high-dimensional ML data — where you might have dozens of correlated features — determining which pairs of events are independent is genuinely hard. Naive Bayes classifiers famously assume feature independence even when it is violated, and yet often work well in practice. The multiplication rule is precise; the independence assumption it sometimes relies on is where careful modeling judgment is required.
Test Your Understanding
- A model achieves 80% precision (P(truly positive | predicted positive) = 0.80) and predicts positive on 15% of all inputs. What fraction of all inputs are both predicted positive and truly positive?
- Two sensor modules on separate hardware share no components. P(module A fails) = 0.01, P(module B fails) = 0.02. What is P(both fail)? What assumption makes this calculation valid?
- From the anchor dataset: what is P(sensor fires | no defect)? Show the calculation using only the counts given, and explain what this value represents in practice.
- Suppose P(defect) = 0.06, P(sensor fires | defect) = 0.80, and P(sensor fires | no defect) = 0.01. Compute P(sensor fires) from these values, then show that independence fails by comparing it with P(sensor fires | defect).
- A pipeline has three binary classifiers in sequence, each independent with accuracy 0.90. What is the probability all three correctly classify the same input? How does accuracy compound as you add stages?