Addition Rule

Apr 11, 2026 • 7 min read • By Mohammed Vasim

Statistics • Math • Data Science

Binary classifiers make two kinds of errors: false positives and false negatives. A fraud detection model might flag a legitimate transaction as fraudulent (false positive) or miss an actual fraud (false negative). When a product manager asks "what fraction of transactions are either flagged by the model or actually fraudulent?", they're asking about the union of two events. You can't just add the two probabilities: if a transaction is both actually fraudulent and flagged, you'd count it twice. That double-counting problem is exactly what the addition rule solves.

The Anchor: A Binary Classifier on Fraud Data

Suppose we have 10,000 test transactions. The model's confusion matrix gives us:

400 transactions are actually fraudulent: P(fraud) = 400/10,000 = 0.04
500 transactions are flagged by the model: P(flagged) = 500/10,000 = 0.05
200 transactions are both actually fraudulent and flagged: P(fraud ∩ flagged) = 200/10,000 = 0.02

We'll use this dataset for every calculation, every diagram, and every code snippet in this post.
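The three counts above are enough to pin down the full 2×2 confusion matrix. A minimal sketch of that arithmetic follows; the variable names are illustrative and do not come from any library.

```python
# Counts stated in the post
total = 10_000
actual_fraud = 400      # |A|
flagged = 500           # |B|
both = 200              # |A ∩ B|: fraudulent and flagged

# The remaining confusion-matrix cells follow by subtraction
true_positives = both
false_negatives = actual_fraud - both   # fraud the model missed
false_positives = flagged - both        # legitimate but flagged
true_negatives = total - (true_positives + false_negatives + false_positives)

print(f"TP={true_positives} FN={false_negatives} "
      f"FP={false_positives} TN={true_negatives}")
```

The four cells are disjoint and sum to 10,000, which is a useful sanity check before computing any probability from them.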

Sample Space and Events

The sample space Ω is the set of all possible outcomes. An event is any subset of Ω.

From our anchor:

Ω = all 10,000 test transactions
A = {transactions that are actually fraudulent} — |A| = 400
B = {transactions flagged by the model} — |B| = 500
A ∩ B = {transactions that are fraudulent and flagged} — |A ∩ B| = 200
A ∪ B = {transactions that are fraudulent or flagged or both} — |A ∪ B| = ?
Aᶜ = {transactions that are not fraudulent} — |Aᶜ| = 9,600

Notation: P(A) is the probability of event A, P(Aᶜ) is the complement, P(A ∩ B) is the intersection (both), P(A ∪ B) is the union (at least one).
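One way to make these definitions concrete is to model events as Python sets of transaction IDs. The ID ranges below are hypothetical, chosen only so the set sizes match the anchor counts; real IDs would come from your own data.

```python
# Hypothetical transaction IDs 0..9999 stand in for the sample space Ω.
omega = set(range(10_000))

# ID ranges are assumptions chosen to reproduce |A| = 400, |B| = 500, |A ∩ B| = 200.
A = set(range(0, 400))       # actually fraudulent
B = set(range(200, 700))     # flagged by the model

intersection = A & B         # fraudulent and flagged
union = A | B                # fraudulent or flagged or both
complement_A = omega - A     # not fraudulent


def prob(event):
    """Probability of an event when every transaction is equally likely."""
    return len(event) / len(omega)


print(len(intersection), len(union), len(complement_A))
print(prob(A), prob(B), prob(intersection))
```

Set operators map directly onto the notation: `&` is ∩, `|` is ∪, and set difference from `omega` gives the complement.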

Complement Rule

The complement of event A is everything in Ω that is not in A. Together, A and Aᶜ partition the entire sample space:

$P(A) + P(A^{c}) = 1 \quad\Rightarrow\quad P(A^{c}) = 1 - P(A)$

Applied to our anchor:

$P(\text{not fraudulent}) = 1 - P(\text{fraudulent}) = 1 - 0.04 = 0.96$

DS use case: "What fraction of transactions are correctly processed (not fraudulent, or fraudulent and caught)?" is often easier to compute as 1 minus the fraction of transactions that are fraudulent and missed. More generally, if each of n predictions in a batch is independently correct with probability 0.9, then P(at least one prediction is wrong) = 1 − P(all n predictions correct) = 1 − 0.9ⁿ. Computing the complement is often simpler than computing the event directly.
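As a rough sketch of the batch calculation, assume each prediction is correct independently with probability 0.9, the figure used above. Both the independence assumption and the function name are illustrative.

```python
def prob_at_least_one_wrong(n, p_correct=0.9):
    """P(at least one of n predictions is wrong), assuming independence."""
    return 1 - p_correct ** n


# The complement turns an awkward "at least one" event into a single power.
for n in (1, 5, 10, 50):
    print(f"n={n:>2}: {prob_at_least_one_wrong(n):.4f}")
```

Computing the event directly would mean summing over every way that one, two, or more predictions could be wrong; the complement avoids all of that.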

The Formula

For any two events A and B:

$P (A \cup B) = P (A) + P (B) - P (A \cap B)$

The subtraction removes the overlap we counted once in P(A) and once again in P(B). Without it, transactions that are both fraudulent and flagged inflate our estimate.

Applied to our fraud dataset:

$P(\text{fraud} \cup \text{flagged}) = 0.04 + 0.05 - 0.02 = 0.07$

Seven percent of transactions are either actually fraudulent, flagged by the model, or both.
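A quick cross-check using only the anchor counts: split the union into three disjoint pieces (fraud that was missed, fraud that was flagged, and legitimate transactions that were flagged) and add them directly.

$P(A \cup B) = P(A \cap B^{c}) + P(A \cap B) + P(A^{c} \cap B) = \frac{200}{10{,}000} + \frac{200}{10{,}000} + \frac{300}{10{,}000} = 0.07$

Because the three pieces do not overlap, nothing is double-counted, and the total agrees with the addition rule.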

When Events Don't Overlap: Mutually Exclusive

Sometimes two events genuinely cannot occur on the same transaction. Consider flagging categories in a simpler rule-based system: a transaction is classified as either "high-amount anomaly" or "foreign-country anomaly" — never both, by design. These are mutually exclusive events.

When $P (A \cap B) = 0$ , the formula simplifies:

$P (A \cup B) = P (A) + P (B)$

From our dataset, suppose the model's 500 flagged transactions break into two non-overlapping alert types:

P(high-amount alert) = 300/10,000 = 0.03
P(foreign-country alert) = 200/10,000 = 0.02
P(both alert types) = 0 (mutually exclusive by design)

$P(\text{any alert}) = 0.03 + 0.02 = 0.05$

Mutually exclusive events are the simpler special case. The general formula always works — it just happens that the intersection is zero here.
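A small sketch of why a single-label design makes the alert types mutually exclusive. The label strings and the list layout are hypothetical; the counts are the ones given above.

```python
from collections import Counter

# Hypothetical per-transaction alert labels: each transaction carries at most
# one label, which is what makes the two alert types mutually exclusive.
alerts = (["high_amount"] * 300
          + ["foreign_country"] * 200
          + [None] * 9_500)

counts = Counter(alerts)
total = len(alerts)

p_high_amount = counts["high_amount"] / total
p_foreign = counts["foreign_country"] / total
p_any_alert = p_high_amount + p_foreign   # no intersection term needed

# Cross-check by counting transactions that carry any label at all
p_any_direct = sum(label is not None for label in alerts) / total

print(f"{p_any_alert:.2f} {p_any_direct:.2f}")
```

If the system ever allowed a transaction to carry both labels, this representation would no longer fit, and the intersection term would have to come back.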

When Events Can Overlap: Non-Mutually Exclusive

Back to the realistic scenario: "fraud" and "flagged" can overlap, and usually do when a model has any accuracy at all. Most interesting ML questions involve overlapping events — a user who clicks an ad and also converts, a sensor that fires and the fault is real, a prediction that is both high-confidence and correct.

The formula stays the same: P(A) + P(B) − P(A ∩ B). The intersection is no longer zero, so you cannot skip that term.

Easy mistake: Even when you know the events overlap, it is tempting to just add P(fraud) + P(flagged) = 0.04 + 0.05 = 0.09. That gives the wrong answer because the 200 transactions that are both fraudulent and flagged get counted in P(fraud) and again in P(flagged). You must subtract P(fraud ∩ flagged) = 0.02 to count them exactly once. The subtraction is not optional when events are non-mutually exclusive.
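The double-counting is easy to see at the level of individual transactions. The label lists below are hypothetical, arranged only to reproduce the anchor counts.

```python
# Hypothetical per-transaction labels arranged to match the anchor counts:
# 200 fraud and flagged, 200 fraud only, 300 flagged only, 9,300 neither.
is_fraud = [True] * 200 + [True] * 200 + [False] * 300 + [False] * 9_300
is_flagged = [True] * 200 + [False] * 200 + [True] * 300 + [False] * 9_300

n = len(is_fraud)

# Naive: add the two event counts, which counts the overlap twice
naive_count = sum(is_fraud) + sum(is_flagged)

# Direct: count each transaction once if it belongs to either event
union_count = sum(f or g for f, g in zip(is_fraud, is_flagged))

print(f"naive:  {naive_count} -> {naive_count / n:.2f}")
print(f"direct: {union_count} -> {union_count / n:.2f}")
```

The gap between the two counts is exactly the 200 transactions in the intersection.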

Trace Table

Quantity	Formula	Values	Result
P(fraud)	fraudulent / total	400 / 10,000	0.04
P(flagged)	flagged / total	500 / 10,000	0.05
P(fraud ∩ flagged)	fraudulent and flagged / total	200 / 10,000	0.02
P(fraud ∪ flagged)	P(A) + P(B) − P(A ∩ B)	0.04 + 0.05 − 0.02	0.07

Python Implementation

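```python
# Counts from the anchor dataset
total_transactions = 10_000
fraud_count = 400      # actually fraudulent
flagged_count = 500    # flagged by the model
both_count = 200       # fraudulent and flagged

# Convert counts to probabilities
prob_fraud = fraud_count / total_transactions
prob_flagged = flagged_count / total_transactions
prob_both = both_count / total_transactions

# Addition rule: subtract the overlap so it is counted once
prob_union = prob_fraud + prob_flagged - prob_both
print(f"P(fraud or flagged) = {prob_union:.4f}")
```

```text
P(fraud or flagged) = 0.0700
```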
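If the calculation is needed in more than one place, it could be wrapped in a small helper that also rejects inputs no real pair of events could produce. This is a sketch, not part of any library: `union_probability` and its checks are assumptions made for illustration.

```python
def union_probability(p_a, p_b, p_both):
    """Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

    Raises ValueError when the inputs cannot describe real events.
    """
    for name, p in (("p_a", p_a), ("p_b", p_b), ("p_both", p_both)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    # The overlap cannot be larger than either event on its own.
    if p_both > min(p_a, p_b):
        raise ValueError("P(A ∩ B) cannot exceed P(A) or P(B)")
    result = p_a + p_b - p_both
    # A union above 1 means the overlap is too small (small float tolerance).
    if result > 1.0 + 1e-12:
        raise ValueError("inputs imply P(A ∪ B) > 1")
    return result


print(f"{union_probability(0.04, 0.05, 0.02):.4f}")
```

The checks catch a common data problem: an intersection estimate taken from a different sample than the two marginals, which can yield impossible combinations.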

Three or More Events

The pattern extends to three events:

$P (A \cup B \cup C) = P (A) + P (B) + P (C) - P (A \cap B) - P (A \cap C) - P (B \cap C) + P (A \cap B \cap C)$

In our fraud domain, imagine a third event: P(high-velocity) = 300/10,000 = 0.03 for transactions with unusual frequency. You'd need all pairwise intersections and the triple intersection. This is the inclusion-exclusion principle — it generalizes to any number of events by alternating between adding and subtracting intersections. It does not get cleaner with more events, but the pattern is systematic.
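The alternating pattern can be sketched generically with sets. The event sizes below match the post (400, 500, 300), but every overlap involving high-velocity is an assumption invented for this example, since the post does not give those counts.

```python
from itertools import combinations


def inclusion_exclusion_count(events):
    """Size of the union of a list of sets via inclusion-exclusion."""
    total = 0
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 == 1 else -1   # add odd-sized groups, subtract even-sized
        for group in combinations(events, k):
            total += sign * len(set.intersection(*group))
    return total


# Hypothetical ID ranges: the sizes are from the post, the overlaps are made up.
fraud = set(range(0, 400))
flagged = set(range(200, 700))
high_velocity = set(range(350, 650))

events = [fraud, flagged, high_velocity]
count = inclusion_exclusion_count(events)

# The formula should agree with a direct set union
assert count == len(fraud | flagged | high_velocity)
print(count / 10_000)
```

With n events the loop visits 2ⁿ − 1 intersections, so in practice a direct union over the raw data is usually cheaper; the formula matters when only the intersection probabilities are available.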

Related Concepts

You need basic probability notation and the complement rule before this post makes full sense, because understanding that events can overlap requires a firm grip on what a sample space is. The addition rule in turn unlocks Bayes' theorem and the law of total probability, both of which decompose complex probability calculations into manageable pieces using unions and intersections.

Honest Limitations

The addition rule is only as reliable as your estimate of P(A ∩ B). In the fraud example, we knew the confusion matrix precisely — that's unusual. In production, estimating how two noisy signals co-occur requires enough labeled data for accurate joint counts. With small datasets, the intersection estimate can be the weakest link in the calculation.
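To get a rough feel for how sample size affects the intersection estimate, one option is the binomial standard error of a proportion. This assumes transactions are independent draws, which real fraud data may violate, and the sample sizes other than 10,000 are hypothetical.

```python
import math


def binomial_standard_error(p, n):
    """Standard error of an estimated proportion p from n i.i.d. samples."""
    return math.sqrt(p * (1 - p) / n)


p_both = 0.02  # P(fraud ∩ flagged) from the anchor dataset

for n in (500, 10_000, 1_000_000):   # hypothetical labeled-set sizes
    se = binomial_standard_error(p_both, n)
    print(f"n={n:>9,}: SE={se:.5f}  relative={se / p_both:.1%}")
```

The relative error shrinks only with the square root of n, so a rare intersection needs far more labeled data than the marginals do to reach the same precision.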

Test Your Understanding

A recommendation model flags 8% of products as "trending" and 5% as "seasonal." If 2% are flagged as both, what fraction of products receive at least one flag?
A model predicts two non-overlapping error classes: type-A errors at 3% and type-B errors at 4%. What is P(any error)? Why can you use the simplified formula here?
If P(fraud) = 0.04 and P(flagged) = 0.05 but the events are independent, what would P(fraud ∩ flagged) be, and how would that change P(fraud ∪ flagged)?
A spam filter and a phishing detector both process the same emails. P(spam) = 0.12, P(phishing) = 0.06, P(spam ∩ phishing) = 0.03. What fraction of emails are caught by at least one filter? What fraction are caught by the spam filter but not the phishing detector?
Suppose you extend the model to three alert types. What additional quantities do you need to compute P(at least one alert fires) using inclusion-exclusion?
