Correlation vs Causation

Apr 18, 2026 · 12 min read · By Mohammed Vasim
Statistics · Math · Data Science

Models with more parameters tend to achieve higher validation accuracy. The correlation is real and measurable. Does that mean adding more parameters to your model causes better accuracy? If you act on that belief and scale up a model trained on a small dataset, you will likely get worse accuracy — not better. The correlation was true. The causal claim was wrong.

The Core Distinction

Correlation: a statistical association between two variables X and Y. Measured by Pearson r, Spearman ρ, or mutual information. "When X is high, Y tends to be high (or low)." Correlation is a property of observed data.

Causation: X causes Y means: if you intervene on X — change X while holding everything else fixed — Y will change. This is a claim about a mechanism, not a pattern. Causation requires you to reason about what would happen in a world where you actively set X to a specific value.

Four ways correlation arises:

| Source | Description | ML Example |
|---|---|---|
| X → Y (direct causation) | X genuinely causes Y | Regularization → prevents overfitting |
| Y → X (reverse causation) | Arrow is backward | Hard tasks → require longer training (not: longer training → harder tasks) |
| Z → X, Z → Y (confounding) | Third variable causes both | Dataset size → allows bigger models AND → higher accuracy |
| Random chance | Spurious, especially with many variables | Cheese consumption correlated with bedsheet deaths |

Confounders: The Hidden Third Variable

A confounder Z can create an observed correlation between X and Y even when neither variable has any causal effect on the other.

Formal definition: Z is a confounder of the X → Y relationship if:

  • Z causes X (or is associated with X)
  • Z causes Y, independently of X
  • Z is not on the causal pathway X → Z → Y (it is not a mediator)

[Figure: two DAGs side by side. Confounded (spurious correlation): Z (dataset size) points to both X (complexity) and Y (accuracy), producing a spurious X-Y link. Causal (after removing the confounder): X (complexity) → Y (accuracy).]

The Anchor Example: Dataset Size as Confounder

  • X = model parameter count (complexity)
  • Y = validation accuracy
  • Z = dataset size

Large datasets allow larger models to be trained effectively (Z → X). Larger datasets also produce higher accuracy regardless of model complexity (Z → Y). The result: model complexity and accuracy are positively correlated in observational data — but the correlation is largely driven by dataset size.

The within-stratum test: within each dataset size class (small, medium, large), the correlation between complexity and accuracy may be near zero or negative (overfitting in small datasets). The overall correlation appears only because large datasets co-occur with both large models and high accuracy. This is the confounder's signature: aggregate correlation, no within-group correlation.
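The within-stratum test can be sketched with simulated data. Everything below is an assumption made for illustration: the three dataset-size strata, the coefficients, and the noise levels are hypothetical, and accuracy is generated with no dependence on model size at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_stratum = 2000

# Hypothetical strata for dataset size Z: 0 = small, 1 = medium, 2 = large.
z = np.repeat([0, 1, 2], n_per_stratum)

# Z -> X: larger datasets go with larger models (log10 of parameter count).
log_params = 6.0 + 1.0 * z + rng.normal(0.0, 0.3, size=z.size)

# Z -> Y: larger datasets go with higher accuracy. X does not appear here,
# so by construction complexity has no causal effect on accuracy.
accuracy = 0.60 + 0.10 * z + rng.normal(0.0, 0.03, size=z.size)


def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


aggregate_r = pearson_r(log_params, accuracy)
within_r = {
    name: pearson_r(log_params[z == level], accuracy[z == level])
    for level, name in enumerate(["small", "medium", "large"])
}

print(f"aggregate r = {aggregate_r:+.2f}")
for name, r in within_r.items():
    print(f"within {name:<6} r = {r:+.2f}")
```

The aggregate correlation comes out strongly positive while each within-stratum correlation sits near zero, which is the signature described above.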

Reverse Causation

The causal arrow is backward. X and Y are correlated, but Y causes X, not X causes Y.

ML example: model accuracy is correlated with training duration. Does training longer cause higher accuracy? Sometimes. But also: harder tasks (lower baseline accuracy) require longer training — the difficulty causes the training duration. Both directions exist simultaneously in observational data. The correlation does not distinguish them.

How to identify reverse causation: consider the causal mechanism. Does it make physical/logical sense for X to precede and cause Y? Consider temporal order — cause must precede effect. If X and Y are measured at the same time, temporal ordering cannot help.

Spurious Correlations

With enough variables, correlations arise purely by chance — no mechanism, no confounder, just sampling noise.

Expected false positives in high-dimensional feature selection: with p=1,000 features, the number of feature pairs is C(1000, 2) = 499,500. If every feature is pure noise and each pair is tested at α=0.05, the expected number of significant correlations is:

0.05 × 499,500 = 24,975 "significant" correlations in pure noise

A Bonferroni-corrected threshold of α = 0.05/499,500 ≈ 10⁻⁷ keeps the expected number of false positives at or below 0.05, but in small samples it also leaves little power to detect real, modest correlations.
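A small simulation makes the counting argument concrete. This is a sketch under stated assumptions: it uses 200 noise features and 50 samples (hypothetical sizes, chosen to keep it fast) rather than p=1,000, and it swaps in the Fisher z approximation for the significance threshold instead of an exact t-test.

```python
import math
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 200   # hypothetical sizes, smaller than p=1,000
alpha = 0.05

# Pure noise: every feature is independent of every other feature.
X = rng.standard_normal((n_samples, n_features))
corr = np.corrcoef(X, rowvar=False)
upper = np.triu_indices(n_features, k=1)   # count each pair once
abs_r = np.abs(corr[upper])
n_pairs = abs_r.size                       # C(200, 2) = 19,900


def critical_r(level):
    """Smallest |r| called significant at a two-sided level (Fisher z approximation)."""
    z_crit = NormalDist().inv_cdf(1.0 - level / 2.0)
    return math.tanh(z_crit / math.sqrt(n_samples - 3))


uncorrected_hits = int(np.sum(abs_r > critical_r(alpha)))
bonferroni_hits = int(np.sum(abs_r > critical_r(alpha / n_pairs)))

print(f"pairs tested:              {n_pairs}")
print(f"expected false positives:  {alpha * n_pairs:.0f}")
print(f"observed at alpha = 0.05:  {uncorrected_hits}")
print(f"observed after Bonferroni: {bonferroni_hits}")
```

The uncorrected count should land close to 5% of the pairs even though no feature is related to any other.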

Why this matters in ML feature selection: never select features based on correlation with the target without controlling for other variables or applying multiple testing corrections. Spurious correlations pollute feature sets and reduce generalization. The feature that "predicts well" on training data due to spurious correlation will fail at deployment.

Simpson's Paradox

An association that appears in aggregate data reverses when disaggregated by a third variable.

Concrete Numerical Example

Two models evaluated on hard and easy examples:

| Model | Hard Examples | Easy Examples | Overall |
|---|---|---|---|
| Model A | 0.70 (n=20) | 0.90 (n=80) | (14 + 72)/100 = 0.86 |
| Model B | 0.72 (n=80) | 0.92 (n=20) | (57.6 + 18.4)/100 = 0.76 |

Overall: Model A (0.86) appears better than Model B (0.76).

Within each stratum: Model B is better on both hard (0.72 > 0.70) and easy (0.92 > 0.90) examples.

The reversal occurs because Model A was evaluated on mostly easy examples (80 out of 100), while Model B was evaluated on mostly hard examples (80 out of 100). The example difficulty is a confounder. The aggregate comparison is misleading.

[Figure: Simpson's paradox, aggregate (misleading) vs stratified (correct). Aggregate: Model A 0.86, Model B 0.76, so A looks better. Stratified: hard examples A 0.70 vs B 0.72; easy examples A 0.90 vs B 0.92, so B is better in both strata.]

The rule: the aggregate comparison is misleading because evaluation difficulty is a confounder. The correct comparison controls for difficulty.
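One way to control for difficulty is direct standardization: score both models against the same difficulty mix. The sketch below reuses the numbers from the table above. The pooled mix (100 hard, 100 easy) is an assumption chosen for illustration; because Model B wins in both strata, any shared mix gives the same ranking.

```python
# Per-stratum accuracies from the example.
accuracy = {
    "A": {"hard": 0.70, "easy": 0.90},
    "B": {"hard": 0.72, "easy": 0.92},
}
# Number of examples each model was actually evaluated on.
counts = {
    "A": {"hard": 20, "easy": 80},
    "B": {"hard": 80, "easy": 20},
}


def weighted_accuracy(model, weights):
    """Average a model's per-stratum accuracy using the given stratum weights."""
    total = sum(weights.values())
    return sum(accuracy[model][d] * w for d, w in weights.items()) / total


# Aggregate view: each model is weighted by its own, different difficulty mix.
aggregate = {m: weighted_accuracy(m, counts[m]) for m in accuracy}

# Standardized view: both models are weighted by one shared (pooled) mix.
pooled = {d: counts["A"][d] + counts["B"][d] for d in ("hard", "easy")}
standardized = {m: weighted_accuracy(m, pooled) for m in accuracy}

print("aggregate:   ", {m: round(v, 2) for m, v in aggregate.items()})
print("standardized:", {m: round(v, 2) for m, v in standardized.items()})
```

Under the pooled mix Model B comes out ahead (0.82 vs 0.80), matching the stratified comparison.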

Simpson's paradox in ML:

  • Class imbalance: a model with better overall accuracy may be worse on the minority class (the one you actually care about)
  • Multi-dataset evaluation: a model trained on large clean datasets looks better overall but performs worse on the hard cases — the dataset difficulty distribution confounds the aggregate comparison
  • Group fairness: when two models are evaluated on populations with different subgroup mixes, the model with better overall accuracy may be worse for every demographic subgroup (the same structure as the famous Berkeley admissions example)

Randomized Controlled Experiments: The Solution

The most reliable way to establish causation from data is to intervene:

Randomized Controlled Experiment (RCE):

  1. Randomly assign units to treatment (X=1, new model) or control (X=0, current model)
  2. Random assignment breaks confounding — Z is now independent of X by construction
  3. Compare Y between groups — any systematic difference is causal (up to sampling error)

Why randomization eliminates confounders: in observational data, larger models are trained on larger datasets (Z → X). In a randomized experiment, you randomly assign model sizes to datasets of all sizes, so Z is no longer associated with X. Removing the edge Z → X closes the backdoor path X ← Z → Y, and any remaining correlation between X and Y must be causal (or sampling noise).
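The effect of randomization can be sketched by simulating the same outcome model twice, once with confounded treatment assignment and once with a coin flip. The true effect of 0.5, the confounder strength, and the sample size are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.5            # hypothetical causal effect of X on Y

z = rng.standard_normal(n)   # confounder


def outcome(x):
    """Y depends on the treatment X and, more strongly, on the confounder Z."""
    return true_effect * x + 2.0 * z + rng.standard_normal(n)


def difference_in_means(x, y):
    """Mean outcome of treated units minus mean outcome of control units."""
    return float(y[x == 1].mean() - y[x == 0].mean())


# Observational world: units with high Z are more likely to be treated (Z -> X).
x_obs = (z + rng.standard_normal(n) > 0).astype(int)
naive_estimate = difference_in_means(x_obs, outcome(x_obs))

# Randomized world: a coin flip decides treatment, so X is independent of Z.
x_rand = rng.integers(0, 2, size=n)
randomized_estimate = difference_in_means(x_rand, outcome(x_rand))

print(f"true effect:         {true_effect:.2f}")
print(f"observational diff:  {naive_estimate:.2f}")
print(f"randomized diff:     {randomized_estimate:.2f}")
```

The observational difference is inflated by the confounder, while the randomized difference should sit close to the true effect, up to sampling error.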

This is what A/B testing implements — see the A/B Testing post in this series.

Observational Causal Inference

When randomization is impossible (ethical, economic, or practical constraints), these methods estimate causal effects from observational data:

  1. Matching: pair each treated unit with a similar control unit on observed confounders. Controls for observed confounders; unobserved confounders remain a threat.
  2. Regression discontinuity: exploit a threshold rule (models above a parameter count receive a different training scheme) to compare just-above vs just-below the threshold — the local comparison is approximately randomized.
  3. Difference-in-differences: compare pre/post changes between treated and untreated groups. Requires parallel trends assumption — both groups would have moved identically without treatment.
  4. Instrumental variables: use an instrument, a variable that affects X, affects Y only through X, and shares no confounders with Y. The instrument acts as a "natural randomizer" for X.

All observational methods require strong assumptions that cannot be verified from the data alone. They are always weaker than a well-designed RCE.
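As an illustration of one of these methods, here is a minimal instrumental-variables sketch on simulated data, using the simple ratio (Wald) estimator. The data-generating process, the binary instrument, and the true effect of 2.0 are assumptions made for the example; the instrument is valid here only because the simulation builds it that way.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_effect = 2.0                          # hypothetical effect of X on Y

u = rng.standard_normal(n)                 # unobserved confounder
instrument = rng.integers(0, 2, size=n)    # affects X, never Y directly
x = 1.0 * instrument + u + rng.standard_normal(n)
y = true_effect * x + u + rng.standard_normal(n)


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return float(np.cov(a, b)[0, 1])


# Naive regression slope of Y on X: biased because U drives both.
ols_slope = cov(x, y) / cov(x, x)

# Wald / IV estimator: uses only the variation in X induced by the instrument.
iv_estimate = cov(instrument, y) / cov(instrument, x)

print(f"true effect: {true_effect:.2f}")
print(f"naive slope: {ols_slope:.2f}")
print(f"IV estimate: {iv_estimate:.2f}")
```

With real data, the requirement that the instrument affects Y only through X is exactly the kind of assumption that cannot be checked from the data alone.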

DAGs: Three Key Structures

[Figure: three DAG structures. Confounder Z (condition on Z): backdoor path X ← Z → Y, blocked by conditioning on Z. Collider C (do NOT condition on C): X → C ← Y, where conditioning on C opens a spurious X-Y association. Mediator M (do NOT condition when you want the total effect): X → M → Y, where conditioning on M blocks the causal path.]

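Backdoor path: a non-causal path from X to Y that begins with an arrow into X, most commonly through a confounder (X ← Z → Y). Block it by conditioning on Z. The variables you condition on to block every backdoor path form the adjustment set for identifying the causal effect.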
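Conditioning on Z can be sketched as regression adjustment on simulated data. The linear data-generating process and its coefficients are hypothetical, and regression is only one way to condition on a confounder (stratification and matching are others).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.5                        # hypothetical causal effect of X on Y

z = rng.standard_normal(n)               # observed confounder
x = 1.5 * z + rng.standard_normal(n)     # Z -> X
y = true_effect * x + 2.0 * z + rng.standard_normal(n)   # X -> Y and Z -> Y


def ols_slopes(columns, target):
    """Least-squares fit with an intercept; returns the slopes in column order."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]


unadjusted = float(ols_slopes([x], y)[0])     # backdoor path X <- Z -> Y open
adjusted = float(ols_slopes([x, z], y)[0])    # backdoor path blocked by Z

print(f"true effect:      {true_effect:.2f}")
print(f"unadjusted slope: {unadjusted:.2f}")
print(f"adjusted slope:   {adjusted:.2f}")
```

The adjusted slope recovers the true effect here only because Z is the sole confounder and it is observed.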

Collider: a variable C where two arrows collide (X → C ← Y). Do NOT condition on a collider — conditioning on a collider opens a spurious association between X and Y. Example: conditioning on "hired" when studying gender and skill creates a spurious gender-skill correlation among hired candidates.
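Collider bias is easy to reproduce in simulation. In this sketch the two traits are generated independently, and the hiring rule (hire when the sum of the traits exceeds 1.0) is a hypothetical stand-in for any selection process that depends on both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Two traits that are independent in the full applicant pool (hypothetical).
trait_a = rng.standard_normal(n)
trait_b = rng.standard_normal(n)

# Collider: being hired depends on both traits (A -> hired <- B).
hired = (trait_a + trait_b) > 1.0

r_all = float(np.corrcoef(trait_a, trait_b)[0, 1])
r_hired = float(np.corrcoef(trait_a[hired], trait_b[hired])[0, 1])

print(f"correlation, all applicants: {r_all:+.2f}")
print(f"correlation, hired only:     {r_hired:+.2f}")
```

Among the hired, a candidate who is low on one trait must be high on the other to have cleared the bar, so a negative correlation appears even though none exists in the full pool.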

Mediator: a variable M on the causal path (X → M → Y). Conditioning on M blocks the causal effect of X on Y through M. If you want the total causal effect of X on Y, do not condition on mediators.
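The difference between the total effect and the effect that remains after conditioning on a mediator can be sketched with a linear simulation. The path coefficients (0.8, 0.5, 0.3) are hypothetical; in this setup the total effect is the direct path plus the indirect path, 0.3 + 0.8 × 0.5 = 0.7.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

x = rng.standard_normal(n)
m = 0.8 * x + rng.standard_normal(n)             # X -> M
y = 0.3 * x + 0.5 * m + rng.standard_normal(n)   # X -> Y directly, and M -> Y
total_effect = 0.3 + 0.8 * 0.5                   # direct + indirect = 0.7


def slope_on_x(*covariates):
    """Coefficient on X from a least-squares fit of Y on X plus any covariates."""
    design = np.column_stack([np.ones(n), x, *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[1])


without_mediator = slope_on_x()    # estimates the total effect
with_mediator = slope_on_x(m)      # only the direct path remains

print(f"total effect (by construction): {total_effect:.2f}")
print(f"slope without conditioning on M: {without_mediator:.2f}")
print(f"slope conditioning on M:         {with_mediator:.2f}")
```

Conditioning on M removes the part of the effect that flows through it, leaving only the direct path.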

Code

```python
import pandas as pd

# Recreate the Simpson's paradox data
data = {
    'model': ['A']*100 + ['B']*100,
    'difficulty': ['hard']*20 + ['easy']*80 + ['hard']*80 + ['easy']*20,
    'accuracy': (
        [0.70]*20 + [0.90]*80 +   # Model A: 20 hard, 80 easy
        [0.72]*80 + [0.92]*20     # Model B: 80 hard, 20 easy
    )
}
df = pd.DataFrame(data)

# Aggregate comparison (Simpson's paradox: misleading!)
print("Aggregate mean accuracy:")
print(df.groupby('model')['accuracy'].mean().round(4))
print()

# Stratified comparison (correct)
print("Mean accuracy by model AND difficulty:")
print(df.groupby(['model', 'difficulty'])['accuracy'].mean().round(4))
print()

# Group sizes: this reveals why the paradox occurs
print("Sample counts per model and difficulty:")
print(df.groupby(['model', 'difficulty']).size())
```

Output:

```
Aggregate mean accuracy:
model
A    0.86
B    0.76

Mean accuracy by model AND difficulty:
model  difficulty
A      easy          0.90
       hard          0.70
B      easy          0.92
       hard          0.72

Sample counts per model and difficulty:
model  difficulty
A      easy          80
       hard          20
B      easy          20
       hard          80
```

Model B is better in every stratum. Model A appears better in aggregate only because it was evaluated on more easy examples. The groupby reveals both the paradox and its cause.

Five Rules for ML Practitioners

  1. Correlation is not causation. A feature correlated with the target does not necessarily cause the outcome. Do not use correlational findings to make causal claims in reports or product decisions.

  2. Always look for confounders. Before concluding X causes Y: what third variable might cause both X and Y? Draw the DAG before running the regression.

  3. Disaggregate before concluding. High overall performance may hide poor subgroup performance. Always check performance across relevant subgroups (class, demographic, dataset difficulty) before reporting aggregate metrics.

  4. For causal claims, randomize. If you want to say "our new model causes higher CTR," run an A/B test. An observational comparison cannot support a causal claim.

  5. In observational settings, state assumptions explicitly. "This analysis assumes no unmeasured confounders" cannot be verified from the data alone, so state it rather than hide it. Reviewers can then assess whether the assumption is plausible.

Test Your Understanding

  1. A data scientist finds that users who see more than 5 recommendations per page have 30% higher purchase rates. They propose showing all users 10+ recommendations to increase revenue. Identify all possible causal structures that could explain this correlation, and explain why the proposal may not work.

  2. A classification model is reported to achieve 94% accuracy overall on a test set. A fairness audit breaks performance down by demographic group and finds 90% accuracy for group A (80% of the test set) and 68% accuracy for group B (20% of the test set). Compute the overall accuracy implied by these numbers. What is this an example of, and what is the correct way to report model performance?

  3. You evaluate 20 different feature engineering pipelines on the same dataset and report the best performing one (r=0.68 correlation with target). Compute the expected number of pipelines that would reach significance at α=0.05 purely by chance if the true correlation is zero. What should you do instead before deploying this pipeline?

  4. Your company A/B tests a new recommendation model (n=50,000 per group) and finds a statistically significant 0.5% CTR improvement (p=0.001). The business declares the model causes higher CTR. What does the A/B test actually establish about causation, and under what conditions would the causal claim fail even in a randomized experiment?

  5. In a DAG for predicting loan default, you include: income → loan_size → default, and income → default directly. A colleague proposes controlling for loan_size to "isolate" the direct income effect. If the goal is to estimate the total causal effect of income on default, is controlling for loan_size correct? What does conditioning on loan_size do to the causal estimate?
