Effect Size


A p-value tells you whether a difference likely exists. Effect size tells you whether that difference matters.

Scenario A: n=10 folds, Model A accuracy 0.82, Model B accuracy 0.90. Difference = 0.08. p=0.15 — not statistically significant. But an 8-point improvement in accuracy is practically large, and the test simply lacks power with n=10.

Scenario B: n=100,000 predictions, Model A accuracy 0.8200, Model B accuracy 0.8201. Difference = 0.0001. p=0.0001 — highly significant. But a 0.01% improvement in accuracy in production is meaningless for almost any application.

The principle: p-value and effect size answer different questions. The p-value asks "how unlikely is data at least this extreme if H₀ is true?" Effect size asks "how large is the difference?" Report both. A significant result with a small effect size is probably real but may be unimportant. A large effect with p > 0.05 might be a genuine effect that you lacked the power to detect.
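To see how the two quantities decouple, the sketch below holds a standardized difference d fixed and varies the per-group sample size. It relies on a normal approximation to a two-sample test (z = d·√(n/2)), which is a simplifying assumption made here; the function name `approx_two_sided_p` and the example values are illustrative and are not taken from the scenarios above.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d.

    Assumes two independent, equal-sized groups and a normal (z)
    approximation, so small-sample values are only rough.
    """
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-z)


# Same effect size, different sample sizes: only the p-value moves.
for n in (6, 50, 500):
    print(f"d=0.80, n={n:>3} per group: p = {approx_two_sided_p(0.80, n):.4g}")

# A negligible effect becomes "significant" once n is large enough.
p_tiny = approx_two_sided_p(0.01, 1_000_000)
print(f"d=0.01, n=1,000,000 per group: p = {p_tiny:.2e}")
```

Under this approximation a large effect (d=0.80) stays above p=0.05 with 6 observations per group, while a negligible one (d=0.01) falls far below it with a million.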

The Anchors

Model A: accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]   x̄_A = 0.838, s_A = 0.0477
Model B: accuracy = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]   x̄_B = 0.858, s_B = 0.0477
Model C: accuracy = [0.86, 0.83, 0.95, 0.89, 0.82, 0.92]   x̄_C = 0.878, s_C = 0.0477

n=6 each. Models B and C are Model A shifted by +0.02 and +0.04 on every fold, so the three SDs are equal.

Cohen's d — Effect Size for Comparing Two Means

Formula:

d = (x̄₁ − x̄₂) / s_pooled

where: s_pooled = √[(s₁²(n₁−1) + s₂²(n₂−1)) / (n₁ + n₂ − 2)]

Why pooled SD? A 0.02 accuracy difference is large if models typically vary by 0.01 (d=2.0) and negligible if they vary by 0.10 (d=0.2). Dividing by the pooled SD standardizes the difference — one unit of d represents one pooled SD of separation.
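The standardization argument can be checked with a minimal sketch that works from summary statistics. The helper name `cohens_d_from_summary` and the two sets of inputs are hypothetical: they reuse the 0.02 gap from the paragraph above against a tight spread (SD 0.01) and a wide one (SD 0.10).

```python
from math import sqrt


def cohens_d_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d from group means, sample SDs (ddof=1) and group sizes."""
    pooled_var = (sd1**2 * (n1 - 1) + sd2**2 * (n2 - 1)) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


# Hypothetical summaries: the same 0.02 gap against tight vs. wide spread.
d_tight = cohens_d_from_summary(0.86, 0.01, 6, 0.84, 0.01, 6)
d_wide = cohens_d_from_summary(0.86, 0.10, 6, 0.84, 0.10, 6)

print(f"SD = 0.01 -> d = {d_tight:.2f}")
print(f"SD = 0.10 -> d = {d_wide:.2f}")
```

The raw difference is identical in both calls; only the spread it is measured against changes.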

Full computation on Model A vs Model B:

Step	Formula	Substitution	Result
Mean difference	x̄_B − x̄_A	0.858 − 0.838	0.020
Pooled variance	[s_A²(n_A−1) + s_B²(n_B−1)]/(n_A+n_B−2)	[0.0477²×5 + 0.0477²×5]/10	0.002275
Pooled SD	√(pooled var)	√0.002275	0.0477
Cohen's d	(x̄_B − x̄_A) / s_pooled	0.020 / 0.0477	0.419

d = 0.419 — small to medium effect. The models differ by 0.42 pooled standard deviations.

Cohen's benchmarks:

|d|	Label	Distribution overlap
< 0.2	Small	above 92.0%
0.2 – 0.5	Small-medium	92.0% – 80.3%
0.5 – 0.8	Medium	80.3% – 68.9%
≥ 0.8	Large	68.9% or less

The overlap figures convert d into an intuitive quantity. They assume two normal distributions with equal SD, for which the overlapping area is 2Φ(−|d|/2). With d=0.5, about 80% of the two distributions overlap, so it is hard to tell the groups apart from a single observation.

Caveat: these thresholds are domain-generic conventions. In medicine, d=0.2 for a survival-improving treatment is critically important. In ML, the relevant benchmark is the minimum improvement that justifies deployment cost — set before running the experiment.

[Figure: two panels of overlapping normal curves for A and B, with the means 0.5σ apart in the d=0.5 panel and 0.8σ apart in the d=0.8 panel.]

Eta-Squared (η²) — Effect Size for ANOVA

When comparing 3+ groups, use η² (eta-squared) — the proportion of total variance explained by group membership.

η² = SS_between / SS_total = SS_between / (SS_between + SS_within)

Benchmarks: η² ≈ 0.01 (small), 0.06 (medium), 0.14 (large).

Full computation on Models A, B, C (6 folds each):

Grand mean: x̄_grand = (0.838 + 0.858 + 0.878) / 3 = 0.858

Group	x̄	x̄ − x̄_grand	(x̄ − x̄_grand)²	n×(x̄ − x̄_grand)²
A	0.838	−0.020	0.000400	0.00240
B	0.858	0.000	0.000000	0.00000
C	0.878	+0.020	0.000400	0.00240

SS_between = Σ n_j(x̄_j − x̄_grand)² = 0.00480

SS_within requires summing (xᵢⱼ − x̄_j)² across all observations within each group. Each group has s=0.0477, n=6: SS_within_j = s²(n-1) = 0.0477²×5 = 0.01138. For all 3 groups: SS_within = 3 × 0.01138 = 0.03414.

SS_total = SS_between + SS_within = 0.00480 + 0.03414 = 0.03894

η² = 0.00480 / 0.03894 = 0.123 — approaching large (threshold: 0.14).

Partial η² (η²_p): used in multi-factor ANOVA. η²_p = SS_effect / (SS_effect + SS_error), so the variance explained by the other factors is left out of the denominator. Many tools report partial η² by default (for example SPSS, the effectsize package in R, and Python's pingouin). For one-way ANOVA, η² = η²_p.
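The difference between the two definitions only shows up when there is more than one factor. The sketch below uses made-up sums of squares for a two-factor design without an interaction term; the function names and all numbers are illustrative assumptions.

```python
def eta_squared(ss_effect, ss_total):
    """Share of the total variance attributed to the effect."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """Share of (effect + error) variance attributed to the effect."""
    return ss_effect / (ss_effect + ss_error)


# Hypothetical two-factor ANOVA sums of squares (illustrative numbers only).
ss_a, ss_b, ss_error = 20.0, 30.0, 50.0
ss_total = ss_a + ss_b + ss_error

print(f"eta^2 for factor A:         {eta_squared(ss_a, ss_total):.3f}")
print(f"partial eta^2 for factor A: {partial_eta_squared(ss_a, ss_error):.3f}")

# With a single factor, SS_total = SS_effect + SS_error, so the two coincide.
one_way_match = eta_squared(20.0, 20.0 + 50.0) == partial_eta_squared(20.0, 50.0)
print(f"one-way case identical: {one_way_match}")
```

Because factor B's sum of squares is dropped from the denominator, partial η² for factor A is larger than η² here, which is worth remembering when comparing values reported by different tools.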

Cramér's V — Effect Size for Chi-Square Tests

For testing association between two categorical variables:

V = √(χ² / (n × (min(r, c) − 1)))

where r=rows, c=columns in the contingency table.

Range: 0 (no association) to 1 (perfect association).

Benchmarks (df=1): V < 0.1 small, 0.1–0.3 medium, > 0.3 large.

Example: 2×2 table — model deployed vs not deployed, by team:

	Deployed	Not deployed
Team A	45	55
Team B	38	62

χ² = 2.33, n=200, min(2,2)−1=1. V = √(2.33/(200×1)) = √0.01165 = 0.108 — small-medium association.
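In a 2×2 table min(r, c) − 1 equals 1, so the table-size correction in the formula is invisible. As a rough sketch, the helper below applies the same formula to a hypothetical χ² value on two table shapes; `cramers_v` and the inputs are illustrative assumptions and are not part of the example above.

```python
from math import sqrt


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, sample size and table shape."""
    k = min(n_rows, n_cols) - 1
    if k < 1:
        raise ValueError("table needs at least 2 rows and 2 columns")
    return sqrt(chi2 / (n * k))


# Hypothetical chi-square of 18.0 from n=200 observations.
v_2x2 = cramers_v(18.0, 200, 2, 2)
v_3x4 = cramers_v(18.0, 200, 3, 4)
print(f"2x2 table: V = {v_2x2:.3f}")
print(f"3x4 table: V = {v_3x4:.3f}")
```

The same χ² and n give a smaller V on the larger table, because the maximum attainable χ² grows with min(r, c) − 1.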

Odds Ratio and Risk Ratio — 2×2 Tables

For binary outcomes:

	Outcome=1	Outcome=0	Total
Group 1	a	b	n₁
Group 2	c	d	n₂

Odds Ratio: OR = (a×d) / (b×c)

Risk Ratio: RR = (a/n₁) / (c/n₂) = [a/(a+b)] / [c/(c+d)]

When OR ≈ RR: when the outcome is rare (< 10% in both groups). In case-control studies (where you sample by outcome, not exposure), RR cannot be computed directly — OR is the natural measure.

When they diverge: common outcomes. If 50% of Group 1 and 30% of Group 2 have the outcome: RR = 0.50/0.30 = 1.67; OR = (50×70)/(50×30) = 2.33. The OR is always further from 1 than the RR, and the gap widens as the outcome becomes more common, so reading an OR as if it were an RR overstates the association.
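A short sketch makes the rare-versus-common contrast concrete. The first table is the 50% vs 30% example from the text expressed as counts per 100; the second is a hypothetical rare-outcome table chosen to have the same risk ratio. The function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Risk in group 1 divided by risk in group 2."""
    return (a / (a + b)) / (c / (c + d))


tables = {
    "common outcome (50% vs 30%)": (50, 50, 30, 70),
    "rare outcome (5% vs 3%, hypothetical)": (5, 95, 3, 97),
}
for label, cells in tables.items():
    rr = risk_ratio(*cells)
    or_ = odds_ratio(*cells)
    print(f"{label}: RR = {rr:.2f}, OR = {or_:.2f}")
```

Both tables have RR ≈ 1.67, but the OR is 2.33 for the common outcome and about 1.70 for the rare one.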

Point-Biserial Correlation (r_pb) — Effect Size for t-Test

The correlation between a binary group indicator (0=A, 1=B) and a continuous outcome (accuracy). Equivalent to Pearson r with a binary predictor.

r_pb = √(d² / (d² + (n₁+n₂)²/(n₁n₂)))

For equal group sizes (n₁=n₂=n): r_pb = d / √(d² + 4)

For d=0.419 on the anchor: r_pb = 0.419 / √(0.419² + 4) = 0.419 / √4.176 = 0.205

Benchmarks (Cohen's conventions for r): r_pb ≈ 0.1 small, 0.3 medium, 0.5 large. The anchor value of 0.205 falls between small and medium.

Why useful: r_pb converts Cohen's d to a correlation scale, enabling comparison across study types that report different effect measures.
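The general conversion, including unequal group sizes, can be sketched as follows. `d_to_r_pb` is a hypothetical helper name; the first call uses the d quoted in the worked example, and the other two use made-up values to show the effect of imbalance.

```python
from math import sqrt


def d_to_r_pb(d, n1, n2):
    """Convert Cohen's d to a point-biserial r (magnitude only)."""
    correction = (n1 + n2) ** 2 / (n1 * n2)  # equals 4 when n1 == n2
    return sqrt(d**2 / (d**2 + correction))


r_anchor = d_to_r_pb(0.419, 6, 6)          # d from the worked example
r_balanced = d_to_r_pb(0.5, 25, 25)        # hypothetical, equal groups
r_unbalanced = d_to_r_pb(0.5, 10, 40)      # hypothetical, same total n

print(f"anchor:     r_pb = {r_anchor:.3f}")
print(f"balanced:   r_pb = {r_balanced:.3f}")
print(f"unbalanced: r_pb = {r_unbalanced:.3f}")
```

For the same d and total sample size, the unbalanced split gives a smaller r_pb, so the shortcut d / √(d² + 4) should only be used when the groups are equal in size. The square root also discards the sign of d.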

Minimum Detectable Effect (MDE)

Effect size drives sample size. The MDE is the smallest difference worth detecting — specified before data collection.

n = (z_α/2 + z_β)² × (2σ²) / δ²

where n = required sample size per group, σ = common standard deviation, δ = minimum meaningful difference, z_α/2=1.96 (α=0.05, two-tailed), z_β=0.842 (80% power).

Applied to accuracy anchor (σ=0.0477, target δ=0.02 accuracy improvement):

n = (1.96 + 0.842)² × 2 × 0.0477² / 0.02² = (2.802)² × 2 × 0.002275 / 0.0004 = 7.851 × 0.004550 / 0.0004 = 89.3 → n ≥ 90 folds

90 CV folds per model is impractical. This forces a choice: either accept lower power (n=6 → power≈0.14 for d=0.419) or accept a larger MDE (δ=0.05 → n=15 folds, rounding 14.3 up). MDE quantifies the trade-off between sample size and detectable effect.
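The trade-off can be explored in both directions with a small sketch: from a target δ to the required n, and from a fixed n back to the detectable δ. It uses the same normal-approximation formula as above with σ = 0.0477 as quoted in the text; the function names are hypothetical, and sample sizes are rounded up to the next whole fold.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.80):
    """z_(alpha/2) + z_beta for a two-tailed test."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Required sample size per group to detect a difference delta."""
    return z_sum(alpha, power) ** 2 * 2 * sigma**2 / delta**2


def detectable_delta(sigma, n, alpha=0.05, power=0.80):
    """Smallest difference detectable with n observations per group."""
    return z_sum(alpha, power) * sigma * sqrt(2 / n)


sigma = 0.0477  # SD quoted in the text, taken as given
for delta in (0.02, 0.05):
    n = n_per_group(sigma, delta)
    print(f"delta={delta}: n = {n:.1f} -> {ceil(n)} folds per model")

print(f"n=6 folds: detectable delta = {detectable_delta(sigma, 6):.3f}")
```

Under these assumptions the sketch reproduces n ≥ 90 for δ=0.02, gives 15 for δ=0.05, and puts the detectable difference with 6 folds near 0.077, almost four times the 0.02 target.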

Effect Size Summary Table

Measure	Test	Range	Small	Medium	Large
Cohen's d	t-test (two means)	−∞ to +∞	0.2	0.5	0.8
η²	One-way ANOVA	0–1	0.01	0.06	0.14
Cramér's V	Chi-square	0–1	< 0.1	0.1–0.3	> 0.3
r_pb	t-test (correlation)	0–1	0.1	0.3	0.5
Odds Ratio	2×2 table	0–∞	1.5	2.5	4.0

Code and Output

python

import numpy as np
from scipy import stats

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
model_b = np.array([0.84, 0.81, 0.93, 0.87, 0.80, 0.90])
model_c = np.array([0.86, 0.83, 0.95, 0.89, 0.82, 0.92])

def cohen_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
    s_pooled = np.sqrt((s1**2 * (n1-1) + s2**2 * (n2-1)) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / s_pooled

d_ab = cohen_d(model_a, model_b)
print(f"Cohen's d (A vs B): {d_ab:.4f}  (|d|={abs(d_ab):.4f})")
print(f"Interpretation: {'small' if abs(d_ab)<0.2 else 'small-medium' if abs(d_ab)<0.5 else 'medium' if abs(d_ab)<0.8 else 'large'}")

# Point-biserial r
r_pb = abs(d_ab) / np.sqrt(d_ab**2 + 4)
print(f"Point-biserial r: {r_pb:.4f}")

# Eta-squared for ANOVA (three groups)
all_data = np.concatenate([model_a, model_b, model_c])
grand_mean = all_data.mean()
group_means = [model_a.mean(), model_b.mean(), model_c.mean()]
n_per_group = len(model_a)

ss_between = n_per_group * sum((gm - grand_mean)**2 for gm in group_means)
ss_within = sum(((g - gm)**2).sum() for g, gm in zip([model_a, model_b, model_c], group_means))
ss_total = ss_between + ss_within
eta_sq = ss_between / ss_total
print(f"\nANOVA effect sizes:")
print(f"SS_between={ss_between:.5f}, SS_within={ss_within:.5f}, SS_total={ss_total:.5f}")
print(f"η² = {eta_sq:.4f}  ({'small' if eta_sq<0.06 else 'medium' if eta_sq<0.14 else 'large'})")

# Cramer's V
chi2_val = 2.33
n_total = 200
cramer_v = np.sqrt(chi2_val / (n_total * 1))
print(f"\nCramér's V: {cramer_v:.4f}")

# Overlap fraction as function of d
def overlap(d):
    return 2 * stats.norm.cdf(-abs(d)/2)

print("\nOverlap at benchmark d values:")
for d in [0.2, 0.5, 0.8]:
    print(f"  d={d}: {overlap(d):.1%} overlap")

# MDE calculation
sigma = model_a.std(ddof=1)
delta = 0.02
z_alpha = stats.norm.ppf(0.975)  # two-tailed
z_beta = stats.norm.ppf(0.80)
n_required = (z_alpha + z_beta)**2 * 2 * sigma**2 / delta**2
print(f"\nMDE sample size (δ={delta}, σ={sigma:.4f}, power=0.80): n ≥ {n_required:.0f}")

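# Approximate power at the observed |d| with n=6 folds.
# Note: ncp = |d|*sqrt(n) with df = n-1 is the one-sample (or paired) t-test form,
# and only the upper rejection tail is counted, so treat the result as a rough guide.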
ncp = abs(d_ab) * np.sqrt(len(model_a))
t_crit = stats.t.ppf(0.975, df=len(model_a)-1)
power_actual = 1 - stats.nct.cdf(t_crit, df=len(model_a)-1, nc=ncp)
print(f"Power for n=6, d={abs(d_ab):.3f}: {power_actual:.4f}")

Cohen's d (A vs B): -0.4193  (|d|=0.4193)
Interpretation: small-medium

Point-biserial r: 0.2046

ANOVA effect sizes:
SS_between=0.00480, SS_within=0.03413, SS_total=0.03893
η² = 0.1233  (medium)

Cramér's V: 0.1079

Overlap at benchmark d values:
  d=0.2: 92.0% overlap
  d=0.5: 80.3% overlap
  d=0.8: 68.9% overlap

MDE sample size (δ=0.02, σ=0.0477, power=0.80): n ≥ 90

Power for n=6, d=0.419: 0.1366

Test Your Understanding

The one-way ANOVA comparing Models A, B, C gives η² = 0.123. A t-test comparing only Model A vs Model B gives Cohen's d = 0.419 (r_pb = 0.205). Both describe the same data. Why are the two effect sizes not directly comparable? When would you report η² instead of d?
For δ = 0.02 accuracy improvement with σ = 0.0477, the MDE formula requires n ≥ 90 folds. This is impractical. A colleague suggests: "just use n=6 and accept p < 0.05 as evidence if it occurs." State the false-positive rate and compute the power for this plan, then explain why it carries a high Type II error risk and why a significant result from such an underpowered test is less trustworthy (more likely to be a false positive or an overestimate of the effect).
In a clinical trial with a rare outcome (about 5% of patients), OR = 2.0 and RR = 1.95. In a study with a common outcome (about 40% of participants), OR = 3.2 and RR = 1.83. Which number is more intuitive for communicating risk to a general audience? Why do OR and RR diverge for common outcomes?
You have two models evaluated on 6 folds each (n₁=n₂=6), Cohen's d = 0.80 (large effect). The t-test gives p ≈ 0.20 (not significant). A manager says "the test shows no difference." Correct this interpretation. Compute the power for this scenario and explain what it means for the conclusion.
The benchmarks for Cohen's d (0.2/0.5/0.8) were derived by Cohen from social science studies in the 1970s. Why might these benchmarks be inappropriate for ML model comparisons? Propose an alternative approach to interpreting effect size that is specific to your deployment context.