ANOVA

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Statistics, Math, Data Science

You have three model variants (A, B, C) each evaluated on the same 6 CV folds. You want to know whether they truly differ in accuracy. The first instinct is to run pairwise t-tests — but that creates a hidden problem.

Why Not Multiple t-Tests?

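With k=3 models, there are C(3,2) = 3 pairwise comparisons. If you run 3 separate t-tests at α=0.05 and treat them as independent, the familywise error rate (FWER) is:

FWER = 1 − (1 − 0.05)³ = 1 − 0.857 = 0.143 (14.3%)

That is a 14% chance of at least one false positive, not 5%. With k=5 models (10 comparisons), FWER ≈ 40%. Pairwise tests that share groups are not strictly independent, so these figures are approximations, but the inflation is real.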
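The inflation is easy to tabulate. The sketch below uses a hypothetical helper, `familywise_error_rate`, and assumes the pairwise tests are independent, which is only approximately true in practice.

```python
from math import comb


def familywise_error_rate(k_groups: int, alpha: float = 0.05) -> float:
    """FWER for all pairwise tests among k groups, assuming independent tests."""
    n_comparisons = comb(k_groups, 2)  # C(k, 2) pairwise comparisons
    return 1 - (1 - alpha) ** n_comparisons


for k in (3, 4, 5):
    print(f"k={k}: {comb(k, 2)} comparisons, FWER = {familywise_error_rate(k):.3f}")
```

For k=3 and k=5 this reproduces the 14.3% and roughly 40% figures quoted above.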

ANOVA tests all groups simultaneously with a single F-statistic, keeping the overall Type I error at α=0.05.

H₀: μ_A = μ_B = μ_C (all population means are equal)
H₁: At least one μᵢ ≠ μⱼ (at least one pair differs)

ANOVA answers: "Is there any difference among the group means?" It does NOT identify which groups differ — that requires post-hoc tests.

The Anchor

python

model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]  # x̄_A = 0.838
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]  # x̄_B = 0.858
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]  # x̄_C = 0.798
# k=3 groups, n=6 per group, N=18 total

Core Idea: Partitioning Variance

ANOVA decomposes total variance into two components:

SS_total = SS_between + SS_within

SS_between (between-group): how much do the group means vary from each other? If models A, B, C have very different means, this is large.
SS_within (within-group / error): how much do individual fold scores vary within each group? This is fold-to-fold variability the model has no control over.
F-ratio: signal-to-noise. Large between-group variance relative to within-group variance → groups differ.

Step 1 — SS_between

SS_between = Σᵢ nᵢ × (x̄ᵢ − x̄)²

Compute the grand mean from all 18 values:

sum = 5.03 + 5.15 + 4.79 = 14.97
x̄ = 14.97 / 18 = 0.8317

Group means:

x̄_A = 5.03 / 6 = 0.8383
x̄_B = 5.15 / 6 = 0.8583
x̄_C = 4.79 / 6 = 0.7983

Three terms:

Model A: 6 × (0.8383 − 0.8317)² = 6 × (0.0067)² = 6 × 0.0000444 = 0.000267
Model B: 6 × (0.8583 − 0.8317)² = 6 × (0.0267)² = 6 × 0.000711 = 0.004267
Model C: 6 × (0.7983 − 0.8317)² = 6 × (−0.0333)² = 6 × 0.001111 = 0.006667

SS_between = 0.000267 + 0.004267 + 0.006667 = 0.01120

df_between = k − 1 = 3 − 1 = 2
MS_between = 0.01120 / 2 = 0.00560
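The hand calculation for Step 1 can be checked with a few lines of NumPy. This is a minimal sketch; the variable names are illustrative choices, not part of any library API.

```python
import numpy as np

groups = {
    "A": [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    "B": [0.84, 0.81, 0.93, 0.87, 0.80, 0.90],
    "C": [0.78, 0.75, 0.87, 0.81, 0.74, 0.84],
}

# Grand mean over all 18 fold scores
grand_mean = np.mean([x for scores in groups.values() for x in scores])

# One term per group: n_i * (group mean - grand mean)^2
terms = {name: len(s) * (np.mean(s) - grand_mean) ** 2 for name, s in groups.items()}
ss_between = sum(terms.values())
df_between = len(groups) - 1
ms_between = ss_between / df_between

for name, term in terms.items():
    print(f"Model {name}: {term:.6f}")
print(f"SS_between = {ss_between:.5f}, MS_between = {ms_between:.5f}")
```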

Step 2 — SS_within

SS_within = Σᵢ Σⱼ (xᵢⱼ − x̄ᵢ)²

All 18 deviations:

Fold	Model A	A dev	A dev²	Model B	B dev	B dev²	Model C	C dev	C dev²
1	0.82	−0.018	0.000336	0.84	−0.018	0.000336	0.78	−0.018	0.000336
2	0.79	−0.048	0.002336	0.81	−0.048	0.002336	0.75	−0.048	0.002336
3	0.91	+0.072	0.005136	0.93	+0.072	0.005136	0.87	+0.072	0.005136
4	0.85	+0.012	0.000136	0.87	+0.012	0.000136	0.81	+0.012	0.000136
5	0.78	−0.058	0.003403	0.80	−0.058	0.003403	0.74	−0.058	0.003403
6	0.88	+0.042	0.001736	0.90	+0.042	0.001736	0.84	+0.042	0.001736

SS_A = 0.013083, SS_B = 0.013083, SS_C = 0.013083 (identical because B = A + 0.02 per fold, C = A − 0.04 per fold — constant shifts don't change within-group variance)

SS_within = 3 × 0.013083 = 0.039250

df_within = N − k = 18 − 3 = 15
MS_within = 0.039250 / 15 = 0.002617
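A short sketch of Step 2 that also illustrates the shift-invariance remark: Models B and C are rebuilt here as constant offsets of Model A (an equivalent construction of the same data), and each yields the same within-group sum of squares. The helper `sum_sq_dev` is a hypothetical name.

```python
import numpy as np

model_a = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])


def sum_sq_dev(x) -> float:
    """Sum of squared deviations from the group's own mean."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - x.mean()) ** 2))


# B = A + 0.02 and C = A - 0.04 on every fold
groups = {"A": model_a, "B": model_a + 0.02, "C": model_a - 0.04}

ss_by_group = {name: sum_sq_dev(x) for name, x in groups.items()}
ss_within = sum(ss_by_group.values())
df_within = sum(len(x) for x in groups.values()) - len(groups)
ms_within = ss_within / df_within

print({name: round(ss, 6) for name, ss in ss_by_group.items()})
print(f"SS_within = {ss_within:.6f}, df = {df_within}, MS_within = {ms_within:.6f}")
```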

Step 3 — SS_total Verification

SS_total = Σᵢⱼ (xᵢⱼ − x̄)² computed directly = 0.050450

Check: SS_between + SS_within = 0.01120 + 0.039250 = 0.050450 ✓

df_total = N − 1 = 17
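The same verification can be run numerically. The sketch below stacks the scores into a 3×6 array (an assumption that relies on equal group sizes) and confirms that both the sums of squares and the degrees of freedom add up.

```python
import numpy as np

# Rows are models A, B, C; columns are the 6 folds
scores = np.array([
    [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    [0.84, 0.81, 0.93, 0.87, 0.80, 0.90],
    [0.78, 0.75, 0.87, 0.81, 0.74, 0.84],
])
k, n = scores.shape
grand_mean = scores.mean()
group_means = scores.mean(axis=1, keepdims=True)

ss_total = np.sum((scores - grand_mean) ** 2)
ss_between = np.sum(n * (group_means - grand_mean) ** 2)
ss_within = np.sum((scores - group_means) ** 2)

# The partition must hold up to floating-point error
assert np.isclose(ss_total, ss_between + ss_within)
assert (k * n - 1) == (k - 1) + (k * n - k)  # df_total = df_between + df_within

print(f"SS_total = {ss_total:.5f} = {ss_between:.5f} + {ss_within:.5f}")
```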

The ANOVA Table

Source	df	SS	MS	F
Between (Model)	2	0.01120	0.00560	2.140
Within (Error)	15	0.03925	0.002617	—
Total	17	0.05045	—	—

The F-Ratio and Decision

F = MS_between / MS_within = 0.00560 / 0.002617 = 2.140

Why F follows the F-distribution: assuming normally distributed errors with a common variance σ², MS_within/σ² ~ χ²(N−k)/(N−k), and under H₀ MS_between/σ² ~ χ²(k−1)/(k−1), with the two independent of each other. The ratio of two independent chi-square variates, each divided by its degrees of freedom, follows F(k−1, N−k). The test is always right-tailed: large values of F indicate that the group means differ more than random chance would produce.

F_critical(df₁=2, df₂=15, α=0.05) = 3.682
p-value: P(F(2,15) > 2.140) = 0.152

Decision: F = 2.140 < F_critical = 3.682 and p = 0.152 > 0.05. Fail to reject H₀.
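The critical value and p-value come straight from the F-distribution. A minimal sketch using `scipy.stats.f`, with the sums of squares from the ANOVA table typed in by hand:

```python
from scipy import stats

ss_between, df_between = 0.0112, 2
ss_within, df_within = 0.03925, 15
alpha = 0.05

f_stat = (ss_between / df_between) / (ss_within / df_within)
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)  # right-tail cutoff
p_value = stats.f.sf(f_stat, df_between, df_within)     # P(F > f_stat)

print(f"F = {f_stat:.4f}, F_critical = {f_crit:.3f}, p = {p_value:.4f}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```

Comparing F against the critical value and comparing p against α are the same decision expressed two ways.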

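With n=6 folds per model, the within-group variability (fold-to-fold noise: s = √MS_within ≈ 0.051) is large relative to the between-group signal (pairwise mean differences of 0.02–0.06). This points to low power rather than the absence of an effect: the models may genuinely differ, but this test cannot confirm it with so few folds. Note also that one-way ANOVA treats the three groups as independent, so fold difficulty shared by all three models is counted as noise.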
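To get a feel for the power issue, here is a rough what-if sketch. It assumes, as an idealization, that the observed group means and MS_within stay exactly the same as folds are added; real estimates would move. The helper `anova_at` is a hypothetical name.

```python
from scipy import stats

group_means = [5.03 / 6, 5.15 / 6, 4.79 / 6]  # observed means of A, B, C
ms_within = 0.03925 / 15                      # observed within-group mean square
k = len(group_means)

grand_mean = sum(group_means) / k             # valid because group sizes are equal
spread = sum((m - grand_mean) ** 2 for m in group_means)


def anova_at(n_folds: int):
    """F and p if the same means and MS_within were observed with n_folds per model."""
    f_stat = (n_folds * spread / (k - 1)) / ms_within
    p_value = stats.f.sf(f_stat, k - 1, k * (n_folds - 1))
    return f_stat, p_value


for n_folds in (6, 8, 10, 12):
    f_stat, p_value = anova_at(n_folds)
    print(f"n={n_folds:2d}: F = {f_stat:.3f}, p = {p_value:.4f}")
```

SS_between grows in proportion to n while MS_within stays fixed, so F grows linearly with the number of folds under this assumption.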

Effect Size: η²

η² = SS_between / SS_total = 0.01120 / 0.05045 = 0.222

η²	Interpretation
0.01	Small
0.06	Medium
0.14	Large

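η²=0.222 is a large effect by these benchmarks: in this sample, model choice accounts for 22% of the total variation in accuracy. A large observed effect with a non-significant p-value means the estimate is imprecise, not that the effect is confirmed; η² is also biased upward in small samples. If the same group means and fold-to-fold variance held with more folds, the test would reach significance.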

Post-Hoc Tests: Tukey HSD

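Convention: post-hoc tests are usually run only after a significant ANOVA, and since F=2.14 is not significant, an analysis would normally stop here. Strictly speaking, Tukey HSD controls the familywise error rate by itself and does not need a significant F to be valid (that gate matters for unprotected procedures such as Fisher's LSD). We show the mechanics for illustration only.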

HSD = q_α × √(MS_within / n)

With q(α=0.05, k=3, df_within=15) = 3.67 and MS_within = 0.002617, n=6:

HSD = 3.67 × √(0.002617 / 6) = 3.67 × √0.000436 = 3.67 × 0.02089 = 0.0767

Pairwise differences:

|x̄_B − x̄_A| = 0.020 < 0.077 → not significant
|x̄_A − x̄_C| = 0.040 < 0.077 → not significant
|x̄_B − x̄_C| = 0.060 < 0.077 → not significant

Consistent with the non-significant ANOVA.
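The HSD threshold can be reproduced without a q table. This sketch assumes SciPy 1.7 or later, where `scipy.stats.studentized_range` is available; the group means and MS_within are entered from the earlier steps.

```python
from itertools import combinations

import numpy as np
from scipy.stats import studentized_range

k, n, df_within = 3, 6, 15
ms_within = 0.03925 / df_within
means = {"A": 5.03 / 6, "B": 5.15 / 6, "C": 4.79 / 6}

# Critical value of the studentized range at alpha = 0.05
q_crit = studentized_range.ppf(0.95, k, df_within)
hsd = q_crit * np.sqrt(ms_within / n)
print(f"q = {q_crit:.3f}, HSD = {hsd:.4f}")

# A pair is significant only if its mean difference exceeds HSD
significant = {}
for (name_1, mean_1), (name_2, mean_2) in combinations(means.items(), 2):
    diff = abs(mean_1 - mean_2)
    significant[(name_1, name_2)] = diff > hsd
    print(f"|{name_1} - {name_2}| = {diff:.3f} -> significant: {diff > hsd}")
```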

Python Code

python

from scipy import stats
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

model_a = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
model_b = [0.84, 0.81, 0.93, 0.87, 0.80, 0.90]
model_c = [0.78, 0.75, 0.87, 0.81, 0.74, 0.84]

# Quick F-test
f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
print(f"F={f_stat:.4f}, p={p_value:.4f}")

# Full ANOVA table via statsmodels
data = pd.DataFrame({
    'accuracy': model_a + model_b + model_c,
    'model': ['A']*6 + ['B']*6 + ['C']*6
})
lm = ols('accuracy ~ C(model)', data=data).fit()
print(sm.stats.anova_lm(lm, typ=1))

# Effect size eta-squared
all_vals = np.array(model_a + model_b + model_c)
grand_mean = all_vals.mean()
group_means = np.array([np.mean(model_a), np.mean(model_b), np.mean(model_c)])
ss_between = 6 * np.sum((group_means - grand_mean)**2)
ss_total = np.sum((all_vals - grand_mean)**2)
eta_sq = ss_between / ss_total
print(f"\nη² = {eta_sq:.4f}")

# Tukey HSD (for illustration — ANOVA is not significant)
result = stats.tukey_hsd(model_a, model_b, model_c)
print("\nTukey HSD p-values:")
labels = ['A', 'B', 'C']
for i in range(3):
    for j in range(i+1, 3):
        print(f"  Model {labels[i]} vs {labels[j]}: p={result.pvalue[i][j]:.4f}")

# Kruskal-Wallis (non-parametric alternative)
H_stat, p_kruskal = stats.kruskal(model_a, model_b, model_c)
print(f"\nKruskal-Wallis: H={H_stat:.4f}, p={p_kruskal:.4f}")

text

F=2.1401, p=0.1522

              df    sum_sq    mean_sq         F    PR(>F)
C(model)     2.0  0.011200  0.005600  2.140127  0.152173
Residual    15.0  0.039250  0.002617       NaN       NaN

η² = 0.2220

Tukey HSD p-values:
  Model A vs B: p=0.5113
  Model A vs C: p=0.2461
  Model B vs C: p=0.0959

Kruskal-Wallis: H=3.4734, p=0.1761

Writing Up Results

One-way ANOVA revealed no statistically significant effect of model variant on CV accuracy, F(2,15) = 2.14, p = 0.152, η² = 0.222. With only n=6 folds per model, the within-group fold-to-fold variability (s≈0.051) is large relative to the between-model differences (0.02–0.06 accuracy points). The large η²=0.222 suggests the effect may be practically meaningful: deployment decisions should be informed by the effect size alongside the p-value, and a larger evaluation (n≥30 folds) would likely reach significance if the observed differences hold.

Test Your Understanding

You add a fourth model (Model D = [0.86, 0.83, 0.95, 0.89, 0.82, 0.92]). How do the three df values (df_between, df_within, df_total) change? Compute the new grand mean and SS_between for all four models.
F = MS_between / MS_within = 2.14. If you tripled n to 18 folds per model (while keeping the same group means), how would SS_between, SS_within, and F change? Would the test reach significance at α=0.05?
η² = 0.222 is "large" but p = 0.152 is not significant. Explain the apparent contradiction: how can an effect be large but not statistically significant?
The Kruskal-Wallis test gives H=3.47, p=0.176. The parametric ANOVA gives F=2.14, p=0.152. They reach the same non-significant conclusion but with different test statistics. When would you prefer Kruskal-Wallis over ANOVA, and what assumption does ANOVA require that Kruskal-Wallis does not?
The three group SS values are all equal (SS_A = SS_B = SS_C = 0.013083). Why? If you changed one fold score in Model B from 0.93 to 0.99, how would this affect SS_B specifically, and would it increase or decrease F?
