ANOVA
You have three model architectures — a baseline CNN, a ResNet variant, and a transformer — each evaluated on 5 random seeds. You want to know whether they truly differ in performance. Running three pairwise t-tests would work mechanically, but with 5 architectures it becomes 10 tests, and with 10 it becomes 45. Each test carries a 5% false positive risk, so running many of them inflates your chance of declaring a spurious winner. ANOVA solves this by testing all groups simultaneously with a single test statistic.
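To make the inflation concrete, here is a minimal sketch of the family-wise error rate for all pairwise tests among k architectures. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so treat the numbers as a rough guide. The helper names `n_pairwise_tests` and `family_wise_error_rate` are illustrative, not from any library.

```python
from math import comb

def n_pairwise_tests(k: int) -> int:
    """Number of distinct pairs among k groups (k choose 2)."""
    return comb(k, 2)

def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

for k in (3, 5, 10):
    m = n_pairwise_tests(k)
    print(f"{k} architectures -> {m} tests, FWER ~ {family_wise_error_rate(m):.2f}")
```

Under that independence assumption, three tests already carry roughly a 14% chance of at least one false positive, and 45 tests roughly 90%.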
The Core Idea: Signal to Noise
ANOVA does not compare means directly. It compares variances. Specifically:
$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_B / (k - 1)}{SS_W / (N - k)}$$
Think of the F-ratio as a signal-to-noise ratio:
- Signal: How much do the group means differ from the overall mean? (Between-group variance)
- Noise: How much do individual scores vary around their own group mean? (Within-group variance)
If the models all have the same true mean, differences between group means are just sampling noise. The F-ratio will be near 1. If some model is genuinely better, the between-group variance inflates — F grows large.
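A small simulation can illustrate this behaviour. The sketch below draws 3 groups of 5 scores from normal distributions and averages the F-ratio over many repetitions. The true means, the standard deviation of 0.01, and the function names are all made-up values for illustration, not estimates from the dataset in this post.

```python
import random

def f_ratio(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def average_f(true_means, sd=0.01, n=5, reps=2000, seed=0):
    """Average F over many simulated experiments with the given true means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sd) for _ in range(n)] for mu in true_means]
        total += f_ratio(groups)
    return total / reps

null_f = average_f([0.80, 0.80, 0.80])  # hypothetical: no real difference
alt_f = average_f([0.78, 0.84, 0.80])   # hypothetical: one model clearly ahead
print(f"average F, equal true means:   {null_f:.2f}")
print(f"average F, unequal true means: {alt_f:.2f}")
```

With equal true means the average lands a little above 1, since an F distribution with 12 denominator degrees of freedom has mean 12/10 = 1.2. With unequal true means it is far larger.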
The Dataset
Three model architectures evaluated on the same 5 random seeds:
| Seed | CNN | ResNet | Transformer |
|---|---|---|---|
| 1 | 0.782 | 0.831 | 0.801 |
| 2 | 0.791 | 0.849 | 0.813 |
| 3 | 0.778 | 0.822 | 0.807 |
| 4 | 0.784 | 0.837 | 0.818 |
| 5 | 0.790 | 0.841 | 0.796 |
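Group means: $\bar{x}_{\text{CNN}} = 0.785$, $\bar{x}_{\text{ResNet}} = 0.836$, $\bar{x}_{\text{Transformer}} = 0.807$
Grand mean: $\bar{x} = 0.8093$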
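The means can be checked directly from the table. This is a minimal sketch using only the standard library; the dictionary layout and variable names are a choice made for this example.

```python
from statistics import mean

scores = {
    "CNN":         [0.782, 0.791, 0.778, 0.784, 0.790],
    "ResNet":      [0.831, 0.849, 0.822, 0.837, 0.841],
    "Transformer": [0.801, 0.813, 0.807, 0.818, 0.796],
}

# Mean of each architecture's five scores
group_means = {name: mean(vals) for name, vals in scores.items()}
# Mean of all 15 scores pooled together
grand_mean = mean(x for vals in scores.values() for x in vals)

for name, m in group_means.items():
    print(f"{name:<12} mean = {m:.3f}")
print(f"{'Grand':<12} mean = {grand_mean:.4f}")
```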
Note: this analysis treats the 15 scores as independent observations, one per seed-model combination. If the shared seeds are treated as blocks (a paired design), repeated measures ANOVA (post 16) is more powerful and more appropriate.
ANOVA Hypotheses
$H_0$: $\mu_{\text{CNN}} = \mu_{\text{ResNet}} = \mu_{\text{Transformer}}$ (all architectures perform equally)
$H_1$: At least one architecture differs
Computing SS Between Groups (Step by Step)
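Between-group SS measures how far each group mean is from the grand mean, weighted by group size:
$$SS_B = \sum_{j} n_j (\bar{x}_j - \bar{x})^2$$
$$SS_B = 5\left[(0.785 - 0.8093)^2 + (0.836 - 0.8093)^2 + (0.807 - 0.8093)^2\right] = 5(0.000592 + 0.000711 + 0.000005) = 0.006543$$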
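The same quantity can be sketched in a few lines of plain Python, printing each group's contribution so the dominant terms are visible. Variable names are illustrative.

```python
scores = {
    "CNN":         [0.782, 0.791, 0.778, 0.784, 0.790],
    "ResNet":      [0.831, 0.849, 0.822, 0.837, 0.841],
    "Transformer": [0.801, 0.813, 0.807, 0.818, 0.796],
}

all_scores = [x for vals in scores.values() for x in vals]
grand_mean = sum(all_scores) / len(all_scores)

ss_between = 0.0
for name, vals in scores.items():
    group_mean = sum(vals) / len(vals)
    # Each group contributes n_j * (group mean - grand mean)^2
    contribution = len(vals) * (group_mean - grand_mean) ** 2
    ss_between += contribution
    print(f"{name:<12} contribution = {contribution:.6f}")

print(f"SS_between = {ss_between:.6f}")
```

The Transformer mean sits almost on the grand mean, so nearly all of the between-group sum of squares comes from the CNN and the ResNet.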
Computing SS Within Groups (Step by Step)
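Within-group SS measures variability of individual scores around their group mean:
$$SS_W = \sum_{j} \sum_{i} (x_{ij} - \bar{x}_j)^2$$
CNN: $(-0.003)^2 + (0.006)^2 + (-0.007)^2 + (-0.001)^2 + (0.005)^2 = 0.000120$
ResNet: $(-0.005)^2 + (0.013)^2 + (-0.014)^2 + (0.001)^2 + (0.005)^2 = 0.000416$
Transformer: $(-0.006)^2 + (0.006)^2 + (0.000)^2 + (0.011)^2 + (-0.011)^2 = 0.000314$
$$SS_W = 0.000120 + 0.000416 + 0.000314 = 0.000850$$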
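A minimal sketch of the within-group computation, again in plain Python with illustrative names, sums squared deviations inside each group and then adds the three group totals.

```python
scores = {
    "CNN":         [0.782, 0.791, 0.778, 0.784, 0.790],
    "ResNet":      [0.831, 0.849, 0.822, 0.837, 0.841],
    "Transformer": [0.801, 0.813, 0.807, 0.818, 0.796],
}

ss_within = 0.0
for name, vals in scores.items():
    group_mean = sum(vals) / len(vals)
    # Squared deviations of each score from its own group's mean
    group_ss = sum((x - group_mean) ** 2 for x in vals)
    ss_within += group_ss
    print(f"{name:<12} SS = {group_ss:.6f}")

print(f"SS_within = {ss_within:.6f}")
```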
ANOVA Table
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between | 0.006543 | 2 | 0.003272 | 46.2 |
| Within | 0.000850 | 12 | 0.0000708 | |
| Total | 0.007393 | 14 | | |
With $df_1 = 2$, $df_2 = 12$: the critical value $F_{0.05}(2, 12) = 3.89$.
Since $F = 46.2 > 3.89$: reject $H_0$. The three architectures differ significantly in F1 performance.
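| Phase | Formula | Values | Result |
|---|---|---|---|
| $SS_B$ | $\sum_j n_j (\bar{x}_j - \bar{x})^2$ | $5(0.000592 + 0.000711 + 0.000005)$ | 0.006543 |
| $SS_W$ | $\sum_j \sum_i (x_{ij} - \bar{x}_j)^2$, sum across all 15 observations | $0.000120 + 0.000416 + 0.000314$ | 0.000850 |
| $MS_B$ | $SS_B / (k - 1)$ | $0.006543 / 2$ | 0.003272 |
| $MS_W$ | $SS_W / (N - k)$ | $0.000850 / 12$ | 0.0000708 |
| F | $MS_B / MS_W$ | $0.003272 / 0.0000708$ | 46.2 |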
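The whole decision can be sketched without a stats library. For the special case of 2 numerator degrees of freedom, the tail probability of the F distribution has a simple closed form, $P(F > x) = (1 + 2x / df_2)^{-df_2 / 2}$, which gives both the critical value and the p-value. This shortcut is specific to $df_1 = 2$; for other designs a library routine such as `scipy.stats.f.sf` is the safer choice.

```python
groups = [
    [0.782, 0.791, 0.778, 0.784, 0.790],  # CNN
    [0.831, 0.849, 0.822, 0.837, 0.841],  # ResNet
    [0.801, 0.813, 0.807, 0.818, 0.796],  # Transformer
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand = sum(sum(g) for g in groups) / n_total
means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

df_between, df_within = k - 1, n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Closed form below is valid only when the numerator df equals 2
assert df_between == 2
alpha = 0.05
# Solve (1 + 2x/df2)^(-df2/2) = alpha for x to get the critical value
f_crit = (df_within / 2) * (alpha ** (-2 / df_within) - 1)
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
print(f"critical value at alpha={alpha}: {f_crit:.2f}")
print(f"p-value: {p_value:.1e}")
print(f"reject H0: {f_stat > f_crit}")
```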
Effect Size: Eta-Squared
$$\eta^2 = \frac{SS_B}{SS_T} = \frac{0.006543}{0.007393} = 0.885$$
88.5% of the total variance in F1 scores is explained by architecture choice. This is a very large effect: architecture matters enormously here.
| $\eta^2$ | Interpretation |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |
Post-Hoc Tests: Tukey HSD
ANOVA rejection tells you "at least one architecture differs" — not which ones. Tukey's Honestly Significant Difference test controls the family-wise error rate across all pairwise comparisons.
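For each pair, compute the HSD threshold:
$$HSD = q_{\alpha, k, df_W} \sqrt{\frac{MS_W}{n}}$$
With $q_{0.05, 3, 12} = 3.77$ (studentized range statistic) and $MS_W = 0.0000708$, $n = 5$:
$$HSD = 3.77 \sqrt{\frac{0.0000708}{5}} = 3.77 \times 0.00376 = 0.0142$$
Pairwise differences:
- $|\bar{x}_{\text{CNN}} - \bar{x}_{\text{ResNet}}| = 0.051 > 0.0142$: significant
- $|\bar{x}_{\text{CNN}} - \bar{x}_{\text{Transformer}}| = 0.022 > 0.0142$: significant
- $|\bar{x}_{\text{ResNet}} - \bar{x}_{\text{Transformer}}| = 0.029 > 0.0142$: significant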
All three architectures are significantly different from each other.
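The threshold comparison can be sketched by hand as follows. The studentized range value `Q_CRIT = 3.77` is a hard-coded table lookup for $\alpha = 0.05$, 3 groups, and 12 within-group degrees of freedom, so this sketch only applies to that configuration. Names are illustrative.

```python
from itertools import combinations
from math import sqrt

scores = {
    "CNN":         [0.782, 0.791, 0.778, 0.784, 0.790],
    "ResNet":      [0.831, 0.849, 0.822, 0.837, 0.841],
    "Transformer": [0.801, 0.813, 0.807, 0.818, 0.796],
}

Q_CRIT = 3.77  # assumption: table value for alpha=0.05, k=3, df=12
n = 5          # observations per group (equal group sizes)
k = len(scores)

means = {name: sum(v) / len(v) for name, v in scores.items()}
ss_within = sum((x - means[name]) ** 2 for name, v in scores.items() for x in v)
ms_within = ss_within / (k * n - k)

# Smallest absolute difference in means that counts as significant
hsd = Q_CRIT * sqrt(ms_within / n)
print(f"HSD threshold = {hsd:.4f}")

significant = {}
for a, b in combinations(scores, 2):
    diff = abs(means[a] - means[b])
    significant[(a, b)] = diff > hsd
    verdict = "significant" if diff > hsd else "not significant"
    print(f"{a} vs {b}: |diff| = {diff:.3f} -> {verdict}")
```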
Non-Parametric Alternative: Kruskal-Wallis
When normality is violated, use the Kruskal-Wallis test — the non-parametric equivalent of one-way ANOVA. It ranks all observations jointly and tests whether ranks are distributed similarly across groups.
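To see what the joint ranking does, here is a minimal sketch of the H statistic computed by hand. It assumes there are no tied scores, which holds for this dataset; tied values would need average ranks and a tie correction, which this sketch does not implement.

```python
groups = {
    "CNN":         [0.782, 0.791, 0.778, 0.784, 0.790],
    "ResNet":      [0.831, 0.849, 0.822, 0.837, 0.841],
    "Transformer": [0.801, 0.813, 0.807, 0.818, 0.796],
}

# Rank all 15 scores jointly (1 = smallest)
pooled = sorted(x for vals in groups.values() for x in vals)
assert len(set(pooled)) == len(pooled), "ties present: use average ranks instead"
rank_of = {x: r for r, x in enumerate(pooled, start=1)}

n_total = len(pooled)
rank_sums = {name: sum(rank_of[x] for x in vals) for name, vals in groups.items()}

# H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1)
h_stat = (12 / (n_total * (n_total + 1))
          * sum(rs ** 2 / len(groups[name]) for name, rs in rank_sums.items())
          - 3 * (n_total + 1))

print("rank sums:", rank_sums)
print(f"H = {h_stat:.4f}")
```

In this dataset the three groups do not overlap at all: every CNN score ranks below every Transformer score, which in turn ranks below every ResNet score. The test only sees that ordering, not how far apart the scores are.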
```python
import numpy as np
from scipy import stats

# F1 scores for each architecture across 5 seeds
cnn_scores = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet_scores = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans_scores = np.array([0.801, 0.813, 0.807, 0.818, 0.796])
groups = [cnn_scores, resnet_scores, trans_scores]

# One-way ANOVA
F_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={F_stat:.4f}, p={p_value:.6f}")

# Eta-squared = SS_between / SS_total
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = np.sum((all_scores - grand_mean) ** 2)
eta_sq = ss_between / ss_total
print(f"Eta-squared: {eta_sq:.4f}")

# Tukey HSD post-hoc test (requires SciPy 1.8 or later)
result = stats.tukey_hsd(*groups)
print("\nTukey HSD p-values:")
labels = ["CNN", "ResNet", "Transformer"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"  {labels[i]} vs {labels[j]}: p={result.pvalue[i][j]:.4f}")

# Kruskal-Wallis (non-parametric alternative)
H_stat, p_kruskal = stats.kruskal(*groups)
print(f"\nKruskal-Wallis: H={H_stat:.4f}, p={p_kruskal:.6f}")
```

```
ANOVA: F=46.1882, p=0.000002
Eta-squared: 0.8850

Tukey HSD p-values:
  CNN vs ResNet: p=0.0001
  CNN vs Transformer: p=0.0013
  ResNet vs Transformer: p=0.0016

Kruskal-Wallis: H=12.5000, p=0.001930
```
When ANOVA Does Not Apply
- Unequal variances: Use Welch's ANOVA (adjusts degrees of freedom)
- Non-normal residuals with small $n$: Use Kruskal-Wallis
- Same subjects across groups: Use repeated measures ANOVA (post 16)
- Very small samples: Check power carefully before interpreting results
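For the unequal-variance case, a minimal sketch of Welch's ANOVA is shown below. It follows the textbook Welch formula, weighting each group by $n_j / s_j^2$ and adjusting the denominator degrees of freedom. The function name `welch_anova` is illustrative, and the result has not been checked against a library implementation here.

```python
from statistics import mean, variance

def welch_anova(groups):
    """Return Welch's F statistic with its numerator and denominator df."""
    k = len(groups)
    ns = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_j / s_j^2 (sample variance, n - 1 denominator)
    weights = [n / variance(g) for n, g in zip(ns, groups)]
    w_total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    numerator = sum(w * (m - weighted_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / w_total) ** 2 / (n - 1) for w, n in zip(weights, ns))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # generally not an integer
    return numerator / denominator, df1, df2

groups = [
    [0.782, 0.791, 0.778, 0.784, 0.790],  # CNN
    [0.831, 0.849, 0.822, 0.837, 0.841],  # ResNet
    [0.801, 0.813, 0.807, 0.818, 0.796],  # Transformer
]
f_welch, df1, df2 = welch_anova(groups)
print(f"Welch F = {f_welch:.2f} on ({df1}, {df2:.2f}) degrees of freedom")
```

The denominator degrees of freedom come out below the 12 used by the classical test, which is the price paid for not pooling the variances.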
Related Concepts
ANOVA extends the two-sample t-test (post 7) to $k > 2$ groups by reframing the comparison in terms of variance. The F-statistic is the ratio of two independent chi-square variables, each divided by its degrees of freedom, which connects it to post 12's chi-square framework. ANOVA's assumptions (normality, homoscedasticity, independence) are examined in detail in post 15. The different ANOVA designs (two-way, repeated measures, mixed) are covered in post 16. The multiple comparison problem that motivates ANOVA over repeated t-tests is the same problem addressed by Bonferroni and FDR corrections in post 9.
Honest Limitations
One-way ANOVA is a global test: rejecting $H_0$ only tells you that at least one group differs, not which one or by how much. Post-hoc tests like Tukey HSD answer the "which ones" question, but the family-wise correction costs some power relative to a single unadjusted comparison. Eta-squared is biased upward for small samples; omega-squared ($\omega^2$) is a less biased alternative. And with only 5 seeds per architecture, this ANOVA has limited power to detect small effects; the test reaches significance here only because the architecture differences are very large relative to the seed-to-seed noise.
Test Your Understanding
- You add a fourth architecture (Vision Transformer) with scores . Compute for all four groups with the new grand mean and explain how df changes.
- The ANOVA F-ratio is described as a signal-to-noise ratio. What constitutes "signal" and what constitutes "noise" for the architecture comparison dataset? What would push F toward 1?
- Tukey HSD found all three architectures significantly different. What additional information would a product team need beyond "significantly different" to decide which architecture to deploy?
- The Kruskal-Wallis test gave $p \approx 0.002$. The parametric ANOVA gave $p < 0.00001$. Why are these so different despite testing the same data?
- Eta-squared for this dataset was 0.885. Without computing omega-squared, explain qualitatively whether you expect omega-squared to be higher or lower than 0.885 for these data, and why.