ANOVA

Apr 11, 2026 · 7 min read · By mohammed.vasim
Statistics · Math · Data Science

You have three model architectures — a baseline CNN, a ResNet variant, and a transformer — each evaluated on 5 random seeds. You want to know whether they truly differ in performance. Running three pairwise t-tests would work mechanically, but with 5 architectures it becomes 10 tests, and with 10 it becomes 45. Each test carries a 5% false positive risk, so running many of them inflates your chance of declaring a spurious winner. ANOVA solves this by testing all groups simultaneously with a single test statistic.
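To see how quickly the risk grows, here is a small sketch that counts the pairwise tests for a given number of groups and estimates the chance of at least one false positive. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so treat the numbers as a rough guide. The helper name `familywise_error` is only for illustration.

```python
from math import comb

def familywise_error(k_groups, alpha=0.05):
    """Return (number of pairwise tests, P(at least one false positive)).

    Assumes independent tests, which is only an approximation for
    pairwise comparisons that reuse the same groups.
    """
    m = comb(k_groups, 2)              # number of distinct pairs
    return m, 1 - (1 - alpha) ** m

for k in (3, 5, 10):
    m, fwer = familywise_error(k)
    print(f"{k} groups: {m} pairwise tests, family-wise error ≈ {fwer:.3f}")
```

Even with only three architectures, the chance of at least one spurious "winner" is already close to three times the nominal 5%.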

The Core Idea: Signal to Noise

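ANOVA does not compare means directly. It compares variances. Specifically:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / df_{\text{between}}}{SS_{\text{within}} / df_{\text{within}}}$$

Think of the F-ratio as a signal-to-noise ratio: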

  • Signal: How much do the group means differ from the overall mean? (Between-group variance)
  • Noise: How much do individual scores vary around their own group mean? (Within-group variance)

If the models all have the same true mean, differences between group means are just sampling noise. The F-ratio will be near 1. If some model is genuinely better, the between-group variance inflates — F grows large.
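A quick simulation can make this concrete. The sketch below draws three groups of five scores many times, first with identical true means and then with one group shifted upward, and compares the resulting F values. The true mean (0.80), noise level (0.01), and shift (0.03) are made-up values chosen only to resemble the scale of the data in this post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def simulate_f(shift, n_sims=2000):
    """F values from simulated experiments: 3 groups of 5, third group's mean shifted."""
    f_values = np.empty(n_sims)
    for i in range(n_sims):
        a = rng.normal(0.80, 0.01, size=5)
        b = rng.normal(0.80, 0.01, size=5)
        c = rng.normal(0.80 + shift, 0.01, size=5)
        f_values[i] = stats.f_oneway(a, b, c).statistic
    return f_values

f_null = simulate_f(shift=0.0)     # all true means equal
f_effect = simulate_f(shift=0.03)  # one architecture genuinely better

print(f"No true difference: mean F = {f_null.mean():.2f}, median F = {np.median(f_null):.2f}")
print(f"One group shifted:  mean F = {f_effect.mean():.2f}, median F = {np.median(f_effect):.2f}")
```

Under the null, F hovers around 1 and only occasionally strays far above it; with a real shift, large F values become the norm.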

The Dataset

Three model architectures evaluated on the same 5 random seeds:

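| Seed | CNN   | ResNet | Transformer |
|------|-------|--------|-------------|
| 1    | 0.782 | 0.831  | 0.801       |
| 2    | 0.791 | 0.849  | 0.813       |
| 3    | 0.778 | 0.822  | 0.807       |
| 4    | 0.784 | 0.837  | 0.818       |
| 5    | 0.790 | 0.841  | 0.796       |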

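Group means: $\bar{x}_{\text{CNN}} = 0.785$, $\bar{x}_{\text{ResNet}} = 0.836$, $\bar{x}_{\text{Transformer}} = 0.807$

Grand mean: $\bar{x} = 12.140 / 15 \approx 0.8093$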

Note: this analysis treats the 15 scores as independent observations, one per seed-model combination. If the same seeds are reused across architectures (a paired, or blocked, design), repeated measures ANOVA (post 16) is more powerful and more appropriate.

ANOVA Hypotheses

$H_0: \mu_{\text{CNN}} = \mu_{\text{ResNet}} = \mu_{\text{Transformer}}$ (all architectures perform equally)

$H_1$: At least one architecture differs

Computing SS Between Groups (Step by Step)

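Between-group SS measures how far each group mean is from the grand mean, weighted by group size ($n_j = 5$ seeds per architecture):

$$SS_{\text{between}} = \sum_{j} n_j (\bar{x}_j - \bar{x})^2$$

$$SS_{\text{between}} = 5\left[(0.785 - 0.8093)^2 + (0.836 - 0.8093)^2 + (0.807 - 0.8093)^2\right] = 5\,(0.0005921 + 0.0007111 + 0.0000054) \approx 0.006543$$

[Figure: F1 scores by architecture on an axis from 0.77 to 0.85, with the group means (CNN 0.785, ResNet 0.836, Transformer 0.807) and the grand mean (0.809). The spread of the group means around the grand mean is SS_Between; the scatter of points within each group is SS_Within.]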
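The same calculation can be sketched in NumPy, which also shows how much each architecture contributes to the between-group sum of squares. The dictionary layout and variable names are just one convenient way to organize the scores.

```python
import numpy as np

# F1 scores from the dataset table (5 seeds per architecture)
scores = {
    "CNN":         np.array([0.782, 0.791, 0.778, 0.784, 0.790]),
    "ResNet":      np.array([0.831, 0.849, 0.822, 0.837, 0.841]),
    "Transformer": np.array([0.801, 0.813, 0.807, 0.818, 0.796]),
}

grand_mean = np.concatenate(list(scores.values())).mean()

# Each group's contribution: n_j * (group mean - grand mean)^2
contributions = {
    name: len(x) * (x.mean() - grand_mean) ** 2 for name, x in scores.items()
}
ss_between = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:<12} contributes {value:.6f}")
print(f"SS_between = {ss_between:.6f}")
```

The Transformer mean sits almost on the grand mean, so nearly all of the between-group variation comes from CNN (below) and ResNet (above).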

Computing SS Within Groups (Step by Step)

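Within-group SS measures variability of individual scores around their group mean:

$$SS_{\text{within}} = \sum_{j} \sum_{i} (x_{ij} - \bar{x}_j)^2$$

CNN: $(-0.003)^2 + (0.006)^2 + (-0.007)^2 + (-0.001)^2 + (0.005)^2 = 0.000120$

ResNet: $(-0.005)^2 + (0.013)^2 + (-0.014)^2 + (0.001)^2 + (0.005)^2 = 0.000416$

Transformer: $(-0.006)^2 + (0.006)^2 + (0.000)^2 + (0.011)^2 + (-0.011)^2 = 0.000314$

$$SS_{\text{within}} = 0.000120 + 0.000416 + 0.000314 = 0.000850$$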
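As a cross-check, here is a minimal NumPy sketch of the within-group sums of squares. It also verifies the identity that the total sum of squares splits exactly into the between and within parts; variable names are illustrative.

```python
import numpy as np

scores = {
    "CNN":         np.array([0.782, 0.791, 0.778, 0.784, 0.790]),
    "ResNet":      np.array([0.831, 0.849, 0.822, 0.837, 0.841]),
    "Transformer": np.array([0.801, 0.813, 0.807, 0.818, 0.796]),
}

# Squared deviations of each score from its own group mean
ss_within_by_group = {
    name: float(np.sum((x - x.mean()) ** 2)) for name, x in scores.items()
}
ss_within = sum(ss_within_by_group.values())

# Sanity check: SS_total should equal SS_between + SS_within
all_scores = np.concatenate(list(scores.values()))
grand_mean = all_scores.mean()
ss_total = float(np.sum((all_scores - grand_mean) ** 2))
ss_between = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in scores.values())

for name, value in ss_within_by_group.items():
    print(f"{name:<12} SS = {value:.6f}")
print(f"SS_within = {ss_within:.6f}")
print(f"SS_total  = {ss_total:.6f} (between + within = {ss_between + ss_within:.6f})")
```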

ANOVA Table

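| Source  | SS       | df | MS         | F     |
|---------|----------|----|------------|-------|
| Between | 0.006543 | 2  | 0.0032717  | 46.19 |
| Within  | 0.000850 | 12 | 0.00007083 |       |
| Total   | 0.007393 | 14 |            |       |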

With $df_1 = 2$, $df_2 = 12$: the critical value $F_{0.05}(2, 12) = 3.89$.

Since $F = 46.19 > 3.89$: reject $H_0$. The three architectures differ significantly in F1 performance.

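| Phase | Formula | Values | Result |
|-------|---------|--------|--------|
| $SS_{\text{between}}$ | $\sum_j n_j (\bar{x}_j - \bar{x})^2$ | $5\,[(-0.0243)^2 + (0.0267)^2 + (-0.0023)^2]$ | 0.006543 |
| $SS_{\text{within}}$ | $\sum_j \sum_i (x_{ij} - \bar{x}_j)^2$ | sum across all 15 observations | 0.000850 |
| F | $\dfrac{SS_{\text{between}} / df_{\text{between}}}{SS_{\text{within}} / df_{\text{within}}}$ | $\dfrac{0.006543 / 2}{0.000850 / 12}$ | 46.19 |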
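The decision rule can be checked numerically. This sketch rebuilds F from the raw scores, then looks up the 5% critical value and the p-value from SciPy's F distribution (`stats.f.ppf` and `stats.f.sf`); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([0.782, 0.791, 0.778, 0.784, 0.790]),  # CNN
    np.array([0.831, 0.849, 0.822, 0.837, 0.841]),  # ResNet
    np.array([0.801, 0.813, 0.807, 0.818, 0.796]),  # Transformer
]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between, df_within = k - 1, n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)  # critical value
p_value = stats.f.sf(f_stat, df_between, df_within)     # upper-tail probability

print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, p = {p_value:.2e}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```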

Effect Size: Eta-Squared

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} = \frac{0.006543}{0.007393} = 0.885$$

88.5% of the total variance in F1 scores is explained by architecture choice. This is a very large effect: architecture matters enormously here.

| $\eta^2$ | Interpretation |
|----------|----------------|
| 0.01     | Small          |
| 0.06     | Medium         |
| 0.14     | Large          |

Post-Hoc Tests: Tukey HSD

ANOVA rejection tells you "at least one architecture differs" — not which ones. Tukey's Honestly Significant Difference test controls the family-wise error rate across all pairwise comparisons.

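For each pair, compute the HSD threshold:

$$HSD = q_{\alpha, k, df_{\text{within}}} \sqrt{\frac{MS_{\text{within}}}{n}}$$

With $q_{0.05, 3, 12} = 3.77$ (studentized range statistic) and $MS_{\text{within}} = 0.0000708$, $n = 5$:

$$HSD = 3.77 \times \sqrt{\frac{0.0000708}{5}} = 3.77 \times 0.00376 = 0.0142$$

Pairwise differences:

  • $|\bar{x}_{\text{CNN}} - \bar{x}_{\text{ResNet}}| = 0.051 > 0.0142$: significant
  • $|\bar{x}_{\text{CNN}} - \bar{x}_{\text{Transformer}}| = 0.022 > 0.0142$: significant
  • $|\bar{x}_{\text{ResNet}} - \bar{x}_{\text{Transformer}}| = 0.029 > 0.0142$: significant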

All three architectures are significantly different from each other.
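Here is a minimal sketch of the threshold calculation done by hand. It assumes a SciPy version that provides `scipy.stats.studentized_range` (added in SciPy 1.7) and equal group sizes; the variable names are illustrative.

```python
from itertools import combinations

import numpy as np
from scipy import stats

scores = {
    "CNN":         np.array([0.782, 0.791, 0.778, 0.784, 0.790]),
    "ResNet":      np.array([0.831, 0.849, 0.822, 0.837, 0.841]),
    "Transformer": np.array([0.801, 0.813, 0.807, 0.818, 0.796]),
}
k = len(scores)
n = 5                                   # seeds per architecture (equal group sizes)
df_within = k * n - k
ms_within = sum(((x - x.mean()) ** 2).sum() for x in scores.values()) / df_within

# Critical value of the studentized range at alpha = 0.05
q_crit = stats.studentized_range.ppf(0.95, k, df_within)
hsd = q_crit * np.sqrt(ms_within / n)
print(f"q = {q_crit:.3f}, HSD threshold = {hsd:.4f}")

# Compare every absolute mean difference against the threshold
diffs = {}
for a, b in combinations(scores, 2):
    diffs[(a, b)] = abs(scores[a].mean() - scores[b].mean())
    verdict = "significant" if diffs[(a, b)] > hsd else "not significant"
    print(f"{a} vs {b}: |diff| = {diffs[(a, b)]:.3f} -> {verdict}")
```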

Non-Parametric Alternative: Kruskal-Wallis

When normality is violated, use the Kruskal-Wallis test — the non-parametric equivalent of one-way ANOVA. It ranks all observations jointly and tests whether ranks are distributed similarly across groups.

```python
import numpy as np
from scipy import stats

# F1 scores for each architecture across 5 seeds
cnn_scores    = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet_scores = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans_scores  = np.array([0.801, 0.813, 0.807, 0.818, 0.796])
groups = [cnn_scores, resnet_scores, trans_scores]

# One-way ANOVA
F_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={F_stat:.4f}, p={p_value:.6f}")

# Eta-squared = SS_between / SS_total
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = np.sum((all_scores - grand_mean) ** 2)
eta_sq = ss_between / ss_total
print(f"Eta-squared: {eta_sq:.4f}")

# Tukey HSD post-hoc test (requires SciPy >= 1.8)
result = stats.tukey_hsd(*groups)
print("\nTukey HSD p-values:")
labels = ["CNN", "ResNet", "Transformer"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"  {labels[i]} vs {labels[j]}: p={result.pvalue[i, j]:.4f}")

# Kruskal-Wallis (non-parametric alternative)
H_stat, p_kruskal = stats.kruskal(*groups)
print(f"\nKruskal-Wallis: H={H_stat:.4f}, p={p_kruskal:.6f}")
```

```
ANOVA: F=46.1882, p=0.000002
Eta-squared: 0.8850

Tukey HSD p-values:
  CNN vs ResNet: p=0.0001
  CNN vs Transformer: p=0.0013
  ResNet vs Transformer: p=0.0016

Kruskal-Wallis: H=12.5000, p=0.001930
```
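To see what Kruskal-Wallis actually looks at, the sketch below computes H by hand from the joint ranks. In this dataset the three groups do not overlap at all: every CNN score is below every Transformer score, which in turn is below every ResNet score. The formula used here has no tie correction, which is fine because no two scores are equal; variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = {
    "CNN":         np.array([0.782, 0.791, 0.778, 0.784, 0.790]),
    "ResNet":      np.array([0.831, 0.849, 0.822, 0.837, 0.841]),
    "Transformer": np.array([0.801, 0.813, 0.807, 0.818, 0.796]),
}

all_scores = np.concatenate(list(groups.values()))
ranks = stats.rankdata(all_scores)      # rank 1 = lowest score
n_total = len(all_scores)

# Sum the joint ranks that fall in each group
rank_sums = {}
start = 0
for name, x in groups.items():
    rank_sums[name] = ranks[start:start + len(x)].sum()
    start += len(x)

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), valid when there are no ties
h_stat = 12 / (n_total * (n_total + 1)) * sum(
    r ** 2 / len(groups[name]) for name, r in rank_sums.items()
) - 3 * (n_total + 1)
p_value = stats.chi2.sf(h_stat, df=len(groups) - 1)

print("Rank sums:", rank_sums)
print(f"H = {h_stat:.4f}, p = {p_value:.6f}")
```

Because ranks discard the size of the gaps between scores, H cannot grow any further once the groups are fully separated, whereas F keeps growing as the means move apart.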

When ANOVA Does Not Apply

  • Unequal variances: Use Welch's ANOVA (adjusts degrees of freedom)
  • Non-normal residuals with small $n$: Use Kruskal-Wallis
  • Same subjects across groups: Use repeated measures ANOVA (post 16)
  • Very small samples: Check power carefully before interpreting results
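For the unequal-variance case, here is a rough sketch of Welch's ANOVA written out by hand. SciPy does not, to my knowledge, ship a function under this name, so `welch_anova` below is a hypothetical helper implementing the standard Welch formula: each group mean is weighted by $n_j / s_j^2$, and the denominator degrees of freedom are adjusted downward.

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA; returns (F, df1, df2, p).

    Illustrative helper, not a SciPy function.
    """
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                       # precision weights
    w_sum = w.sum()
    weighted_mean = (w * means).sum() / w_sum

    numerator = (w * (means - weighted_mean) ** 2).sum() / (k - 1)
    lam = (((1 - w / w_sum) ** 2) / (n - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)          # adjusted (usually fractional) df
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value

cnn    = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
trans  = np.array([0.801, 0.813, 0.807, 0.818, 0.796])

f_w, df1, df2, p_w = welch_anova(cnn, resnet, trans)
print(f"Welch ANOVA: F = {f_w:.2f}, df = ({df1}, {df2:.2f}), p = {p_w:.6f}")
```

Note that the second degrees of freedom drop below the 12 used by the classic test, which is the price paid for not assuming equal variances.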

ANOVA extends the two-sample t-test (post 7) to $k$ groups by reframing the comparison in terms of variance. Under $H_0$, the F-statistic is the ratio of two independent chi-square variables, each divided by its degrees of freedom, which connects it to post 12's chi-square framework. ANOVA's assumptions (normality, homoscedasticity, independence) are examined in detail in post 15. The different ANOVA designs (two-way, repeated measures, mixed) are covered in post 16. The multiple comparison problem that motivates ANOVA over repeated t-tests is the same problem addressed by Bonferroni and FDR corrections in post 9.

Honest Limitations

One-way ANOVA is a global test: rejecting $H_0$ only tells you that at least one group differs, not which one or by how much. Post-hoc tests like Tukey HSD answer the "which ones" question, but correcting for multiple comparisons costs some power. Eta-squared is biased upward for small samples; omega-squared ($\omega^2$) is a less biased alternative. And with only 5 seeds per architecture, this ANOVA has limited power to detect small effects. It reaches significance here only because the differences between architectures are very large relative to the seed-to-seed noise.
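For reference, a minimal sketch of omega-squared next to eta-squared, using the common one-way formula $\omega^2 = (SS_{\text{between}} - df_{\text{between}} \cdot MS_{\text{within}}) / (SS_{\text{total}} + MS_{\text{within}})$. Variable names are illustrative.

```python
import numpy as np

groups = [
    np.array([0.782, 0.791, 0.778, 0.784, 0.790]),  # CNN
    np.array([0.831, 0.849, 0.822, 0.837, 0.841]),  # ResNet
    np.array([0.801, 0.813, 0.807, 0.818, 0.796]),  # Transformer
]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ss_between + ss_within

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
ms_within = ss_within / df_within

eta_sq = ss_between / ss_total
# Omega-squared subtracts the between-group variation expected from noise alone
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(f"eta-squared   = {eta_sq:.4f}")
print(f"omega-squared = {omega_sq:.4f}")
```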

Test Your Understanding

  1. You add a fourth architecture (Vision Transformer) with scores . Compute for all four groups with the new grand mean and explain how df changes.
  2. The ANOVA F-ratio is described as a signal-to-noise ratio. What constitutes "signal" and what constitutes "noise" for the architecture comparison dataset? What would push F toward 1?
  3. Tukey HSD found all three architectures significantly different. What additional information would a product team need beyond "significantly different" to decide which architecture to deploy?
  4. The Kruskal-Wallis test gave $H \approx 12.5$, $p \approx 0.002$. The parametric ANOVA gave $F \approx 46.2$, $p < 0.0001$. Why are these so different despite testing the same data?
  5. Eta-squared for this dataset was 0.885. Without computing omega-squared, explain qualitatively whether you expect omega-squared to be higher or lower than 0.885 for these data, and why.
