
Types of ANOVA

Apr 11, 2026 • 10 min read • By Mohammed Vasim

Statistics, Math, Data Science

You ran a one-way ANOVA comparing three model architectures, got a significant F, and felt good — then a reviewer asks: "Did you consider the interaction between architecture and dataset type? And did you account for the fact that the same folds were used across architectures?" Those two questions correspond to two separate ANOVA designs. Picking the wrong one doesn't just waste power — it can miss the finding entirely.

Taxonomy

Factor: an independent variable that defines groups. In ML: model architecture, dataset type, hyperparameter setting.

Levels: the categories of a factor. Architecture has 3 levels: Transformer, LSTM, CNN.

Between-subjects: different experimental units in each group (one-way ANOVA — different folds per architecture).

Within-subjects: the SAME experimental units contribute to all groups (repeated-measures — same folds evaluated on all architectures).

Fixed effects: factor levels are deliberately chosen. You want conclusions about exactly these levels (e.g., these specific architectures).

Random effects: factor levels are a random sample from a larger population (e.g., random fold splits). You want to generalize beyond the specific levels tested.

One-Way ANOVA

When: one factor, multiple levels, between-subjects design.

ML use case: comparing k architectures evaluated on separate (non-overlapping) folds.

Anchor (one-way)

```python
model_A = [0.821, 0.847, 0.835, 0.812, 0.859, 0.828]  # Transformer, x̄=0.834
model_B = [0.791, 0.803, 0.789, 0.812, 0.798, 0.785]  # LSTM, x̄=0.796
model_C = [0.863, 0.879, 0.855, 0.871, 0.868, 0.852]  # CNN, x̄=0.865
# k=3 groups, n=6 per group, N=18 total
```

ANOVA Table (computed)

Grand mean x̄ = 14.968/18 = 0.8316

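SS_between = 6×(0.834−0.832)² + 6×(0.796−0.832)² + 6×(0.865−0.832)² = 6×0.0000045 + 6×0.0012406 + 6×0.0010963 = 0.01405 (the squared deviations are computed from the unrounded means, so they differ slightly from what the rounded values shown would give)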

SS_within = SS_A + SS_B + SS_C = 0.001483 + 0.000503 + 0.000513 = 0.002499

Source	df	SS	MS	F
Between (arch)	2	0.01405	0.007024	42.1
Within (error)	15	0.002499	0.000167	—
Total	17	0.016549	—	—

F(2,15) = 42.1, p < 0.0001. Reject H₀. At least one architecture differs.
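To check the table by hand, here is a minimal sketch that recomputes it from the anchor data using only the standard library. The helper name `one_way_anova` is made up for this post. The p-value line uses a closed form that only holds when the numerator df is 2, as it is here; for the general case, `scipy.stats.f_oneway` is the usual shortcut.

```python
from statistics import fmean

# Anchor data from above: 6 folds per architecture, no fold shared between groups.
groups = {
    "Transformer": [0.821, 0.847, 0.835, 0.812, 0.859, 0.828],
    "LSTM": [0.791, 0.803, 0.789, 0.812, 0.798, 0.785],
    "CNN": [0.863, 0.879, 0.855, 0.871, 0.868, 0.852],
}


def one_way_anova(samples):
    """Return (ss_between, ss_within, df_between, df_within, F) for a list of samples."""
    all_values = [x for s in samples for x in s]
    grand_mean = fmean(all_values)
    # Between: group means around the grand mean, weighted by group size.
    ss_between = sum(len(s) * (fmean(s) - grand_mean) ** 2 for s in samples)
    # Within: each observation around its own group mean.
    ss_within = sum((x - fmean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(groups.values()))

# Upper-tail probability of F(2, df_within). This closed form is valid ONLY
# because df_between == 2: P(F > f) = (1 + 2f/df_within) ** (-df_within / 2)
p_value = (1 + 2 * f_stat / df_w) ** (-df_w / 2)

print(f"SS_between={ss_b:.5f}  SS_within={ss_w:.6f}")
print(f"F({df_b},{df_w})={f_stat:.1f}  p={p_value:.1e}")
```

The printed F should match the 42.1 in the table, with p well below 0.0001.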

Tukey HSD Post-Hoc

HSD = q(0.05, k=3, df=15) × √(MS_within/n) = 3.67 × √(0.000167/6) = 3.67 × 0.00527 = 0.01934

Pair	|x̄ᵢ − x̄ⱼ|	HSD	Significant?
Transformer vs LSTM	0.0373	0.0193	Yes
Transformer vs CNN	0.0310	0.0193	Yes
LSTM vs CNN	0.0683	0.0193	Yes

All three architectures differ significantly.
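The HSD comparison can be sketched the same way. The critical value q = 3.67 is hard-coded from the lookup above rather than computed, and the equal group size n = 6 is taken from the anchor data. In practice you would more likely reach for something like statsmodels' `pairwise_tukeyhsd`; this version just makes the arithmetic visible.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean

groups = {
    "Transformer": [0.821, 0.847, 0.835, 0.812, 0.859, 0.828],
    "LSTM": [0.791, 0.803, 0.789, 0.812, 0.798, 0.785],
    "CNN": [0.863, 0.879, 0.855, 0.871, 0.868, 0.852],
}

Q_CRIT = 3.67  # q(0.05, k=3, df=15), copied from the text, not computed here
n = 6          # observations per group (balanced design)
k = len(groups)

# MS_within is the pooled within-group variance from the ANOVA table.
df_within = k * (n - 1)
ms_within = sum((x - fmean(s)) ** 2 for s in groups.values() for x in s) / df_within

# Any pair of means further apart than hsd is declared significantly different.
hsd = Q_CRIT * sqrt(ms_within / n)

results = {}
for (name_i, s_i), (name_j, s_j) in combinations(groups.items(), 2):
    diff = abs(fmean(s_i) - fmean(s_j))
    results[(name_i, name_j)] = (diff, diff > hsd)
    print(f"{name_i} vs {name_j}: |diff|={diff:.4f}  HSD={hsd:.4f}  significant={diff > hsd}")
```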

Two-Way ANOVA

When: two factors simultaneously, between-subjects. Each factor-level combination is a "cell."

ML use case: accuracy depends on architecture AND dataset type, and you want to know if the architecture effect is consistent across dataset types (interaction).

Factor A: Architecture (Transformer, LSTM, CNN) — 3 levels. Factor B: Dataset type (clean, noisy) — 2 levels.

Cell Means Table

	Clean data	Noisy data	Row mean
Transformer	0.871	0.823	0.847
LSTM	0.809	0.772	0.790
CNN	0.878	0.841	0.860
Column mean	0.853	0.812	0.832

Three Tests in One

1. Main effect of Factor A (architecture): does mean accuracy differ across architectures, averaged over dataset type?

2. Main effect of Factor B (dataset): is clean data significantly better than noisy, averaged over architectures?

3. Interaction effect A×B: does the architecture ranking change depending on dataset type? If CNN is best for clean but not for noisy — that is an interaction.

SS Decomposition

SS_total = SS_A + SS_B + SS_AB + SS_within

Source	SS	df
Factor A (arch)	SS_A	a−1 = 2
Factor B (data)	SS_B	b−1 = 1
Interaction A×B	SS_AB	(a−1)(b−1) = 2
Within (error)	SS_W	ab(n−1) = 6

With a = 3 and b = 2, an error df of 6 corresponds to n = 2 observations per cell.
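Here is a rough sketch of the decomposition for a balanced design. The post only gives cell means, so the two replicates per cell below are hypothetical: each is the cell mean minus or plus 0.005, which keeps the means identical to the table but makes the within-cell spread (and therefore every F value) purely illustrative. The function name `two_way_ss` is also invented for this example.

```python
from statistics import fmean

# HYPOTHETICAL replicates: cell mean -/+ 0.005, n = 2 per cell.
# Cell means match the table above; the spread is made up for illustration.
data = {
    ("Transformer", "clean"): [0.866, 0.876],
    ("Transformer", "noisy"): [0.818, 0.828],
    ("LSTM", "clean"): [0.804, 0.814],
    ("LSTM", "noisy"): [0.767, 0.777],
    ("CNN", "clean"): [0.873, 0.883],
    ("CNN", "noisy"): [0.836, 0.846],
}


def two_way_ss(data):
    """Sums of squares and df for a balanced two-way design (equal n per cell)."""
    a_levels = sorted({a for a, _ in data})
    b_levels = sorted({b for _, b in data})
    a, b = len(a_levels), len(b_levels)
    n = len(next(iter(data.values())))
    grand = fmean(x for cell in data.values() for x in cell)
    a_mean = {i: fmean(x for j in b_levels for x in data[i, j]) for i in a_levels}
    b_mean = {j: fmean(x for i in a_levels for x in data[i, j]) for j in b_levels}
    cell_mean = {key: fmean(cell) for key, cell in data.items()}
    ss = {
        "A": n * b * sum((m - grand) ** 2 for m in a_mean.values()),
        "B": n * a * sum((m - grand) ** 2 for m in b_mean.values()),
        # Interaction: what is left of each cell mean after removing both main effects.
        "AB": n * sum(
            (cell_mean[i, j] - a_mean[i] - b_mean[j] + grand) ** 2
            for i in a_levels
            for j in b_levels
        ),
        "within": sum((x - cell_mean[key]) ** 2 for key, cell in data.items() for x in cell),
        "total": sum((x - grand) ** 2 for cell in data.values() for x in cell),
    }
    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "within": a * b * (n - 1)}
    return ss, df


ss, df = two_way_ss(data)
ms_within = ss["within"] / df["within"]
for source in ("A", "B", "AB"):
    f_stat = (ss[source] / df[source]) / ms_within
    print(f"{source}: SS={ss[source]:.6f}  df={df[source]}  F={f_stat:.2f}")
print(f"within: SS={ss['within']:.6f}  df={df['within']}")
```

The four components add back up to SS_total, which is a handy sanity check on any hand-rolled decomposition.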

Interaction Interpretation

If interaction is significant: the architecture effect changes depending on dataset. You cannot make a single architecture recommendation without specifying the dataset. CNN may be best for clean data but not noisy data.

If interaction is NOT significant: the data give no evidence that the architecture effect depends on dataset type, so the two main effects can be read as additive. You can recommend the best architecture regardless of dataset.

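In an interaction plot of the cell means (accuracy against dataset type, one line per architecture), roughly parallel lines suggest no interaction. That is the pattern here: each architecture's accuracy drops by a similar amount from clean to noisy data (0.048 for Transformer, 0.037 for LSTM and CNN). The architecture ranking is stable: CNN > Transformer > LSTM regardless of dataset type. Whether the small difference in drops is significant still has to come from the F test on SS_AB.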
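A quick descriptive check of that pattern, using only the cell means from the table. This is a sketch, not a significance test: it shows the per-architecture drop and the ranking within each dataset type, which is what an interaction plot displays.

```python
# Cell means copied from the table above.
cell_means = {
    "Transformer": {"clean": 0.871, "noisy": 0.823},
    "LSTM": {"clean": 0.809, "noisy": 0.772},
    "CNN": {"clean": 0.878, "noisy": 0.841},
}

# Clean-to-noisy drop per architecture: equal drops mean perfectly parallel lines.
drops = {arch: round(m["clean"] - m["noisy"], 3) for arch, m in cell_means.items()}

# Architecture ranking within each dataset type, best first.
ranking = {
    ds: sorted(cell_means, key=lambda arch: cell_means[arch][ds], reverse=True)
    for ds in ("clean", "noisy")
}

for arch, drop in drops.items():
    print(f"{arch}: drop = {drop:.3f}")
for ds, order in ranking.items():
    print(f"{ds}: {' > '.join(order)}")
```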

Repeated-Measures ANOVA

When: the same experimental units (folds) are measured under all conditions.

ML use case: the same 4 CV folds are evaluated on all 3 architectures. Because the same fold data is used across architectures, observations within a fold are correlated. Running a one-way ANOVA anyway ignores this correlation: fold-to-fold differences stay in the error term and inflate the residual variance.

Why It Is More Powerful

In one-way ANOVA, fold-to-fold variability (some folds are just harder) ends up in SS_within, inflating the denominator. Repeated-measures ANOVA partitions out that fold-level variance:

SS_total = SS_between_subjects + SS_within_subjects

SS_within_subjects = SS_treatment + SS_error

Removing SS_between_subjects (fold-to-fold differences) from the error term gives a smaller denominator → larger F → higher power.
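To see the effect on F, here is a small sketch. The post gives no raw numbers for this design, so the 4×3 accuracy grid below is hypothetical and chosen so that some folds are clearly harder than others. The same numbers are then analysed twice: once with the fold variance removed, once with it left in the error term as a one-way ANOVA would do.

```python
from statistics import fmean

# HYPOTHETICAL accuracies: rows are the same 4 CV folds, columns are 3 architectures.
scores = [
    [0.80, 0.76, 0.83],  # fold 1 (a hard fold)
    [0.85, 0.81, 0.87],  # fold 2
    [0.82, 0.79, 0.86],  # fold 3
    [0.87, 0.82, 0.89],  # fold 4 (an easy fold)
]
n_folds, k = len(scores), len(scores[0])

grand = fmean(x for row in scores for x in row)
fold_means = [fmean(row) for row in scores]
arch_means = [fmean(col) for col in zip(*scores)]

ss_total = sum((x - grand) ** 2 for row in scores for x in row)
ss_subjects = k * sum((m - grand) ** 2 for m in fold_means)       # fold-to-fold
ss_treatment = n_folds * sum((m - grand) ** 2 for m in arch_means)  # architecture
# Error: what remains after removing both the fold and the architecture effect.
ss_error = sum(
    (scores[i][j] - fold_means[i] - arch_means[j] + grand) ** 2
    for i in range(n_folds)
    for j in range(k)
)

df_treatment = k - 1
df_error = (k - 1) * (n_folds - 1)
f_rm = (ss_treatment / df_treatment) / (ss_error / df_error)

# One-way ANOVA on the same numbers: fold variance stays in the denominator.
df_within_oneway = k * (n_folds - 1)
f_oneway = (ss_treatment / df_treatment) / ((ss_subjects + ss_error) / df_within_oneway)

print(f"SS_subjects={ss_subjects:.5f}  SS_treatment={ss_treatment:.5f}  SS_error={ss_error:.5f}")
print(f"F_repeated_measures={f_rm:.1f}  F_one_way={f_oneway:.1f}")
```

With data like this, where the fold effect is large relative to the residual noise, the repeated-measures F comes out far larger than the one-way F even though the treatment sum of squares is identical.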

Sphericity Assumption

Repeated-measures ANOVA requires sphericity: the variances of differences between all pairs of conditions must be equal.

Mauchly's test: H₀ is that sphericity holds. If p < 0.05, apply a correction.
Greenhouse-Geisser (GG): conservative. Use when ε < 0.75.
Huynh-Feldt (HF): less conservative. Use when ε ≥ 0.75.

Both corrections shrink the degrees of freedom, making the test more conservative: the numerator and denominator df are each multiplied by ε. ε = 1.0 means sphericity holds perfectly; ε < 1.0 indicates a violation, with a lower bound of 1/(k−1) for k conditions.
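For intuition, the Greenhouse-Geisser ε can be estimated from the covariance matrix of the conditions. The sketch below double-centers that matrix and applies the standard ratio ε = tr(S)² / ((k−1)·ΣS²ᵢⱼ). The scores are the same kind of hypothetical fold-by-architecture grid used for illustration only, and the helper names are invented here. The Huynh-Feldt estimate and Mauchly's test itself are not implemented.

```python
def gg_epsilon(scores):
    """Greenhouse-Geisser epsilon; rows = units (folds), columns = conditions."""
    n, k = len(scores), len(scores[0])
    col_means = [sum(col) / n for col in zip(*scores)]
    # Sample covariance matrix of the k conditions.
    cov = [
        [
            sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in scores) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]
    # Double-center: subtract row and column means, add back the grand mean.
    # cov is symmetric, so column means equal row means.
    row_avg = [sum(r) / k for r in cov]
    grand = sum(row_avg) / k
    centered = [
        [cov[i][j] - row_avg[i] - row_avg[j] + grand for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(v * v for r in centered for v in r)
    return trace ** 2 / ((k - 1) * sum_sq)


def corrected_df(eps, n, k):
    """Multiply both the treatment and the error df by epsilon."""
    return eps * (k - 1), eps * (k - 1) * (n - 1)


def suggested_correction(eps):
    """Rule of thumb from the text, applied once Mauchly's test rejects."""
    return "Greenhouse-Geisser" if eps < 0.75 else "Huynh-Feldt"


# HYPOTHETICAL accuracies: 4 folds (rows) x 3 architectures (columns).
scores = [
    [0.80, 0.76, 0.83],
    [0.85, 0.81, 0.87],
    [0.82, 0.79, 0.86],
    [0.87, 0.82, 0.89],
]
eps = gg_epsilon(scores)
df1, df2 = corrected_df(eps, n=len(scores), k=len(scores[0]))
print(f"epsilon={eps:.3f}  corrected df=({df1:.2f}, {df2:.2f})  -> {suggested_correction(eps)}")
```

With only two conditions there is a single set of differences, so ε is always 1 and no correction is needed.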

Decision Framework

Design	Factors	Observations	ML Use Case
One-Way	1	Independent (between)	Compare k architectures on separate folds
Two-Way	2	Independent (between)	Architecture × dataset size; arch × regularization
Repeated-Measures	1+	Correlated (within)	Same folds evaluated on all models
Mixed ANOVA	1 between + 1 within	Both	Architecture (between) × training stage (within)

Key question: are the same experimental units (folds, subjects, users) measured under multiple conditions? If yes → repeated-measures or mixed.
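The table boils down to counting how many factors vary between units and how many vary within units. A toy helper, hypothetical and only as detailed as the table itself, makes that explicit:

```python
def choose_anova_design(n_between: int, n_within: int) -> str:
    """Map factor counts to a row of the decision table (illustrative helper)."""
    if n_between < 0 or n_within < 0 or n_between + n_within == 0:
        raise ValueError("need at least one factor")
    if n_between and n_within:
        return "Mixed ANOVA"          # some factors between units, some within
    if n_within:
        return "Repeated-Measures"    # same units measured under every condition
    if n_between == 1:
        return "One-Way"
    if n_between == 2:
        return "Two-Way"
    return "Factorial ANOVA (more factors than the table covers)"


# Same folds evaluated on all models: architecture is a within-units factor.
print(choose_anova_design(n_between=0, n_within=1))
# Architecture x dataset type on separate runs: two between-units factors.
print(choose_anova_design(n_between=2, n_within=0))
```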

Post-Hoc Tests

A significant ANOVA should be followed by a post-hoc test whenever you need to know which groups differ. The omnibus F only tells you "at least one group differs."

Test	Controls	Conservative?	Use When
Tukey's HSD	FWER (all pairs)	Moderate	All pairwise comparisons
Bonferroni	FWER (specific pairs)	High	Specific comparisons planned in advance
Scheffé	FWER (all contrasts)	Highest	Custom contrasts (e.g., A vs average of B+C)

Use Tukey's HSD by default. Bonferroni is more conservative than Tukey for all-pairs comparisons — use it only when testing specific pre-planned comparisons. Scheffé handles arbitrary contrasts, not just pairwise.
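To see why Bonferroni gets expensive for all-pairs testing, it helps to count the comparisons. The sketch below computes the number of pairs for k groups and the per-test α Bonferroni would require. The "unadjusted" family-wise error rate assumes the tests are independent, which pairwise comparisons sharing groups are not, so treat it as a rough approximation. The function name is made up for this example.

```python
from math import comb


def bonferroni_alpha(k: int, alpha: float = 0.05):
    """Return (number of pairwise comparisons, per-test alpha) for k groups."""
    m = comb(k, 2)
    return m, alpha / m


m, alpha_adj = bonferroni_alpha(3)

# Approximate chance of at least one false positive with no correction,
# ASSUMING the m tests were independent (they are not exactly).
unadjusted_fwer = 1 - (1 - 0.05) ** m

print(f"k=3 -> {m} pairs, per-test alpha = {alpha_adj:.4f}")
print(f"unadjusted FWER (independence approximation) = {unadjusted_fwer:.3f}")
print(f"k=6 -> {bonferroni_alpha(6)[0]} pairs, per-test alpha = {bonferroni_alpha(6)[1]:.4f}")
```

The per-test threshold shrinks quickly as k grows, which is the practical reason to reserve Bonferroni for a handful of pre-planned comparisons.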

Code