Types of ANOVA
You ran a one-way ANOVA comparing three model architectures, got a significant F, and felt good about it — then a reviewer asks: "Did you consider the interaction between architecture and dataset size? And did you account for the fact that the same random seeds were used across architectures?" Those two questions correspond to two separate ANOVA designs. Picking the wrong one does not just waste power — it can miss the finding entirely or produce misleading results.
The Dataset
Throughout this post: three architectures (CNN, ResNet, Transformer) evaluated on two dataset scales (Small: 10K examples, Large: 100K examples), using the same 5 random seeds in each condition. This creates a full factorial design with both between-subjects (architecture is varied between model runs) and within-subjects (seeds are shared) structure.
| Seed | CNN-Small | CNN-Large | ResNet-Small | ResNet-Large | Trans-Small | Trans-Large |
|---|---|---|---|---|---|---|
| 1 | 0.782 | 0.801 | 0.831 | 0.862 | 0.801 | 0.819 |
| 2 | 0.791 | 0.815 | 0.849 | 0.878 | 0.813 | 0.827 |
| 3 | 0.778 | 0.807 | 0.822 | 0.855 | 0.807 | 0.820 |
| 4 | 0.784 | 0.812 | 0.837 | 0.869 | 0.818 | 0.831 |
| 5 | 0.790 | 0.809 | 0.841 | 0.871 | 0.796 | 0.818 |
The choice of ANOVA depends on what questions you want to answer and how the data is structured.
One-Way ANOVA
The simplest case: one factor (architecture), independent groups, ignoring dataset scale.
Post 14 used this design. The model:
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
where $\alpha_i$ is the effect of architecture $i$ and $\varepsilon_{ij}$ is random error.
Question answered: Do at least two architectures differ in mean F1?
Limitation here: By ignoring dataset scale, you lose information about whether scale matters and whether architectures respond differently to scale.
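A minimal sketch of this test using `scipy.stats.f_oneway`. Feeding it only the Small-scale column of the table is an assumption made here for illustration; it treats the 15 scores as independent observations.

```python
import numpy as np
from scipy import stats

# Small-scale F1 scores from the table, one array per architecture.
# Restricting to the Small column is an illustrative assumption.
cnn = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
resnet = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
transformer = np.array([0.801, 0.813, 0.807, 0.818, 0.796])

# One-way ANOVA: 3 groups, 15 observations -> F with (2, 12) degrees of freedom.
f_stat, p_value = stats.f_oneway(cnn, resnet, transformer)
print(f"F(2, 12) = {f_stat:.2f}, p = {p_value:.2g}")
```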
Two-Way ANOVA
Now we add a second factor — dataset scale (Small vs Large). This lets us examine:
- Main effect of architecture: does mean F1 differ across CNN, ResNet, Transformer (averaging over scale)?
- Main effect of scale: does Large consistently outperform Small?
- Interaction: Does the benefit of Large data depend on which architecture is used?
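The model:
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
where $\alpha_i$ and $\beta_j$ are the main effects of architecture and scale, and $(\alpha\beta)_{ij}$ is the interaction effect.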
Interactions are often the most important finding. If ResNet benefits more from large data than CNN does, an interaction is present, and interpreting the main effects alone would be misleading.
Visualizing the Interaction
When group means do not change by the same amount across levels of the second factor, an interaction is present. In an interaction plot (mean F1 against scale, one line per architecture), this shows up as non-parallel lines.
Parallel lines mean no interaction: each architecture benefits the same amount from more data. In this dataset the lines are not parallel: averaged over seeds, ResNet gains about 0.031 F1 from Large data, CNN about 0.024, and Transformer about 0.016. Whether those differences are statistically reliable is what the interaction term tests; if they are, reporting only main effects would be misleading.
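The quantities an interaction plot would display can be computed directly. The sketch below builds the table of cell means from the data above and the Small-to-Large gain per architecture; unequal gains correspond to non-parallel lines. The variable names are illustrative.

```python
import pandas as pd

# F1 scores from the table above, keyed by (architecture, scale).
scores = {
    ("CNN", "Small"): [0.782, 0.791, 0.778, 0.784, 0.790],
    ("CNN", "Large"): [0.801, 0.815, 0.807, 0.812, 0.809],
    ("ResNet", "Small"): [0.831, 0.849, 0.822, 0.837, 0.841],
    ("ResNet", "Large"): [0.862, 0.878, 0.855, 0.869, 0.871],
    ("Transformer", "Small"): [0.801, 0.813, 0.807, 0.818, 0.796],
    ("Transformer", "Large"): [0.819, 0.827, 0.820, 0.831, 0.818],
}

# Tuple keys become a two-level index; unstack() moves scale into columns.
cell_means = pd.Series({k: sum(v) / len(v) for k, v in scores.items()}).unstack()

# Gain from more data, per architecture. Equal gains would mean parallel lines.
gain = cell_means["Large"] - cell_means["Small"]
print(cell_means.round(4))
print(gain.round(4))
```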
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Build factorial dataset (architecture x scale, 5 seeds each)
cnn_small = np.array([0.782, 0.791, 0.778, 0.784, 0.790])
cnn_large = np.array([0.801, 0.815, 0.807, 0.812, 0.809])
resnet_small = np.array([0.831, 0.849, 0.822, 0.837, 0.841])
resnet_large = np.array([0.862, 0.878, 0.855, 0.869, 0.871])
trans_small = np.array([0.801, 0.813, 0.807, 0.818, 0.796])
trans_large = np.array([0.819, 0.827, 0.820, 0.831, 0.818])
df = pd.DataFrame({
    'F1': np.concatenate([cnn_small, cnn_large, resnet_small, resnet_large,
                          trans_small, trans_large]),
    'Architecture': (['CNN']*5 + ['CNN']*5 + ['ResNet']*5 + ['ResNet']*5 +
                     ['Transformer']*5 + ['Transformer']*5),
    'Scale': (['Small']*5 + ['Large']*5 + ['Small']*5 + ['Large']*5 +
              ['Small']*5 + ['Large']*5)
})
model = ols('F1 ~ C(Architecture) * C(Scale)', data=df).fit()
print(anova_lm(model, typ=2))
                            sum_sq  df        F    PR(>F)
C(Architecture)           0.013090   2  523.598  7.95e-24
C(Scale)                  0.001664   1  133.150  9.21e-12
C(Architecture):C(Scale)  0.000064   2    2.554  9.74e-02
Residual                  0.000299  24
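Main effects of architecture ($F(2, 24) = 523.60$, $p < 0.001$) and scale ($F(1, 24) = 133.15$, $p < 0.001$) are both highly significant. The interaction ($F(2, 24) = 2.55$, $p = 0.097$) is marginal: suggestive but not conclusive at $\alpha = 0.05$ with only 5 seeds per cell.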
Repeated Measures ANOVA
In the original post-14 design, the same 5 seeds were used for all three architectures. This is a repeated-measures (within-subjects) design: the seed is a unit that is "measured" under each architecture condition.
Standard ANOVA ignores this seed-level correlation. Repeated measures ANOVA partitions out the between-seed variance, leaving only the within-seed (architecture-to-architecture) variation for the F-test. This produces a more powerful test when seed-to-seed variability is large.
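The partitioning can be written out by hand. The sketch below uses the Small-scale scores as a seeds-by-architectures matrix and computes both F-statistics from the same sums of squares; the only difference is whether the between-seed term stays in the error. It is a hand-rolled illustration, not a replacement for a library routine, and the variable names are chosen here.

```python
import numpy as np

# Rows = seeds 1..5, columns = CNN, ResNet, Transformer (Small-scale scores).
scores = np.array([
    [0.782, 0.831, 0.801],
    [0.791, 0.849, 0.813],
    [0.778, 0.822, 0.807],
    [0.784, 0.837, 0.818],
    [0.790, 0.841, 0.796],
])
n_seeds, n_arch = scores.shape
grand = scores.mean()

# Between-architecture sum of squares (the effect of interest).
ss_arch = n_seeds * ((scores.mean(axis=0) - grand) ** 2).sum()
# Between-seed sum of squares (nuisance variability).
ss_seed = n_arch * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()

# One-way ANOVA: seed variability stays in the error term.
ss_err_oneway = ss_total - ss_arch
f_oneway = (ss_arch / (n_arch - 1)) / (ss_err_oneway / (n_arch * (n_seeds - 1)))

# Repeated measures: seed variability is removed from the error term,
# at the cost of fewer error degrees of freedom.
ss_err_rm = ss_total - ss_arch - ss_seed
f_rm = (ss_arch / (n_arch - 1)) / (ss_err_rm / ((n_arch - 1) * (n_seeds - 1)))

print(f"one-way F = {f_oneway:.2f}, repeated measures F = {f_rm:.2f}")
```

The numerator is identical in both tests. The repeated measures F is larger only when the seed term removes more error variance than the lost degrees of freedom cost.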
Key assumption: Sphericity — the variances of differences between all pairs of conditions are equal. Mauchly's test checks this. When violated, use Greenhouse-Geisser or Huynh-Feldt corrections (which reduce the degrees of freedom).
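Sphericity is a statement about difference scores, so an informal first look is to compute the variance of each pairwise difference across seeds. This sketch is only an eyeball check on the Small-scale scores, not Mauchly's test, and with 5 seeds the variance estimates are very noisy.

```python
from itertools import combinations

import numpy as np

# Rows = seeds 1..5, columns = CNN, ResNet, Transformer (Small-scale scores).
scores = np.array([
    [0.782, 0.831, 0.801],
    [0.791, 0.849, 0.813],
    [0.778, 0.822, 0.807],
    [0.784, 0.837, 0.818],
    [0.790, 0.841, 0.796],
])
names = ["CNN", "ResNet", "Transformer"]

# Sphericity says these three variances are equal in the population.
diff_vars = {}
for i, j in combinations(range(len(names)), 2):
    diff = scores[:, i] - scores[:, j]          # per-seed difference score
    diff_vars[(names[i], names[j])] = diff.var(ddof=1)

for pair, var in diff_vars.items():
    print(pair, f"{var:.2e}")
```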
import pingouin as pg
# Repeated measures: same seeds across architectures
rm_data = pd.DataFrame({
    'Seed': list(range(1, 6)) * 3,
    'Architecture': ['CNN']*5 + ['ResNet']*5 + ['Transformer']*5,
    'F1': np.concatenate([cnn_small, resnet_small, trans_small])
})
result = pg.rm_anova(data=rm_data, dv='F1', within='Architecture', subject='Seed')
print(result[['Source', 'F', 'p-unc', 'ng2']])
         Source        F      p-unc    ng2
0  Architecture  93.8542  2.453e-05  0.937
The repeated measures F = 93.85 is larger than the one-way F = 46.22 from post 14, because the seed-level variance has been partitioned out. The effect size (generalized eta-squared = 0.937) is even larger than the one-way estimate.
Mixed ANOVA
Combines between-subjects and within-subjects factors. In the architecture × scale design, if different seeds are used for Small vs Large (so scale is between-subjects) but every seed is run under all three architectures (so architecture is within-subjects), the design is mixed.
More commonly: treatment condition (Control vs Treatment) measured at multiple time points (pre/post) on the same subjects. The treatment is between-subjects; time is within-subjects.
Question answered: Does the outcome change differently over time in the treatment group than in the control group? (The Treatment × Time interaction is the key finding in clinical and product trials.)
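For the special case of two groups and two time points, the Treatment × Time interaction reduces to something simple: an independent-samples t-test on each subject's change score, with the interaction F equal to the square of that t. The sketch below illustrates this on simulated data. All numbers (group size, means, spreads) are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20  # hypothetical number of subjects per group

# Simulated pre/post scores; only the treatment group shifts by about 2 units.
pre_control = rng.normal(10.0, 2.0, n)
pre_treat = rng.normal(10.0, 2.0, n)
post_control = pre_control + rng.normal(0.0, 0.5, n)
post_treat = pre_treat + rng.normal(2.0, 0.5, n)

# Within-subject change removes each subject's baseline level.
change_control = post_control - pre_control
change_treat = post_treat - pre_treat

# Pooled-variance t-test on change scores; t**2 is the interaction F(1, 2n - 2).
t_stat, p_value = stats.ttest_ind(change_treat, change_control)
f_interaction = t_stat ** 2
print(f"interaction F(1, {2 * n - 2}) = {f_interaction:.1f}, p = {p_value:.2g}")
```

With more than two time points or groups this shortcut no longer applies, and a full mixed ANOVA routine is needed.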
MANOVA (Multivariate ANOVA)
When you have multiple related dependent variables, MANOVA tests them simultaneously rather than running separate ANOVAs.
In model evaluation: suppose each model produces both F1 score and inference latency. These are correlated (better models might be slower). Testing them separately inflates Type I error. MANOVA tests them jointly.
Advantages:
- Controls Type I error across multiple outcomes
- Detects patterns that exist across outcomes but not within any single one
- Can be more powerful than separate tests when outcomes are correlated
Tradeoff: Requires larger samples and more complex interpretation.
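The Type I error inflation from separate tests can be made concrete. The sketch below assumes the tests are independent, which is the worst case for this formula; correlated outcomes such as F1 and latency inflate less, but the familywise rate still exceeds the nominal level.

```python
alpha = 0.05

# Probability of at least one false positive across k independent tests.
familywise = {k: 1 - (1 - alpha) ** k for k in (1, 2, 5)}

for k, rate in familywise.items():
    print(f"{k} separate tests -> familywise error rate {rate:.4f}")
```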
from statsmodels.multivariate.manova import MANOVA
# Simulated: F1 and Latency for three architectures
np.random.seed(42)
manova_df = pd.DataFrame({
    'F1': np.concatenate([
        np.random.normal(0.785, 0.005, 5),
        np.random.normal(0.836, 0.010, 5),
        np.random.normal(0.807, 0.008, 5)
    ]),
    'Latency': np.concatenate([
        np.random.normal(12.1, 0.8, 5),   # CNN: fast
        np.random.normal(18.3, 1.2, 5),   # ResNet: slower
        np.random.normal(35.2, 2.1, 5)    # Transformer: slowest
    ]),
    'Architecture': ['CNN']*5 + ['ResNet']*5 + ['Transformer']*5
})
model = MANOVA.from_formula('F1 + Latency ~ Architecture', data=manova_df)
print(model.mv_test())
Multivariate linear model
================================================================
Architecture
----------------------------------------------------------------
Statistic Num DF Den DF F Value Pr > F
----------------------------------------------------------------
Wilks' lambda 0.0084 4.0 20.0 213.14 < 0.001
Pillai's trace 1.5428 4.0 22.0 63.29 < 0.001
================================================================
The highly significant MANOVA result indicates that architectures differ on the combined (F1, Latency) profile.
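To see what Wilks' lambda measures, it can be computed by hand as the ratio $\det(E) / \det(E + H)$, where $E$ is the within-group and $H$ the between-group sums-of-squares-and-cross-products matrix. The sketch below uses freshly simulated data with the same means and spreads as above, but a different random generator, so its value will not match the printed output.

```python
import numpy as np

rng = np.random.default_rng(42)

# One (5 x 2) array per architecture: columns are F1 and latency (simulated).
groups = [
    np.column_stack([rng.normal(0.785, 0.005, 5), rng.normal(12.1, 0.8, 5)]),
    np.column_stack([rng.normal(0.836, 0.010, 5), rng.normal(18.3, 1.2, 5)]),
    np.column_stack([rng.normal(0.807, 0.008, 5), rng.normal(35.2, 2.1, 5)]),
]
all_obs = np.vstack(groups)
grand_mean = all_obs.mean(axis=0)

# E: within-group scatter, pooled over groups.
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
# H: between-group scatter of the group mean vectors.
H = sum(
    len(g) * np.outer(g.mean(axis=0) - grand_mean, g.mean(axis=0) - grand_mean)
    for g in groups
)

# Values near 0 mean the groups are well separated; near 1 means they overlap.
wilks = np.linalg.det(E) / np.linalg.det(E + H)
print(f"Wilks' lambda = {wilks:.4f}")
```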
Choosing the Right ANOVA
| Your Situation | Use |
|---|---|
| 1 factor, independent groups | One-way ANOVA |
| 2+ factors, independent groups | Two-way (Factorial) ANOVA |
| 1 factor, same subjects/seeds across levels | Repeated measures ANOVA |
| Mix of between and within factors | Mixed ANOVA |
| Multiple related dependent variables | MANOVA |
The key question to answer first: are the same experimental units (seeds, subjects, users) measured under multiple conditions? If yes, repeated measures or mixed designs apply.
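The table can also be read as a small decision procedure. `suggest_anova` below is a hypothetical helper written for this post, not a library function, and it encodes only the five rows above.

```python
def suggest_anova(n_between: int, n_within: int, n_outcomes: int = 1) -> str:
    """Map a design to a row of the table above (hypothetical helper)."""
    if n_outcomes > 1:
        return "MANOVA"                      # multiple related dependent variables
    if n_between > 0 and n_within > 0:
        return "Mixed ANOVA"                 # both kinds of factor present
    if n_within > 0:
        return "Repeated measures ANOVA"     # same units measured across levels
    if n_between >= 2:
        return "Two-way (Factorial) ANOVA"   # several independent-group factors
    return "One-way ANOVA"


# Architecture compared across shared seeds: one within-subjects factor.
print(suggest_anova(n_between=0, n_within=1))
```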
Factorial Designs: The Full Picture
With 3 factors (A, B, C), you have:
- 3 main effects: A, B, C
- 3 two-way interactions: A×B, A×C, B×C
- 1 three-way interaction: A×B×C
Higher-order interactions are hard to interpret and require exponentially more data. Start with the simplest model that answers your question. Add complexity only when theory or significant lower-order terms justify it.
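The count of effects follows a pattern: with $k$ factors there are $2^k - 1$ main effects and interactions in the full model. A short sketch that enumerates them for any list of factor names:

```python
from itertools import combinations

factors = ["A", "B", "C"]

# Every non-empty subset of factors is one term in the full factorial model.
effects = [
    "×".join(combo)
    for order in range(1, len(factors) + 1)
    for combo in combinations(factors, order)
]
print(effects)
# → ['A', 'B', 'C', 'A×B', 'A×C', 'B×C', 'A×B×C']
```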
Related Concepts
One-way ANOVA (post 14) is the foundation all variants build on. Two-way ANOVA extends it to factorial structure and is equivalent to a linear regression with categorical predictors and their interaction terms. Repeated measures ANOVA generalizes the paired t-test (post 7) to more than two conditions: both use within-unit structure to increase power by partitioning out nuisance variability. MANOVA connects to multivariate regression and principal component analysis, which are the natural tools when many correlated outcomes must be analyzed jointly. The ANOVA assumptions covered in post 15 apply to each variant; repeated measures adds sphericity as an additional assumption.
Honest Limitations
Factorial designs grow exponentially: 3 factors with 3 levels each require $3^3 = 27$ cells, and with 5 replications each, that is 135 model evaluations. In practice, computational budgets force sparse designs. Be honest about which interactions you can and cannot detect given your sample size: an underpowered factorial experiment that finds only main effects is not evidence that interactions do not exist. Sequential designs, where you first test main effects and only pursue interactions if warranted, can be more resource-efficient.
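The budget arithmetic is worth scripting before committing to a design. A minimal sketch, with the level counts and replication number taken from the example above:

```python
from math import prod

levels = [3, 3, 3]   # number of levels for each factor
replications = 5     # runs (e.g. seeds) per cell

cells = prod(levels)            # one cell per combination of levels
runs = cells * replications     # total model evaluations
print(f"{cells} cells, {runs} runs")
```

Adding a fourth three-level factor to `levels` triples both numbers.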
Test Your Understanding
- In the two-way ANOVA output, the interaction term had $p = 0.097$. What does this mean for interpreting the main effects of architecture and scale? Should you report the main effects as if the interaction does not exist?
- The repeated measures F = 93.85 was larger than the one-way F = 46.22 for the same data. Explain conceptually why accounting for seed-level correlation increases the F-statistic.
- You plan a mixed ANOVA: 3 model architectures (between-subjects) evaluated at 3 time points during training (within-subjects). How many cells does this design have, and what is the key interaction effect you would want to test?
- MANOVA tested F1 and Latency jointly. Would you get the same conclusion by running two separate ANOVAs on F1 and Latency? What is the risk of the separate-tests approach?
- Mauchly's test for sphericity in your repeated measures ANOVA returns $p < 0.05$. What does this mean, and what correction should you apply to the degrees of freedom before interpreting the F-test?