t-Test

Apr 11, 2026 · 7 min read · By mohammed.vasim
Tags: Statistics, Math, Data Science

You are comparing two versions of a classification model. Model A was trained on the full training set; Model B was trained with an additional data augmentation pipeline. You want to know whether B genuinely outperforms A. You do not know the true variance of performance scores across random seeds. You reach for a t-test — but which one? Getting the right variant matters: the wrong version either wastes your experiment's power or answers the wrong question entirely.

The Dataset

Throughout this post, the same experiment serves every example:

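  • Model A (no augmentation): F1 scores on 6 random seeds: 0.812, 0.821, 0.808, 0.819, 0.815, 0.824
  • Model B (with augmentation): F1 scores on 6 random seeds: 0.831, 0.847, 0.839, 0.844, 0.836, 0.843
  • Paired design: the same 6 random seeds were used for both models (same data splits, same initialization seeds)

$\bar{x}_A = 0.8165$, $s_A = 0.0060$, $\bar{x}_B = 0.8400$, $s_B = 0.0059$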
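As a quick check on the summary statistics, here is a minimal sketch that computes the sample mean and sample standard deviation for each model. The helper name `summarize` is illustrative and not part of the original post; the scores are the ones used in the post's code listing.

```python
import numpy as np

# F1 scores from the post's code listing, one value per random seed.
scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])


def summarize(scores):
    """Return (n, mean, sample standard deviation) for a 1-D array of scores."""
    # ddof=1 divides by n - 1, which is the estimator every t-test uses.
    return len(scores), scores.mean(), scores.std(ddof=1)


for name, scores in [("A", scores_a), ("B", scores_b)]:
    n, mean, sd = summarize(scores)
    print(f"Model {name}: n={n}, mean={mean:.4f}, sd={sd:.4f}")
```

Note that `ddof=1` matters: NumPy's default (`ddof=0`) divides by n and would understate the spread for a sample this small.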

Variant 1: One-Sample T-Test

Use this when you have one set of scores and want to compare its mean to a known or hypothesized value.

Example: Is Model A's F1 significantly above 0.80 (baseline threshold)?

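$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{0.8165 - 0.80}{0.00596/\sqrt{6}} = \frac{0.0165}{0.00243} \approx 6.78$$

With $df = n - 1 = 5$, the two-sided critical value at $\alpha = 0.05$ is $t^* = 2.571$. Since $6.78 > 2.571$, reject $H_0: \mu = 0.80$: Model A is significantly above 0.80.

| Phase | Formula | Values | Result |
|---|---|---|---|
| Standard error | $SE = s/\sqrt{n}$ | $0.00596/\sqrt{6}$ | 0.00243 |
| t statistic | $t = (\bar{x} - \mu_0)/SE$ | $(0.8165 - 0.80)/0.00243$ | 6.78 |
| Critical value | df=5 table | two-sided, $\alpha = 0.05$ | 2.571 |
| Decision | $t > t^*$? | $6.78 > 2.571$ | Reject $H_0$ |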
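The one-sample calculation can be reproduced by hand and compared against SciPy. This is a minimal sketch, assuming a two-sided test at α = 0.05; the names `mu_0`, `t_manual`, and `t_crit` are illustrative.

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
mu_0 = 0.80  # hypothesized mean (the baseline threshold)

n = len(scores_a)
se = scores_a.std(ddof=1) / np.sqrt(n)       # standard error of the mean
t_manual = (scores_a.mean() - mu_0) / se     # t statistic by hand
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1) # two-sided critical value, df = 5

# Reference value from SciPy's built-in one-sample test.
t_scipy, p_scipy = stats.ttest_1samp(scores_a, popmean=mu_0)

print(f"SE={se:.5f}, t (manual)={t_manual:.3f}, t (scipy)={t_scipy:.3f}")
print(f"critical value={t_crit:.3f}, reject H0: {abs(t_manual) > t_crit}")
```

The manual statistic and the SciPy statistic should agree to floating-point precision, which is a useful sanity check when adapting the formula to other data.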

Variant 2: Independent Two-Sample T-Test

Use this when you have two independent groups — different random seeds for A and B, for example.

Why Welch's is the default: The pooled t-test assumes equal variances ($\sigma_A^2 = \sigma_B^2$). In practice, you almost never know this. Welch's t-test does not assume equal variances, uses each group's own variance, and performs nearly as well as the pooled test when variances are equal while protecting you when they are not. Use Welch's by default: the cost of being wrong with the pooled test is higher than the cost of using Welch's unnecessarily.

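Welch's t-test:

$$t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$

Welch-Satterthwaite degrees of freedom:

$$df = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{(s_A^2/n_A)^2}{n_A - 1} + \dfrac{(s_B^2/n_B)^2}{n_B - 1}}$$

Intuitively: when two groups have unequal variances, the effective information about spread is less than what equal-variance groups would provide. The Satterthwaite formula reduces the effective df to reflect that one group's variance estimate dominates the standard error. With equal $n$ and equal $s^2$, df stays near $n_A + n_B - 2$. When variances differ, df shrinks, producing larger critical values and wider intervals.

For our data with $n_A = n_B = 6$: $t = 0.0235 / 0.00341 \approx 6.89$, $df \approx 10.0$.

Since $6.89 > t^*_{0.025,\,10} = 2.228$: reject $H_0$, Model B significantly outperforms A.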
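To make the Welch statistic and the Welch-Satterthwaite degrees of freedom concrete, here is a minimal sketch that computes both by hand and checks them against `scipy.stats.ttest_ind` with `equal_var=False`. The function name `welch_t` is illustrative.

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance t-test of mean(x) - mean(y)."""
    vx = x.var(ddof=1) / len(x)  # squared standard error contributed by x
    vy = y.var(ddof=1) / len(y)  # squared standard error contributed by y
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


t_welch, df_welch = welch_t(scores_b, scores_a)
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)  # two-sided p-value

reference = stats.ttest_ind(scores_b, scores_a, equal_var=False)
print(f"manual: t={t_welch:.3f}, df={df_welch:.2f}, p={p_welch:.2e}")
print(f"scipy:  t={reference.statistic:.3f}, p={reference.pvalue:.2e}")
```

Because the two sample variances here are almost identical, the computed df lands just under the pooled value of 10.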

Variant 3: Paired T-Test

Use this when the same seeds were used for both models. Pairing eliminates the variance due to seed-to-seed randomness, leaving only the variance in the difference between models. This makes the paired test more powerful than the independent test when the pairing is meaningful.

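Differences $d_i = B_i - A_i$: 0.019, 0.026, 0.031, 0.025, 0.021, 0.019

$$\bar{d} = 0.0235, \quad s_d = 0.00472, \quad t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{0.0235}{0.00193} \approx 12.19$$

With $df = 5$, $t^* = 2.571$: $12.19 > 2.571$, reject $H_0$.

The paired t statistic (12.19) is much larger than the Welch's t statistic (6.89) for the same data, because pairing removed the seed-level variance that the independent test had to absorb. Since the same seeds were used, the paired test is the correct choice here.

| Phase | Formula | Values | Result |
|---|---|---|---|
| Differences | $d_i = B_i - A_i$ | 0.019, 0.026, 0.031, 0.025, 0.021, 0.019 | $\bar{d} = 0.0235$ |
| Std of differences | from differences | $s_d$ | 0.00472 |
| Paired SE | $s_d/\sqrt{n}$ | $0.00472/\sqrt{6}$ | 0.00193 |
| t statistic | $\bar{d}/SE$ | $0.0235/0.00193$ | 12.19 |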
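The paired test is a one-sample t-test on the per-seed differences. A minimal sketch that shows this equivalence on the post's data (variable names are illustrative):

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])

# Per-seed differences: each seed acts as its own control.
differences = scores_b - scores_a
n = len(differences)

se_paired = differences.std(ddof=1) / np.sqrt(n)
t_paired_manual = differences.mean() / se_paired

# Both SciPy calls below should give the same statistic as the manual value.
t_rel = stats.ttest_rel(scores_b, scores_a).statistic
t_one_sample = stats.ttest_1samp(differences, popmean=0.0).statistic

print(f"manual={t_paired_manual:.3f}, ttest_rel={t_rel:.3f}, "
      f"ttest_1samp on differences={t_one_sample:.3f}")
```

The standard deviation of the differences is smaller than either model's own standard deviation, which is why the paired statistic is larger than the Welch statistic for the same scores.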

Which T-Test to Use?

| Situation | Test |
|---|---|
| One sample vs. known threshold | One-sample t-test |
| Two independent groups, possibly unequal variance | Welch's t-test (default) |
| Two independent groups, variance equality verified | Pooled t-test (rarely needed) |
| Paired observations (same seeds, same subjects) | Paired t-test |

Always check independence. If your fold scores are not independent (e.g., data leakage between folds), no t-test variant is valid.

Cohen's d: Effect Size

A significant p-value says the observed effect is unlikely under the null hypothesis. Cohen's d says how large the effect is.

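For two independent groups:

$$d = \frac{\bar{x}_B - \bar{x}_A}{s_p}, \qquad s_p = \sqrt{\frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}}$$

For paired design:

$$d_{\text{paired}} = \frac{\bar{d}}{s_d}$$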

| Cohen's d | Interpretation |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |

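For this dataset, $d \approx 3.98$ when the groups are treated as independent and $d \approx 4.98$ for the paired design. Either value is enormous: the augmentation produces an effect several times larger than the seed-to-seed variability. This is practically significant, not just statistically significant.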
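Both versions of Cohen's d are short to compute. The following is a minimal sketch; the function names are illustrative, and the independent-groups version assumes the pooled standard deviation as the denominator.

```python
import numpy as np

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])


def cohens_d_independent(x, y):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


def cohens_d_paired(x, y):
    """Cohen's d for paired data: mean difference over the std of the differences."""
    diff = x - y
    return diff.mean() / diff.std(ddof=1)


d_independent = cohens_d_independent(scores_b, scores_a)
d_paired = cohens_d_paired(scores_b, scores_a)
print(f"d (independent, pooled sd): {d_independent:.2f}")
print(f"d (paired): {d_paired:.2f}")
```

The two values answer slightly different questions, so it helps to state which denominator was used whenever d is reported.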

Non-Parametric Alternatives

When normality is violated (severely skewed scores, heavy outliers, or small $n$), use these instead:

Mann-Whitney U (non-parametric alternative to independent two-sample t-test): tests whether scores from group B tend to be higher than from group A, without assuming normality.

Wilcoxon signed-rank (non-parametric alternative to paired t-test): tests whether the differences tend to be positive, without assuming Normal differences.

```python
import numpy as np
from scipy import stats

# F1 scores per random seed; index i in both arrays is the same seed.
scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])
differences = scores_b - scores_a

# One-sample t-test: is Model A above 0.80?
t1, p1 = stats.ttest_1samp(scores_a, popmean=0.80)
print(f"One-sample t-test (A vs 0.80): t={t1:.3f}, p={p1:.4f}")

# Welch's t-test (independent, unequal variance)
t2, p2 = stats.ttest_ind(scores_b, scores_a, equal_var=False)
print(f"Welch's t-test (B vs A, independent): t={t2:.3f}, p={p2:.4f}")

# Paired t-test (same seeds)
t3, p3 = stats.ttest_rel(scores_b, scores_a)
print(f"Paired t-test (B vs A, paired): t={t3:.3f}, p={p3:.4f}")

# Cohen's d (paired): mean difference over the sample std of the differences
cohens_d_paired = differences.mean() / differences.std(ddof=1)
print(f"Cohen's d (paired): {cohens_d_paired:.3f}")

# Non-parametric alternatives
u_stat, p_mann = stats.mannwhitneyu(scores_b, scores_a, alternative='two-sided')
print(f"\nMann-Whitney U (non-parametric independent): U={u_stat}, p={p_mann:.4f}")

# For a two-sided test, SciPy reports the smaller of the two signed-rank sums.
w_stat, p_wilcox = stats.wilcoxon(scores_b, scores_a)
print(f"Wilcoxon signed-rank (non-parametric paired): W={w_stat}, p={p_wilcox:.4f}")
```

Output with a recent SciPy release. The two rank tests use exact p-values here; older versions, or versions that treat the two 0.019 differences as tied, may fall back to a normal approximation and print slightly different p-values.

```
One-sample t-test (A vs 0.80): t=6.783, p=0.0011
Welch's t-test (B vs A, independent): t=6.885, p=0.0000
Paired t-test (B vs A, paired): t=12.190, p=0.0001
Cohen's d (paired): 4.976

Mann-Whitney U (non-parametric independent): U=36.0, p=0.0022
Wilcoxon signed-rank (non-parametric paired): W=0.0, p=0.0312
```

Common Mistakes

Using pooled t-test when variances differ: Welch's test does not assume equal variances — use it by default. Running Levene's test first and then choosing the test based on the result is itself a form of double-dipping; just use Welch's.

Treating paired data as independent: This discards the within-pair information and reduces power. For same-seed experiments, always use the paired test.

Ignoring effect size: A significant t-test with Cohen's d = 0.1 is a different finding than one with a large d (0.8 or more). Report both.

One-tailed tests after seeing data: If you did not pre-specify a one-tailed test, do not use one post-hoc.
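If a directional hypothesis (B is better than A) was fixed before the experiment, the one-sided test can be requested directly rather than halving a p-value by hand. A minimal sketch, assuming SciPy 1.6 or later, where `ttest_rel` accepts the `alternative` argument:

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])

# Two-sided: H1 is "mean(B - A) != 0". This is SciPy's default.
two_sided = stats.ttest_rel(scores_b, scores_a)

# One-sided: H1 is "mean(B - A) > 0". Only valid if chosen before seeing the data.
one_sided = stats.ttest_rel(scores_b, scores_a, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.2e}")
print(f"one-sided p = {one_sided.pvalue:.2e}")
```

When the observed effect is in the hypothesized direction, the one-sided p-value is half the two-sided one, which is exactly why switching after the fact inflates the false-positive rate.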

Confidence Interval for the Difference

The CI for the mean difference (paired):

$$\bar{d} \pm t^*_{0.025,\,5} \cdot \frac{s_d}{\sqrt{n}} = 0.0235 \pm 2.571 \times 0.00193 = [0.0185,\ 0.0285]$$

The augmentation pipeline improves F1 by between 1.85 and 2.85 percentage points with 95% confidence. This is more useful than the p-value alone: it shows the range of plausible effect sizes.
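The interval can be computed from the same ingredients as the paired t statistic. A minimal sketch, assuming a 95% two-sided interval; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])
differences = scores_b - scores_a

n = len(differences)
mean_diff = differences.mean()
se = differences.std(ddof=1) / np.sqrt(n)

# Critical value for a 95% two-sided interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)

ci_lower = mean_diff - t_crit * se
ci_upper = mean_diff + t_crit * se
print(f"mean difference = {mean_diff:.4f}")
print(f"95% CI = [{ci_lower:.4f}, {ci_upper:.4f}]")
```

An interval that excludes zero corresponds to rejecting the two-sided null at the same α, so the CI carries the test result plus the size of the effect.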

The t-test applies the t-distribution (post 6) to hypothesis testing problems where $\sigma$ is unknown. It extends the Z-test (post 5) by replacing known variance with estimated variance. The paired t-test is the most common form; when measurements are repeated over time, repeated measures ANOVA (post 16) generalizes it to more than two time points. The comparison between Z and t (post 8) develops the convergence of these two tests as sample size grows. Cohen's d connects to effect size reporting and power analysis (post 9), which tells you how large a sample you need to reliably detect a given effect.

Honest Limitations

T-tests assume independent observations and approximately Normal data (or large enough $n$). Cross-validation fold scores violate independence because folds share training data: the training sets of different folds overlap heavily. Reporting paired t-tests on standard k-fold scores can therefore be anti-conservative. For rigorous model comparison, consider the corrected repeated k-fold cross-validation test (Bouckaert & Frank, 2004) or simply report effect sizes with bootstrap confidence intervals.
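As one way to report an effect size with a bootstrap interval, here is a minimal sketch of a percentile bootstrap for the mean paired difference across seeds. The resample count `n_boot` and the fixed seed are arbitrary choices for illustration, and with only 6 differences the interval is rough; it does not address the fold-dependence problem of k-fold scores.

```python
import numpy as np

scores_a = np.array([0.812, 0.821, 0.808, 0.819, 0.815, 0.824])
scores_b = np.array([0.831, 0.847, 0.839, 0.844, 0.836, 0.843])
differences = scores_b - scores_a

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_boot = 10_000                  # number of bootstrap resamples (arbitrary choice)

# Resample the paired differences with replacement; one row per bootstrap sample.
resamples = rng.choice(differences, size=(n_boot, len(differences)), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile interval: the middle 95% of the bootstrap distribution of the mean.
boot_lower, boot_upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean difference = {differences.mean():.4f}")
print(f"95% bootstrap CI = [{boot_lower:.4f}, {boot_upper:.4f}]")
```

Resampling whole pairs (here, their differences) keeps the paired structure intact; resampling the two score arrays separately would throw it away.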

Test Your Understanding

  1. You want to test whether Model A improves over a published baseline of F1 = 0.79. Which t-test variant do you use, and what are the hypotheses?
  2. For the paired design, why is the paired t statistic (12.19) so much larger than the Welch's statistic (6.89) for the same data? Work through the intuition in terms of what variance each test accounts for.
  3. You find Welch's t = 2.3 and Cohen's d = 0.3 for a model comparison. A colleague calls this "a real improvement." What additional context do you need before agreeing?
  4. Model C is evaluated on 5 seeds. Its mean F1 is 0.772, with high variance across seeds. Would you trust a paired t-test comparing this to Model A? What alternative would you use?
  5. Explain in plain language why Welch's degrees of freedom can be lower than $n_A + n_B - 2$, and what practical consequence this has for the critical value and p-value.
