
Chi-Square Test of Independence

Apr 11, 2026 • 11 min read • By Mohammed Vasim

Statistics • Math • Data Science

Your text classifier assigns documents to three categories: A, B, or C. You collect 170 labeled test documents and record what the model predicted vs what the true label was. The question: is the model's predicted class related to the true class at all — or are predictions essentially random with respect to the actual label?

When you have two categorical variables and want to know if they are associated, the chi-square test of independence is the tool.

What Independence Means

H₀: The row variable (actual class) and column variable (predicted class) are independent — knowing the actual class gives no information about which class the model predicts. The model's predictions are statistically unrelated to the truth.

H₁: The variables are dependent — there is a statistical association. Predictions and true classes are related.

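The inversion: in many applied settings (a drug safety check, an assumption check before fitting a model) we hope not to reject H₀. For a classifier, we want to reject it. Failing to reject means the data give no evidence that predictions track the true label any better than chance. A significant chi-square is evidence that predictions carry information about the true class, but it does not say the association runs the right way: a model that consistently mislabels A as B would also be significant. The diagonal of the table, examined later through the residuals, settles that.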

This same framework applies to non-classifier questions — e.g., "Is political affiliation associated with social media platform?" — with no natural predictor/outcome distinction.

The Anchor: 3×3 Contingency Table

text

                Predicted A   Predicted B   Predicted C   Row Total
Actual A                 45            10             5          60
Actual B                 12            38            10          60
Actual C                  8             7            35          50
Column Total             65            55            50         170

Row variable: actual class (A, B, C)
Column variable: predicted class (A, B, C)
Cell O_ij: count of documents with actual class i and predicted class j
N = 170 documents total

Expected Frequencies Under Independence

Derivation — not just a formula:

If actual class and predicted class are independent:

P(row i AND col j) = P(row i) × P(col j)

where

P(row i) = R_i / N (proportion of actual class i)
P(col j) = C_j / N (proportion of predicted class j)

Expected count: E_ij = N × (R_i/N) × (C_j/N) = R_i × C_j / N

This is independence directly. No memorization needed — it comes straight from the definition.

All 9 expected frequencies:

Cell	Formula	E_ij
(A, pred A)	60 × 65 / 170	22.94
(A, pred B)	60 × 55 / 170	19.41
(A, pred C)	60 × 50 / 170	17.65
(B, pred A)	60 × 65 / 170	22.94
(B, pred B)	60 × 55 / 170	19.41
(B, pred C)	60 × 50 / 170	17.65
(C, pred A)	50 × 65 / 170	19.12
(C, pred B)	50 × 55 / 170	16.18
(C, pred C)	50 × 50 / 170	14.71

All E_ij > 5 — chi-square approximation is valid.
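The derivation above maps almost directly onto NumPy: the outer product of the row and column totals, divided by N, yields every E_ij at once. The following is a minimal sketch on the anchor table. Variable names are illustrative, and the cross-check against `scipy.stats.contingency.expected_freq` is only there to confirm the arithmetic.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

observed = np.array([
    [45, 10,  5],   # Actual A
    [12, 38, 10],   # Actual B
    [ 8,  7, 35],   # Actual C
])

row_totals = observed.sum(axis=1)   # R_i: [60, 60, 50]
col_totals = observed.sum(axis=0)   # C_j: [65, 55, 50]
n = observed.sum()                  # N = 170

# E_ij = R_i * C_j / N for all cells, via an outer product
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))

# Sanity checks: agrees with SciPy, and the smallest expected count exceeds 5
assert np.allclose(expected, expected_freq(observed))
print("smallest expected count:", round(float(expected.min()), 2))
```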

The Test Statistic

χ² = Σᵢⱼ (O_ij − E_ij)² / E_ij

Why this formula: squaring makes every deviation positive and amplifies large ones. Dividing by E_ij normalizes: a 10-unit gap from E=15 is far more surprising than from E=1000. Under H₀, and with large enough expected counts, the sum approximately follows a chi-square distribution; see the chi-square distribution post for the derivation.

All 9 cells computed:

Cell	O	E	O−E	(O−E)²	(O−E)²/E
(A, pred A)	45	22.94	22.06	486.59	21.21
(A, pred B)	10	19.41	−9.41	88.58	4.56
(A, pred C)	5	17.65	−12.65	159.95	9.06
(B, pred A)	12	22.94	−10.94	119.71	5.22
(B, pred B)	38	19.41	18.59	345.52	17.80
(B, pred C)	10	17.65	−7.65	58.48	3.31
(C, pred A)	8	19.12	−11.12	123.60	6.47
(C, pred B)	7	16.18	−9.18	84.21	5.21
(C, pred C)	35	14.71	20.29	411.85	28.01

(Each column is computed from the unrounded E_ij and rounded only for display.)

χ² = 21.21 + 4.56 + 9.06 + 5.22 + 17.80 + 3.31 + 6.47 + 5.21 + 28.01 = 100.85
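The hand computation can be reproduced in a few lines. This is a sketch under the assumption that the anchor table is held in a NumPy array; `contributions` and `chi2_stat` are illustrative names, not part of any library.

```python
import numpy as np

observed = np.array([[45, 10, 5], [12, 38, 10], [8, 7, 35]], dtype=float)
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contribution (O - E)^2 / E, then the sum over all nine cells
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

print(contributions.round(2))
print(f"chi-square = {chi2_stat:.2f}")
```

Because the code carries full precision throughout, an individual cell can differ in the last digit from a value obtained by rounding at every intermediate step.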

Degrees of Freedom

df = (r − 1) × (c − 1)

Derivation: the table has r × c = 9 cells. Given row totals (r constraints) and column totals (c constraints), with the grand total tying them together, you have r + c − 1 = 5 independent constraints. Free cells = 9 − 5 = 4 = (3−1)(3−1).

For a 2×2: df = 1. For a 2×3: df = 2. For our 3×3: df = 4.

Finding the p-Value

p = P(χ²(4) > 100.85)

χ² critical value at α=0.05, df=4: χ²_crit = 9.488

100.85 >> 9.488. p < 0.0001 (about 6.5 × 10⁻²¹). Reject H₀.

The model's predictions are statistically dependent on the actual class. Combined with the heavy diagonal in the table, that is strong evidence the classifier is working.
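As a rough check of the degrees of freedom, critical value, and p-value, the chi-square distribution in `scipy.stats` can be queried directly. The sketch below plugs in the rounded statistic 100.85 from the worked example. `sf` is the survival function (upper-tail probability), which stays accurate for very small p-values where `1 - cdf` would round to zero.

```python
from scipy import stats

chi2_stat = 100.85          # statistic from the worked example
r, c = 3, 3                 # table dimensions
dof = (r - 1) * (c - 1)     # (rows - 1) * (columns - 1)

# Upper-tail probability P(chi2(dof) > chi2_stat)
p_value = stats.chi2.sf(chi2_stat, dof)

# Cutoff that leaves 5% in the upper tail
critical = stats.chi2.ppf(0.95, dof)

print(f"df = {dof}")
print(f"critical value (alpha = 0.05) = {critical:.3f}")
print(f"p-value = {p_value:.2e}")
```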

Yates' Continuity Correction (2×2 Tables Only)

For 2×2 tables with small expected frequencies, Yates' correction reduces the statistic slightly:

χ²_corrected = Σᵢⱼ (|O_ij − E_ij| − 0.5)² / E_ij

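This corrects for the error of approximating a discrete distribution with a continuous one. scipy.stats.chi2_contingency applies it by default (correction=True), but only when df = 1, i.e. for 2×2 tables. For larger tables (2×3, 3×3, etc.) the flag has no effect, so passing correction=False there only documents intent.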
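To see the correction in action, the sketch below runs `chi2_contingency` with and without it. The 2×2 counts are hypothetical and chosen only for illustration. The second half reruns the anchor 3×3 table with both settings to show that the flag makes no difference once df > 1.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table with small counts (illustrative, not from the article)
small = np.array([[8, 4],
                  [3, 9]])

chi2_plain, p_plain, _, _ = stats.chi2_contingency(small, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(small, correction=True)
print(f"2x2 uncorrected: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}")
print(f"2x2 with Yates:  chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}")

# Anchor 3x3 table: df = 4, so the correction flag is ignored
anchor = np.array([[45, 10, 5], [12, 38, 10], [8, 7, 35]])
chi2_a_true = stats.chi2_contingency(anchor, correction=True)[0]
chi2_a_false = stats.chi2_contingency(anchor, correction=False)[0]
print("3x3 statistic unchanged by the flag:", bool(np.isclose(chi2_a_true, chi2_a_false)))
```

The corrected statistic is always the smaller of the two, so the corrected p-value is the more conservative one.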

Fisher's Exact Test (Small Expected Frequencies)

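When expected counts are small, the chi-square approximation is unreliable. The strict rule used in this post flags any E_ij < 5; a common relaxation (Cochran's rule) tolerates up to 20% of cells with E_ij < 5 as long as none falls below 1. Fisher's exact test sidesteps the approximation by computing an exact p-value from the hypergeometric distribution.

Fisher's exact test is most commonly run on 2×2 tables, and scipy.stats.fisher_exact has historically accepted only that shape. For larger tables with small expected counts: combine categories using domain knowledge, or estimate the p-value by Monte Carlo resampling (shuffle one variable's labels, recompute χ², repeat). Note that chi2_contingency(lambda_='log-likelihood') is not a Monte Carlo option: it switches the statistic to the log-likelihood ratio (G-test), which relies on the same large-sample approximation.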

Rule: if E_ij < 5 in any cell, look for Fisher's exact (2×2) or Monte Carlo chi-square (larger tables).
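One way to get a Monte Carlo p-value for a larger table is a hand-rolled permutation test. This is a sketch, not a SciPy API: `permutation_chi2_pvalue` is a hypothetical helper, and the sparse 3×3 counts are invented for illustration. The idea is to expand the table into one label pair per observation, shuffle the column labels (which breaks any association while keeping both margins fixed), and count how often the shuffled χ² is at least as large as the observed one.

```python
import numpy as np

def permutation_chi2_pvalue(table, n_resamples=2000, seed=0):
    """Monte Carlo p-value for the chi-square statistic of an r x c table.

    Hypothetical helper for illustration; not a SciPy function.
    """
    table = np.asarray(table, dtype=int)
    rng = np.random.default_rng(seed)

    # Expand the table into one (row label, column label) pair per observation
    rows, cols = np.indices(table.shape)
    row_labels = np.repeat(rows.ravel(), table.ravel())
    col_labels = np.repeat(cols.ravel(), table.ravel())

    # Shuffling keeps both margins fixed, so the expected counts never change
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()

    def chi2_for(shuffled_cols):
        counts = np.zeros(table.shape)
        np.add.at(counts, (row_labels, shuffled_cols), 1)
        return ((counts - expected) ** 2 / expected).sum()

    observed_stat = chi2_for(col_labels)
    # Count shuffles at least as extreme as the observed table
    hits = sum(
        chi2_for(rng.permutation(col_labels)) >= observed_stat - 1e-12
        for _ in range(n_resamples)
    )
    # The +1 terms count the observed table itself, so p is never exactly 0
    return observed_stat, (hits + 1) / (n_resamples + 1)

# Hypothetical sparse 3x3 table: every expected count is below 5
sparse = np.array([[3, 1, 0],
                   [1, 4, 1],
                   [0, 1, 5]])
stat, p_mc = permutation_chi2_pvalue(sparse)
print(f"chi-square = {stat:.2f}, Monte Carlo p = {p_mc:.4f}")
```

The resolution of the estimate is limited by `n_resamples`: with 2000 shuffles the smallest reportable p-value is 1/2001, so raise the count when a finer estimate matters.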

Effect Size: Cramér's V

A significant χ² says an association is detectable. Cramér's V says how strong it is:

V = √(χ² / (N × (min(r, c) − 1))) = √(100.85 / (170 × (3−1))) = √(100.85 / 340) = √0.2966 = 0.545

Cramér's V (Cohen's benchmarks for min(r,c)−1=2)	Interpretation
≈ 0.07	Small
≈ 0.21	Medium
≈ 0.35	Large

V = 0.545 — large effect. The model's predictions are strongly associated with the actual class.

Standardized Residuals

After rejecting H₀, identify which cells drive the association:

standardized_residual_ij = (O_ij − E_ij) / √E_ij

Residuals with absolute value above 2 mark cells that deviate notably from what independence predicts. Absolute values above 3 are particularly strong.

For the anchor, the three diagonal cells (correct predictions: A→A=45, B→B=38, C→C=35) will show large positive residuals, and the off-diagonal cells will show large negative residuals. This confirms the model predicts correctly more than independence would expect.

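All diagonal residuals exceed +4 and all off-diagonal residuals fall below −1.8. The model reliably predicts each class, especially class C (residual +5.29). A negative off-diagonal residual means that confusion occurs less often than independence would predict: the most strongly suppressed is actual A predicted as C (residual −3.01), while actual B predicted as C (residual −1.82) is the confusion closest to chance level.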
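A short sketch of the residual analysis, assuming the anchor table as a NumPy array. It also shows a useful identity: the squared residuals are exactly the per-cell χ² contributions, so they sum to the test statistic. The class labels and the threshold of 2 are taken from the text; the variable names are illustrative.

```python
import numpy as np

observed = np.array([[45, 10, 5], [12, 38, 10], [8, 7, 35]], dtype=float)
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Residual per cell: (O - E) / sqrt(E)
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))

# Squared residuals are the per-cell chi-square contributions
print(f"sum of squared residuals = {(residuals ** 2).sum():.2f}")

# List the cells whose residual magnitude exceeds 2
labels = ["A", "B", "C"]
for i, j in zip(*np.where(np.abs(residuals) > 2)):
    print(f"actual {labels[i]}, predicted {labels[j]}: {residuals[i, j]:+.2f}")
```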

Code

python

import numpy as np
from scipy import stats

observed = np.array([
    [45, 10,  5],   # Actual A: pred A, pred B, pred C
    [12, 38, 10],   # Actual B
    [ 8,  7, 35],   # Actual C
])

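# Chi-square test of independence. Yates' correction only applies when df = 1,
# so correction=False is a no-op for this 3x3 table; it is kept for explicitness.
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"χ²={chi2:.4f}, df={dof}, p={p:.3e}")  # scientific notation: p is far below 1e-6
print(f"Critical value (α=0.05, df={dof}): {stats.chi2.ppf(0.95, df=dof):.4f}")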
print(f"\nExpected frequencies:")
print(expected.round(2))

# Cramér's V
n = observed.sum()
min_dim = min(observed.shape) - 1
V = np.sqrt(chi2 / (n * min_dim))
print(f"\nCramér's V = {V:.4f}")

# Standardized residuals
std_residuals = (observed - expected) / np.sqrt(expected)
print("\nStandardized residuals:")
print(std_residuals.round(2))

# Fisher's exact test on a separate 2×2 table (demo counts)
obs_2x2 = np.array([[45, 15], [20, 90]])
odds_ratio, p_fisher = stats.fisher_exact(obs_2x2, alternative='two-sided')
print(f"\nFisher's exact (2×2 demo): OR={odds_ratio:.3f}, p={p_fisher:.3e}")