← View series: statistics
~/blog
Covariance and Correlation
Mean and standard deviation each describe a single variable. They tell you where values cluster and how spread out they are — but they say nothing about whether two variables move together. When you want to know whether longer training runs produce better validation accuracy, or whether higher learning rates hurt convergence, a one-variable summary cannot answer that question. You need a measure of joint variation.
Covariance is that measure. Pearson's correlation coefficient is covariance with the units stripped out so you can compare relationships across different pairs of variables.
The Anchor Dataset
Anchor exception: Covariance and correlation require two paired variables, so the series-wide univariate CV accuracy dataset cannot demonstrate the concept. This post uses a bivariate dataset instead.
Seven training runs, each logged with training duration and the resulting validation accuracy:
training_hours = [1, 2, 3, 4, 5, 6, 7]
val_accuracy = [0.61, 0.72, 0.78, 0.83, 0.87, 0.88, 0.90]More training generally helps here — but by how much, and how consistently? Every formula, SVG, and code block that follows uses these exact values.
Phase 1 — Compute the Means
Before measuring joint variation, you need both variables centered. The center of each variable is its mean.
Mean of training_hours:
x̄ = (1 + 2 + 3 + 4 + 5 + 6 + 7) / 7 = 28 / 7 = 4.0 hours
Mean of val_accuracy:
ȳ = (0.61 + 0.72 + 0.78 + 0.83 + 0.87 + 0.88 + 0.90) / 7 = 5.59 / 7 = 0.799
Phase 2 — Compute the Deviation Pairs
Covariance works by asking, for each observation: did both variables deviate from their means in the same direction? Compute the deviation for each variable at each point:
| Run | xᵢ | yᵢ | xᵢ − x̄ | yᵢ − ȳ | (xᵢ − x̄)(yᵢ − ȳ) |
|---|---|---|---|---|---|
| 1 | 1 | 0.61 | −3.000 | −0.189 | +0.566 |
| 2 | 2 | 0.72 | −2.000 | −0.079 | +0.157 |
| 3 | 3 | 0.78 | −1.000 | −0.019 | +0.019 |
| 4 | 4 | 0.83 | 0.000 | +0.031 | 0.000 |
| 5 | 5 | 0.87 | +1.000 | +0.071 | +0.071 |
| 6 | 6 | 0.88 | +2.000 | +0.081 | +0.163 |
| 7 | 7 | 0.90 | +3.000 | +0.101 | +0.303 |
Every product is positive. Run 1 spent fewer hours than average and achieved lower accuracy than average — both deviations negative, their product positive. Run 7 trained longer and scored higher — both deviations positive, product positive. That consistent sign alignment is what covariance captures.
The dashed amber lines mark x̄ and ȳ. Every point sits in the lower-left quadrant (both negative) or the upper-right quadrant (both positive). No point falls in the upper-left or lower-right. That geometric picture is exactly what a large positive covariance looks like.
Phase 3 — Compute the Sample Covariance
Sum the products from Phase 2, then divide by n − 1:
Population covariance (if these 7 runs were the entire population):
Cov(X,Y) = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / N
Sample covariance (treating these runs as a sample of all possible training conditions):
Cov(X,Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
The anchor uses sample covariance — we are not exhaustively cataloging every possible training configuration, we sampled 7 runs. Dividing by n − 1 rather than n corrects for the fact that the sample mean is always closer to the sample values than the true population mean would be, which would otherwise cause us to underestimate the true covariance.
Substituting the values:
Cov = (0.566 + 0.157 + 0.019 + 0.000 + 0.071 + 0.163 + 0.303) / (7 − 1)
Cov = 1.279 / 6 = 0.213
This result, 0.213, means that on average each one-hour increase in training time is accompanied by a 0.213 increase in validation accuracy — in the joint unit of hours × accuracy. That last phrase is the problem.
The units problem: covariance has units of (hours × accuracy). If you had measured training time in minutes instead of hours, the covariance would be 60 × 0.213 = 12.78 — a completely different number for an identical relationship. You cannot compare a covariance of 0.213 on one pair of variables to a covariance of 3.7 on another pair. The raw number is uninterpretable without knowing the scales. This is exactly why Pearson's correlation coefficient exists.
Phase 4 — Normalize to Pearson r
Pearson's r divides covariance by the product of the two standard deviations. This removes the units and scales the result to a fixed range of −1 to +1.
r = Cov(X,Y) / (σₓ · σᵧ)
Why σₓ · σᵧ is the right denominator: σₓ · σᵧ is the maximum possible absolute covariance for two variables with those standard deviations — the covariance you would get if the relationship were perfectly linear (r = ±1). Dividing by σₓ · σᵧ normalizes the covariance to a fraction of its maximum, which is why r is always in [−1, 1]. The formal guarantee comes from the Cauchy-Schwarz inequality: |Cov(X,Y)|² ≤ Var(X) · Var(Y), which means |Cov| ≤ σₓ · σᵧ.
Algebraic bridge — the two forms are identical:
Start with the covariance-divided-by-SDs form and substitute the sample variance formulas:
r = Cov(X,Y) / (σₓ · σᵧ)
= [Σ(xᵢ−x̄)(yᵢ−ȳ) / (n−1)]
─────────────────────────────────────────────────
[√(Σ(xᵢ−x̄)²/(n−1))] × [√(Σ(yᵢ−ȳ)²/(n−1))]
= [Σ(xᵢ−x̄)(yᵢ−ȳ) / (n−1)]
────────────────────────────────────────────
[1/(n−1)] × √[Σ(xᵢ−x̄)² × Σ(yᵢ−ȳ)²]
= Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)² × Σ(yᵢ−ȳ)²]
The (n−1) in the numerator and denominator cancel. Both forms are identical — whether you compute covariance first or go straight to the direct formula, you get the same r.
Compute σₓ:
Σ(xᵢ − x̄)² = (−3)² + (−2)² + (−1)² + 0² + 1² + 2² + 3² = 9 + 4 + 1 + 0 + 1 + 4 + 9 = 28
σₓ = √(28 / 6) = √4.667 = 2.160
Compute σᵧ:
Σ(yᵢ − ȳ)² = (−0.189)² + (−0.079)² + (−0.019)² + (0.031)² + (0.071)² + (0.081)² + (0.101)²
= 0.03572 + 0.00624 + 0.00036 + 0.00096 + 0.00504 + 0.00656 + 0.01020 = 0.06509
σᵧ = √(0.06509 / 6) = √0.01085 = 0.1042
Compute r:
r = 0.213 / (2.160 × 0.1042) = 0.213 / 0.2251 = 0.948
The same result follows directly from the raw deviations without going through covariance first:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
r = 1.279 / √(28 × 0.06509) = 1.279 / √1.8225 = 1.279 / 1.3500 = 0.948
Both routes produce the same r because dividing numerator and denominator by (n−1) cancels out, leaving the formula structurally identical.
Interpretation Scale
r of 0.948 sits well above the threshold for a very strong positive relationship. As a practical guide:
| |r| range | Strength | Example in ML | |------------|---------|---------| | 0.90 − 1.00 | Very strong | Training steps vs training loss on a stable run | | 0.70 − 0.89 | Strong | Dataset size vs model accuracy (common finding) | | 0.50 − 0.69 | Moderate | Regularization strength vs generalization gap | | 0.00 − 0.49 | Weak to none | Random seed vs final accuracy |
At r = 0.948, the relationship between training duration and validation accuracy in this dataset is very strong. Adding one more hour of training reliably moves accuracy upward — though the diminishing returns are visible: run 1 to run 2 gained 0.11, while run 6 to run 7 gained only 0.02.
Trace Table
| Phase | Formula | Values | Result |
|---|---|---|---|
| Mean of X | Σxᵢ / n | (1+2+3+4+5+6+7) / 7 | 4.000 |
| Mean of Y | Σyᵢ / n | (0.61+…+0.90) / 7 | 0.799 |
| Sum of products | Σ(xᵢ−x̄)(yᵢ−ȳ) | 0.566+0.157+…+0.303 | 1.279 |
| Sample covariance | Σ products / (n−1) | 1.279 / 6 | 0.213 |
| σₓ | √[Σ(xᵢ−x̄)² / (n−1)] | √(28/6) | 2.160 |
| σᵧ | √[Σ(yᵢ−ȳ)² / (n−1)] | √(0.06509/6) | 0.104 |
| Pearson r | Cov / (σₓ·σᵧ) | 0.213 / (2.160 × 0.104) | 0.948 |
Code
import math
training_hours = [1, 2, 3, 4, 5, 6, 7]
val_accuracy = [0.61, 0.72, 0.78, 0.83, 0.87, 0.88, 0.90]
n = len(training_hours)
x_mean = sum(training_hours) / n
y_mean = sum(val_accuracy) / n
deviations_x = [x - x_mean for x in training_hours]
deviations_y = [y - y_mean for y in val_accuracy]
sum_products = sum(dx * dy for dx, dy in zip(deviations_x, deviations_y))
cov = sum_products / (n - 1)
var_x = sum(dx**2 for dx in deviations_x) / (n - 1)
var_y = sum(dy**2 for dy in deviations_y) / (n - 1)
sx = math.sqrt(var_x)
sy = math.sqrt(var_y)
r = cov / (sx * sy)
print(f"x̄ = {x_mean:.3f}, ȳ = {y_mean:.3f}")
print(f"Sample covariance: {cov:.4f}")
print(f"σₓ = {sx:.4f}, σᵧ = {sy:.4f}")
print(f"Pearson r = {r:.4f}")x̄ = 4.000, ȳ = 0.799
Sample covariance: 0.2133
σₓ = 2.1602, σᵧ = 0.1042
Pearson r = 0.9482Edge Cases
r = +1.0 — Perfect Positive Linear Relationship
Every point lies exactly on a line with a positive slope. For y = 2x + 1 with x = [1, 2, 3, 4, 5]:
import math
x = [1, 2, 3, 4, 5]
y = [2*xi + 1 for xi in x]
n = len(x)
xm = sum(x)/n; ym = sum(y)/n
dx = [xi - xm for xi in x]; dy = [yi - ym for yi in y]
cov = sum(a*b for a,b in zip(dx,dy))/(n-1)
sx = math.sqrt(sum(d**2 for d in dx)/(n-1))
sy = math.sqrt(sum(d**2 for d in dy)/(n-1))
print(f"r = {cov/(sx*sy):.4f}")r = 1.0000r = −1.0 — Perfect Negative Linear Relationship
Every point lies on a line with a negative slope. For y = -2x + 11:
y_neg = [-2*xi + 11 for xi in x]
dy_neg = [yi - sum(y_neg)/n for yi in y_neg]
cov_neg = sum(a*b for a,b in zip(dx,dy_neg))/(n-1)
sy_neg = math.sqrt(sum(d**2 for d in dy_neg)/(n-1))
print(f"r = {cov_neg/(sx*sy_neg):.4f}")r = -1.0000These perfect cases never appear in real data — real measurements have noise. But they define the scale: +1 and −1 are the absolute upper and lower bounds. Any real dataset with outliers or noise will produce |r| < 1.
r = 0 Does Not Mean No Relationship
This is the most consequential thing to understand about Pearson r. Consider seven model evaluations where the input is the regularization offset from zero:
reg_offset = [-3, -2, -1, 0, 1, 2, 3]
val_loss = [9, 4, 1, 0, 1, 4, 9]
n = len(reg_offset)
xm = sum(reg_offset) / n
ym = sum(val_loss) / n
dx = [x - xm for x in reg_offset]
dy = [y - ym for y in val_loss]
cov_q = sum(a*b for a,b in zip(dx,dy)) / (n-1)
sx_q = math.sqrt(sum(d**2 for d in dx) / (n-1))
sy_q = math.sqrt(sum(d**2 for d in dy) / (n-1))
r_q = cov_q / (sx_q * sy_q)
print(f"r = {r_q:.4f}")r = 0.0000Pearson r is exactly 0, but the relationship is val_loss = reg_offset² — perfectly determined. The positive products from the right side (both variables increase) exactly cancel the negative products from the left side (one increases while the other decreases), driving the sum to zero.
Red points are in the negative-product quadrants (upper-left), amber are in the positive-product quadrants (upper-right and lower-left). The symmetry is exact: the negatives and positives cancel. Pearson r only detects linear association.
Outlier Sensitivity
A single anomalous data point can reverse the correlation entirely. Add one run where a very long session (12 hours) produced unexpectedly low accuracy (0.50, perhaps from overfitting):
x_with_outlier = training_hours + [12]
y_with_outlier = val_accuracy + [0.50]
n2 = len(x_with_outlier)
xm2 = sum(x_with_outlier) / n2
ym2 = sum(y_with_outlier) / n2
dx2 = [x - xm2 for x in x_with_outlier]
dy2 = [y - ym2 for y in y_with_outlier]
cov2 = sum(a*b for a,b in zip(dx2,dy2)) / (n2-1)
sx2 = math.sqrt(sum(d**2 for d in dx2) / (n2-1))
sy2 = math.sqrt(sum(d**2 for d in dy2) / (n2-1))
r2 = cov2 / (sx2 * sy2)
print(f"r with outlier = {r2:.4f}")r with outlier = -0.2336One point flipped the correlation from +0.948 to −0.234. The outlier at (12, 0.50) sits in the upper-left quadrant relative to the new means — it contributes a large negative product that overwhelms the six consistent positive products from the original data.
Related Concepts and Limitations
Covariance and Pearson r build on variance and standard deviation, which you need to understand before these measures make sense. They unlock linear regression: the slope coefficient in simple linear regression is exactly Cov(X,Y) / Var(X), and the R² statistic is r². Understanding r makes regression coefficients interpretable.
Spearman's ρ (rank correlation) is the robust alternative when your data has outliers or the relationship is monotone but not linear. Instead of computing r on the raw values, Spearman first replaces each value with its rank, then runs the standard Pearson formula on those ranks. The shortcut formula is:
ρ = 1 − (6 Σdᵢ²) / (n(n² − 1))
where dᵢ is the difference in ranks for the i-th pair. On the anchor, both training_hours and val_accuracy are already in strictly increasing order, so every rank difference dᵢ = 0:
ρ = 1 − (6 × 0) / (7 × 48) = 1 − 0 = 1.000
Pearson r was 0.948; Spearman ρ is 1.000. The gap tells you the relationship is monotone but not perfectly linear — the accuracy gains per extra hour diminish over time (a concave curve), which Pearson r penalizes slightly because it expects a straight line. When Pearson r and Spearman ρ differ significantly, the relationship is either non-linear or influenced by outliers.
ranks_x = [1, 2, 3, 4, 5, 6, 7]
ranks_y = [1, 2, 3, 4, 5, 6, 7]
d_sq = sum((rx - ry)**2 for rx, ry in zip(ranks_x, ranks_y))
rho = 1 - (6 * d_sq) / (n * (n**2 - 1))
print(f"Spearman rho = {rho:.4f}")Spearman rho = 1.0000Kendall's τ (concordance) takes a different approach. It counts concordant pairs — pairs where both xᵢ > xⱼ and yᵢ > yⱼ — minus discordant pairs, then divides by the total number of pairs. On the anchor all 21 pairs are concordant (no inversions exist in a strictly increasing sequence), giving τ = (21 − 0) / 21 = 1.000. Kendall's τ is more interpretable than Spearman for small samples: τ = 0.6 means 60% of all pairwise comparisons agree in direction. Use it when you have ordinal data, small samples, or want a probability interpretation of the relationship's direction.
concordant = discordant = 0
for i in range(n):
for j in range(i + 1, n):
xi, yi = training_hours[i], val_accuracy[i]
xj, yj = training_hours[j], val_accuracy[j]
if (xi - xj) * (yi - yj) > 0:
concordant += 1
elif (xi - xj) * (yi - yj) < 0:
discordant += 1
tau = (concordant - discordant) / (n * (n - 1) // 2)
print(f"Kendall tau = {tau:.4f} (concordant={concordant}, discordant={discordant})")Kendall tau = 1.0000 (concordant=21, discordant=0)Point-biserial r handles the case where one variable is continuous and the other is binary (0/1). It is mathematically identical to Pearson r computed on a binary-coded variable — no special formula is needed. The ML use case is common: correlate a continuous feature (e.g. training_hours) with a binary label (e.g. did_converge ∈ {0, 1}). The result sits on the same −1 to +1 scale. Point-biserial r is also related to Cohen's d via r_pb = d / √(d² + (n₁ + n₂)² / (n₁n₂)), which means a significant point-biserial r implies the two groups differ in their means by a meaningful number of standard deviations.
Autocorrelation measures how a single time series correlates with a lagged copy of itself: r_k = Cov(Xₜ, Xₜ₋ₖ) / Var(X). In ML this matters everywhere — training loss curves, A/B test metrics measured daily, model monitoring dashboards are all time series. Autocorrelated observations violate the independence assumption baked into most hypothesis tests, inflating false-positive rates. For a full treatment, see the time series post in this series.
| Measure | What It Measures | Scale | Outlier Resistant? | Use When |
|---|---|---|---|---|
| Covariance | Joint variation in raw units | Unbounded | No | You need to feed it into another formula (e.g. regression slope) |
| Pearson r | Linear association, unit-free | −1 to +1 | No | Relationship is approximately linear, no severe outliers |
| Spearman ρ | Monotone association via ranks | −1 to +1 | Yes | Outliers present, or relationship is monotone but curved |
Pearson r cannot detect non-linear relationships. A value near zero does not rule out a strong curved or step-function relationship — always plot the data before reporting a correlation. The quadratic example above (r = 0.000 for val_loss = reg_offset²) is not contrived — this pattern appears whenever you push a regularization coefficient too high or too low.
Correlation is not causation. r = 0.948 between training_hours and val_accuracy does not mean more training causes better accuracy. A confound could produce the same correlation — for example, if longer training runs were also given larger datasets. Always examine the data-generating process before drawing causal conclusions.
Test Your Understanding
Level 1 — Conceptual: If you multiply every training_hours value by 60 (converting hours to minutes), what happens to the sample covariance? What happens to Pearson r? Explain why.
Level 2 — Applied: The covariance between two variables is −0.45. Their standard deviations are σₓ = 1.5 and σᵧ = 0.4. Compute Pearson r by hand and describe the strength and direction of the relationship.
Level 3 — Reasoning: A researcher reports r = 0.02 between learning_rate and final validation loss for 200 training runs. A colleague says "there's no relationship." What should the researcher check before agreeing?
Level 4 — Design: You want to know whether GPU memory usage correlates with inference latency across 50 model deployments, but three deployments have GPU memory usage 10× higher than the rest due to a configuration bug. Should you use Pearson r or Spearman ρ? What would happen to Pearson r if you included those three deployments versus excluded them?