Back to blog
← View series: statistics

~/blog

Covariance and Correlation

Jun 14, 202619 min readBy Mohammed Vasim
StatisticsMathData Science

Mean and standard deviation each describe a single variable. They tell you where values cluster and how spread out they are — but they say nothing about whether two variables move together. When you want to know whether longer training runs produce better validation accuracy, or whether higher learning rates hurt convergence, a one-variable summary cannot answer that question. You need a measure of joint variation.

Covariance is that measure. Pearson's correlation coefficient is covariance with the units stripped out so you can compare relationships across different pairs of variables.

The Anchor Dataset

Anchor exception: Covariance and correlation require two paired variables, so the series-wide univariate CV accuracy dataset cannot demonstrate the concept. This post uses a bivariate dataset instead.

Seven training runs, each logged with training duration and the resulting validation accuracy:

python
training_hours = [1, 2, 3, 4, 5, 6, 7]
val_accuracy   = [0.61, 0.72, 0.78, 0.83, 0.87, 0.88, 0.90]

More training generally helps here — but by how much, and how consistently? Every formula, SVG, and code block that follows uses these exact values.

Phase 1 — Compute the Means

Before measuring joint variation, you need both variables centered. The center of each variable is its mean.

Mean of training_hours:

x̄ = (1 + 2 + 3 + 4 + 5 + 6 + 7) / 7 = 28 / 7 = 4.0 hours

Mean of val_accuracy:

ȳ = (0.61 + 0.72 + 0.78 + 0.83 + 0.87 + 0.88 + 0.90) / 7 = 5.59 / 7 = 0.799

Phase 1 — Means training_hours 1 2 3 4 5 6 7 x̄ = 4.0 val_accuracy 0.61 0.72 0.78 0.83 0.87 0.88 0.90 ȳ = 0.799 Both variables centered at their means — ready for deviation pairs

Phase 2 — Compute the Deviation Pairs

Covariance works by asking, for each observation: did both variables deviate from their means in the same direction? Compute the deviation for each variable at each point:

Runxᵢyᵢxᵢ − x̄yᵢ − ȳ(xᵢ − x̄)(yᵢ − ȳ)
110.61−3.000−0.189+0.566
220.72−2.000−0.079+0.157
330.78−1.000−0.019+0.019
440.830.000+0.0310.000
550.87+1.000+0.071+0.071
660.88+2.000+0.081+0.163
770.90+3.000+0.101+0.303

Every product is positive. Run 1 spent fewer hours than average and achieved lower accuracy than average — both deviations negative, their product positive. Run 7 trained longer and scored higher — both deviations positive, product positive. That consistent sign alignment is what covariance captures.

Phase 2 — Deviation Quadrants training_hours val_accuracy 0.60 0.70 0.80 0.90 1 2 3 4 5 6 7 x̄=4 ȳ=0.799 + +

The dashed amber lines mark x̄ and ȳ. Every point sits in the lower-left quadrant (both negative) or the upper-right quadrant (both positive). No point falls in the upper-left or lower-right. That geometric picture is exactly what a large positive covariance looks like.

Phase 3 — Compute the Sample Covariance

Sum the products from Phase 2, then divide by n − 1:

Population covariance (if these 7 runs were the entire population):

Cov(X,Y) = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / N

Sample covariance (treating these runs as a sample of all possible training conditions):

Cov(X,Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

The anchor uses sample covariance — we are not exhaustively cataloging every possible training configuration, we sampled 7 runs. Dividing by n − 1 rather than n corrects for the fact that the sample mean is always closer to the sample values than the true population mean would be, which would otherwise cause us to underestimate the true covariance.

Substituting the values:

Cov = (0.566 + 0.157 + 0.019 + 0.000 + 0.071 + 0.163 + 0.303) / (7 − 1)

Cov = 1.279 / 6 = 0.213

Phase 3 — Deviation Products (bars) → Covariance Run number 0 0.2 0.4 0.6 0.566 1 0.157 2 0.019 3 0.000 4 0.071 5 0.163 6 0.303 7 Σ = 1.279 ÷ 6 = 0.213

This result, 0.213, means that on average each one-hour increase in training time is accompanied by a 0.213 increase in validation accuracy — in the joint unit of hours × accuracy. That last phrase is the problem.

The units problem: covariance has units of (hours × accuracy). If you had measured training time in minutes instead of hours, the covariance would be 60 × 0.213 = 12.78 — a completely different number for an identical relationship. You cannot compare a covariance of 0.213 on one pair of variables to a covariance of 3.7 on another pair. The raw number is uninterpretable without knowing the scales. This is exactly why Pearson's correlation coefficient exists.

Phase 4 — Normalize to Pearson r

Pearson's r divides covariance by the product of the two standard deviations. This removes the units and scales the result to a fixed range of −1 to +1.

r = Cov(X,Y) / (σₓ · σᵧ)

Why σₓ · σᵧ is the right denominator: σₓ · σᵧ is the maximum possible absolute covariance for two variables with those standard deviations — the covariance you would get if the relationship were perfectly linear (r = ±1). Dividing by σₓ · σᵧ normalizes the covariance to a fraction of its maximum, which is why r is always in [−1, 1]. The formal guarantee comes from the Cauchy-Schwarz inequality: |Cov(X,Y)|² ≤ Var(X) · Var(Y), which means |Cov| ≤ σₓ · σᵧ.

Algebraic bridge — the two forms are identical:

Start with the covariance-divided-by-SDs form and substitute the sample variance formulas:

r = Cov(X,Y) / (σₓ · σᵧ) = [Σ(xᵢ−x̄)(yᵢ−ȳ) / (n−1)] ───────────────────────────────────────────────── [√(Σ(xᵢ−x̄)²/(n−1))] × [√(Σ(yᵢ−ȳ)²/(n−1))] = [Σ(xᵢ−x̄)(yᵢ−ȳ) / (n−1)] ──────────────────────────────────────────── [1/(n−1)] × √[Σ(xᵢ−x̄)² × Σ(yᵢ−ȳ)²] = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)² × Σ(yᵢ−ȳ)²]

The (n−1) in the numerator and denominator cancel. Both forms are identical — whether you compute covariance first or go straight to the direct formula, you get the same r.

Compute σₓ:

Σ(xᵢ − x̄)² = (−3)² + (−2)² + (−1)² + 0² + 1² + 2² + 3² = 9 + 4 + 1 + 0 + 1 + 4 + 9 = 28

σₓ = √(28 / 6) = √4.667 = 2.160

Compute σᵧ:

Σ(yᵢ − ȳ)² = (−0.189)² + (−0.079)² + (−0.019)² + (0.031)² + (0.071)² + (0.081)² + (0.101)²

= 0.03572 + 0.00624 + 0.00036 + 0.00096 + 0.00504 + 0.00656 + 0.01020 = 0.06509

σᵧ = √(0.06509 / 6) = √0.01085 = 0.1042

Compute r:

r = 0.213 / (2.160 × 0.1042) = 0.213 / 0.2251 = 0.948

The same result follows directly from the raw deviations without going through covariance first:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

r = 1.279 / √(28 × 0.06509) = 1.279 / √1.8225 = 1.279 / 1.3500 = 0.948

Both routes produce the same r because dividing numerator and denominator by (n−1) cancels out, leaving the formula structurally identical.

Phase 4 — Pearson r = 0.948 training_hours val_accuracy 0.60 0.70 0.80 0.90 1.00 1 2 3 4 5 6 7 0.61 0.72 0.78 0.83 0.87 0.88 0.90 r = 0.948 very strong positive

Interpretation Scale

r of 0.948 sits well above the threshold for a very strong positive relationship. As a practical guide:

| |r| range | Strength | Example in ML | |------------|---------|---------| | 0.90 − 1.00 | Very strong | Training steps vs training loss on a stable run | | 0.70 − 0.89 | Strong | Dataset size vs model accuracy (common finding) | | 0.50 − 0.69 | Moderate | Regularization strength vs generalization gap | | 0.00 − 0.49 | Weak to none | Random seed vs final accuracy |

At r = 0.948, the relationship between training duration and validation accuracy in this dataset is very strong. Adding one more hour of training reliably moves accuracy upward — though the diminishing returns are visible: run 1 to run 2 gained 0.11, while run 6 to run 7 gained only 0.02.

Trace Table

PhaseFormulaValuesResult
Mean of XΣxᵢ / n(1+2+3+4+5+6+7) / 74.000
Mean of YΣyᵢ / n(0.61+…+0.90) / 70.799
Sum of productsΣ(xᵢ−x̄)(yᵢ−ȳ)0.566+0.157+…+0.3031.279
Sample covarianceΣ products / (n−1)1.279 / 60.213
σₓ√[Σ(xᵢ−x̄)² / (n−1)]√(28/6)2.160
σᵧ√[Σ(yᵢ−ȳ)² / (n−1)]√(0.06509/6)0.104
Pearson rCov / (σₓ·σᵧ)0.213 / (2.160 × 0.104)0.948

Code

python
import math

training_hours = [1, 2, 3, 4, 5, 6, 7]
val_accuracy   = [0.61, 0.72, 0.78, 0.83, 0.87, 0.88, 0.90]

n = len(training_hours)
x_mean = sum(training_hours) / n
y_mean = sum(val_accuracy) / n

deviations_x = [x - x_mean for x in training_hours]
deviations_y = [y - y_mean for y in val_accuracy]

sum_products = sum(dx * dy for dx, dy in zip(deviations_x, deviations_y))
cov = sum_products / (n - 1)

var_x = sum(dx**2 for dx in deviations_x) / (n - 1)
var_y = sum(dy**2 for dy in deviations_y) / (n - 1)
sx = math.sqrt(var_x)
sy = math.sqrt(var_y)

r = cov / (sx * sy)

print(f"x̄ = {x_mean:.3f}, ȳ = {y_mean:.3f}")
print(f"Sample covariance: {cov:.4f}")
print(f"σₓ = {sx:.4f}, σᵧ = {sy:.4f}")
print(f"Pearson r = {r:.4f}")
text
x̄ = 4.000, ȳ = 0.799
Sample covariance: 0.2133
σₓ = 2.1602, σᵧ = 0.1042
Pearson r = 0.9482

Edge Cases

r = +1.0 — Perfect Positive Linear Relationship

Every point lies exactly on a line with a positive slope. For y = 2x + 1 with x = [1, 2, 3, 4, 5]:

python
import math

x = [1, 2, 3, 4, 5]
y = [2*xi + 1 for xi in x]
n = len(x)
xm = sum(x)/n; ym = sum(y)/n
dx = [xi - xm for xi in x]; dy = [yi - ym for yi in y]
cov = sum(a*b for a,b in zip(dx,dy))/(n-1)
sx = math.sqrt(sum(d**2 for d in dx)/(n-1))
sy = math.sqrt(sum(d**2 for d in dy)/(n-1))
print(f"r = {cov/(sx*sy):.4f}")
text
r = 1.0000

r = −1.0 — Perfect Negative Linear Relationship

Every point lies on a line with a negative slope. For y = -2x + 11:

python
y_neg = [-2*xi + 11 for xi in x]
dy_neg = [yi - sum(y_neg)/n for yi in y_neg]
cov_neg = sum(a*b for a,b in zip(dx,dy_neg))/(n-1)
sy_neg = math.sqrt(sum(d**2 for d in dy_neg)/(n-1))
print(f"r = {cov_neg/(sx*sy_neg):.4f}")
text
r = -1.0000

These perfect cases never appear in real data — real measurements have noise. But they define the scale: +1 and −1 are the absolute upper and lower bounds. Any real dataset with outliers or noise will produce |r| < 1.

r = 0 Does Not Mean No Relationship

This is the most consequential thing to understand about Pearson r. Consider seven model evaluations where the input is the regularization offset from zero:

python
reg_offset = [-3, -2, -1, 0, 1, 2, 3]
val_loss    = [9,   4,  1, 0, 1, 4, 9]

n = len(reg_offset)
xm = sum(reg_offset) / n
ym = sum(val_loss) / n
dx = [x - xm for x in reg_offset]
dy = [y - ym for y in val_loss]
cov_q = sum(a*b for a,b in zip(dx,dy)) / (n-1)
sx_q = math.sqrt(sum(d**2 for d in dx) / (n-1))
sy_q = math.sqrt(sum(d**2 for d in dy) / (n-1))
r_q = cov_q / (sx_q * sy_q)
print(f"r = {r_q:.4f}")
text
r = 0.0000

Pearson r is exactly 0, but the relationship is val_loss = reg_offset² — perfectly determined. The positive products from the right side (both variables increase) exactly cancel the negative products from the left side (one increases while the other decreases), driving the sum to zero.

Edge Case — r = 0 Yet Perfectly Determined reg_offset 0 3 6 9 -3 -2 -1 0 1 2 3 x̄=0 ȳ≈4.57 9 4 1 0 1 4 9 r = 0.000

Red points are in the negative-product quadrants (upper-left), amber are in the positive-product quadrants (upper-right and lower-left). The symmetry is exact: the negatives and positives cancel. Pearson r only detects linear association.

Outlier Sensitivity

A single anomalous data point can reverse the correlation entirely. Add one run where a very long session (12 hours) produced unexpectedly low accuracy (0.50, perhaps from overfitting):

python
x_with_outlier = training_hours + [12]
y_with_outlier = val_accuracy   + [0.50]

n2 = len(x_with_outlier)
xm2 = sum(x_with_outlier) / n2
ym2 = sum(y_with_outlier) / n2
dx2 = [x - xm2 for x in x_with_outlier]
dy2 = [y - ym2 for y in y_with_outlier]
cov2 = sum(a*b for a,b in zip(dx2,dy2)) / (n2-1)
sx2 = math.sqrt(sum(d**2 for d in dx2) / (n2-1))
sy2 = math.sqrt(sum(d**2 for d in dy2) / (n2-1))
r2 = cov2 / (sx2 * sy2)
print(f"r with outlier = {r2:.4f}")
text
r with outlier = -0.2336

One point flipped the correlation from +0.948 to −0.234. The outlier at (12, 0.50) sits in the upper-left quadrant relative to the new means — it contributes a large negative product that overwhelms the six consistent positive products from the original data.

Covariance and Pearson r build on variance and standard deviation, which you need to understand before these measures make sense. They unlock linear regression: the slope coefficient in simple linear regression is exactly Cov(X,Y) / Var(X), and the R² statistic is r². Understanding r makes regression coefficients interpretable.

Spearman's ρ (rank correlation) is the robust alternative when your data has outliers or the relationship is monotone but not linear. Instead of computing r on the raw values, Spearman first replaces each value with its rank, then runs the standard Pearson formula on those ranks. The shortcut formula is:

ρ = 1 − (6 Σdᵢ²) / (n(n² − 1))

where dᵢ is the difference in ranks for the i-th pair. On the anchor, both training_hours and val_accuracy are already in strictly increasing order, so every rank difference dᵢ = 0:

ρ = 1 − (6 × 0) / (7 × 48) = 1 − 0 = 1.000

Pearson r was 0.948; Spearman ρ is 1.000. The gap tells you the relationship is monotone but not perfectly linear — the accuracy gains per extra hour diminish over time (a concave curve), which Pearson r penalizes slightly because it expects a straight line. When Pearson r and Spearman ρ differ significantly, the relationship is either non-linear or influenced by outliers.

python
ranks_x = [1, 2, 3, 4, 5, 6, 7]
ranks_y = [1, 2, 3, 4, 5, 6, 7]
d_sq = sum((rx - ry)**2 for rx, ry in zip(ranks_x, ranks_y))
rho = 1 - (6 * d_sq) / (n * (n**2 - 1))
print(f"Spearman rho = {rho:.4f}")
text
Spearman rho = 1.0000

Kendall's τ (concordance) takes a different approach. It counts concordant pairs — pairs where both xᵢ > xⱼ and yᵢ > yⱼ — minus discordant pairs, then divides by the total number of pairs. On the anchor all 21 pairs are concordant (no inversions exist in a strictly increasing sequence), giving τ = (21 − 0) / 21 = 1.000. Kendall's τ is more interpretable than Spearman for small samples: τ = 0.6 means 60% of all pairwise comparisons agree in direction. Use it when you have ordinal data, small samples, or want a probability interpretation of the relationship's direction.

python
concordant = discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        xi, yi = training_hours[i], val_accuracy[i]
        xj, yj = training_hours[j], val_accuracy[j]
        if (xi - xj) * (yi - yj) > 0:
            concordant += 1
        elif (xi - xj) * (yi - yj) < 0:
            discordant += 1
tau = (concordant - discordant) / (n * (n - 1) // 2)
print(f"Kendall tau = {tau:.4f}  (concordant={concordant}, discordant={discordant})")
text
Kendall tau = 1.0000  (concordant=21, discordant=0)

Point-biserial r handles the case where one variable is continuous and the other is binary (0/1). It is mathematically identical to Pearson r computed on a binary-coded variable — no special formula is needed. The ML use case is common: correlate a continuous feature (e.g. training_hours) with a binary label (e.g. did_converge ∈ {0, 1}). The result sits on the same −1 to +1 scale. Point-biserial r is also related to Cohen's d via r_pb = d / √(d² + (n₁ + n₂)² / (n₁n₂)), which means a significant point-biserial r implies the two groups differ in their means by a meaningful number of standard deviations.

Autocorrelation measures how a single time series correlates with a lagged copy of itself: r_k = Cov(Xₜ, Xₜ₋ₖ) / Var(X). In ML this matters everywhere — training loss curves, A/B test metrics measured daily, model monitoring dashboards are all time series. Autocorrelated observations violate the independence assumption baked into most hypothesis tests, inflating false-positive rates. For a full treatment, see the time series post in this series.

MeasureWhat It MeasuresScaleOutlier Resistant?Use When
CovarianceJoint variation in raw unitsUnboundedNoYou need to feed it into another formula (e.g. regression slope)
Pearson rLinear association, unit-free−1 to +1NoRelationship is approximately linear, no severe outliers
Spearman ρMonotone association via ranks−1 to +1YesOutliers present, or relationship is monotone but curved

Pearson r cannot detect non-linear relationships. A value near zero does not rule out a strong curved or step-function relationship — always plot the data before reporting a correlation. The quadratic example above (r = 0.000 for val_loss = reg_offset²) is not contrived — this pattern appears whenever you push a regularization coefficient too high or too low.

Correlation is not causation. r = 0.948 between training_hours and val_accuracy does not mean more training causes better accuracy. A confound could produce the same correlation — for example, if longer training runs were also given larger datasets. Always examine the data-generating process before drawing causal conclusions.

Test Your Understanding

Level 1 — Conceptual: If you multiply every training_hours value by 60 (converting hours to minutes), what happens to the sample covariance? What happens to Pearson r? Explain why.

Level 2 — Applied: The covariance between two variables is −0.45. Their standard deviations are σₓ = 1.5 and σᵧ = 0.4. Compute Pearson r by hand and describe the strength and direction of the relationship.

Level 3 — Reasoning: A researcher reports r = 0.02 between learning_rate and final validation loss for 200 training runs. A colleague says "there's no relationship." What should the researcher check before agreeing?

Level 4 — Design: You want to know whether GPU memory usage correlates with inference latency across 50 model deployments, but three deployments have GPU memory usage 10× higher than the rest due to a configuration bug. Should you use Pearson r or Spearman ρ? What would happen to Pearson r if you included those three deployments versus excluded them?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment