Back to blog
← View series: machine learning
Machine Learning

~/blog

PCA: Geometric and Math Intuition

Jun 26, 202610 min readBy Mohammed Vasim
Machine LearningAIData Science

Principal Component Analysis finds a new coordinate system for your data — one aligned with the directions of maximum spread. The axes (principal components) are rotated to capture as much variance as possible, in decreasing order.

This post builds the full geometric and mathematical picture from scratch — centering, covariance, variance maximization, projection, and reconstruction — all by hand on a 4-sample dataset. Post 04 covers eigendecomposition; post 05 covers the sklearn implementation.

Anchor: 4-sample dataset — hours studied vs exam score. Strong correlation means the data almost lies on a line, so PCA should compress it dramatically.

python
import numpy as np

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 7.0],
])
feature_names = ['hours_studied', 'exam_score']
print("X shape:", X.shape)
print("Correlation:", np.corrcoef(X[:, 0], X[:, 1])[0, 1].round(4))
X shape: (4, 2) Correlation: 0.9939

Nearly perfect correlation. In 2D, these 4 points almost lie on a straight line. PCA should find that line and express the data in 1 dimension.

What PCA Does (Geometric View)

PCA replaces the original axes (hours_studied, exam_score) with new axes (PC1, PC2) aligned to the data's natural orientation.

  • PC1 = the direction with the most variance (longest spread of the projected points)
  • PC2 = orthogonal to PC1, captures remaining variance
PCA Rotates Coordinate System to Maximum Variance <!-- Grid lines --> <line x1="60" y1="220" x2="440" y2="220" stroke="#e2e8f0" stroke-width="1"/> <line x1="60" y1="180" x2="440" y2="180" stroke="#e2e8f0" stroke-width="1"/> <line x1="60" y1="140" x2="440" y2="140" stroke="#e2e8f0" stroke-width="1"/> <line x1="60" y1="100" x2="440" y2="100" stroke="#e2e8f0" stroke-width="1"/> <line x1="60" y1="60" x2="440" y2="60" stroke="#e2e8f0" stroke-width="1"/> <line x1="140" y1="30" x2="140" y2="230" stroke="#e2e8f0" stroke-width="1"/> <line x1="220" y1="30" x2="220" y2="230" stroke="#e2e8f0" stroke-width="1"/> <line x1="300" y1="30" x2="300" y2="230" stroke="#e2e8f0" stroke-width="1"/> <line x1="380" y1="30" x2="380" y2="230" stroke="#e2e8f0" stroke-width="1"/> <!-- Original axes (gray) --> <line x1="60" y1="220" x2="440" y2="220" stroke="#94a3b8" stroke-width="1.5"/> <line x1="60" y1="220" x2="60" y2="30" stroke="#94a3b8" stroke-width="1.5"/> <text x="445" y="224" font-size="9" fill="#94a3b8">x₁</text> <text x="52" y="28" font-size="9" fill="#94a3b8" text-anchor="end">x₂</text> <!-- x axis tick labels --> <text x="140" y="233" text-anchor="middle" font-size="8" fill="#94a3b8">1</text> <text x="220" y="233" text-anchor="middle" font-size="8" fill="#94a3b8">2</text> <text x="300" y="233" text-anchor="middle" font-size="8" fill="#94a3b8">3</text> <text x="380" y="233" text-anchor="middle" font-size="8" fill="#94a3b8">4</text> <!-- y axis tick labels --> <text x="55" y="224" text-anchor="end" font-size="8" fill="#94a3b8">2</text> <text x="55" y="184" text-anchor="end" font-size="8" fill="#94a3b8">4</text> <text x="55" y="144" text-anchor="end" font-size="8" fill="#94a3b8">5</text> <text x="55" y="104" text-anchor="end" font-size="8" fill="#94a3b8">6</text> <text x="55" y="64" text-anchor="end" font-size="8" fill="#94a3b8">7</text> <!-- Data points: (1,2)→(140,220), (2,4)→(220,180), (3,5)→(300,160), (4,7)→(380,60) --> <!-- y scale: exam_score 2→220, 7→60, so 1 unit = 32px --> <circle cx="140" cy="220" r="5" fill="#3b82f6"/> <circle cx="220" cy="180" r="5" fill="#3b82f6"/> <circle cx="300" cy="160" r="5" fill="#3b82f6"/> <circle cx="380" cy="60" r="5" fill="#3b82f6"/> <!-- Center of data: (2.5, 4.5) → (260, 188) --> <circle cx="260" cy="188" r="3" fill="#f59e0b"/> <text x="267" y="186" font-size="8" fill="#f59e0b">mean</text> <!-- PC1 axis: diagonal through center, in blue --> <!-- Direction roughly [0.45, 0.89] (actual eigenvector computed in post 04) --> <!-- Line from (60, 220+144*0.89/0.45) to (420, 220-360*0.89/0.45)... use simpler visual --> <line x1="80" y1="244" x2="420" y2="94" stroke="#3b82f6" stroke-width="2"/> <text x="425" y="93" font-size="9" fill="#3b82f6" font-weight="bold">PC1</text> <!-- PC2 axis: perpendicular to PC1 through center --> <line x1="200" y1="105" x2="310" y2="262" stroke="#f59e0b" stroke-width="1.5" stroke-dasharray="5,3"/> <text x="316" y="262" font-size="9" fill="#f59e0b" font-weight="bold">PC2</text> <!-- Projections from points to PC1 (dashed) --> <line x1="140" y1="220" x2="120" y2="236" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <line x1="220" y1="180" x2="208" y2="187" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <line x1="300" y1="160" x2="296" y2="163" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <line x1="380" y1="60" x2="390" y2="54" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <text x="240" y="250" text-anchor="middle" font-size="8" fill="#64748b">PC1 captures maximum spread; PC2 is orthogonal (minimum spread)</text>

Step 1: Center the Data

PCA measures variance, which requires zero-mean data. Subtract the column means:

python
X_mean = X.mean(axis=0)
X_c = X - X_mean
print(f"Column means: {X_mean}")
print("\nCentered data X_c:")
for i, (orig, cent) in enumerate(zip(X, X_c)):
    print(f"  Sample {i+1}: {orig} → {cent}")
Column means: [2.5 4.5] Centered data X_c: Sample 1: [1. 2.] → [-1.5 -2.5] Sample 2: [2. 4.] → [-0.5 -0.5] Sample 3: [3. 5.] → [ 0.5 0.5] Sample 4: [4. 7.] → [ 1.5 2.5]
Samplex₁x₂x₁_cx₂_c
11.02.0−1.5−2.5
22.04.0−0.5−0.5
33.05.0+0.5+0.5
44.07.0+1.5+2.5

Step 2: Compute the Covariance Matrix

Computing each entry manually:

python
C = np.cov(X_c, rowvar=False)   # rowvar=False: each column is a variable
print("Covariance matrix C:")
print(C.round(4))
Covariance matrix C: [[1.6667 2.6667] [2.6667 4.3333]]

Reading the matrix:

  • C[0,0] = 1.667 = Var(x₁_c), std ≈ 1.29
  • C[1,1] = 4.333 = Var(x₂_c), std ≈ 2.08
  • C[0,1] = 2.667 = Cov(x₁_c, x₂_c) — positive and large → features move together

The correlation coefficient confirms the near-perfect linear relationship:

Large off-diagonal entries mean features are redundant — they contain overlapping information. PCA will compress this redundancy into fewer dimensions.

The Variance Maximization Objective

PCA finds the unit vector u (the principal component direction) that maximizes the variance of projected scores:

For with :

Comparing three candidate directions:

DirectionuuᵀCu (captured variance)
Along x₁[1, 0]
Along x₂[0, 1]
Diagonal[1/√2, 1/√2]

The diagonal direction captures more variance than either original axis. This is the key insight: rotating the coordinate system can reveal structure the original axes hide. Eigendecomposition (post 04) finds the optimal rotation.

Projection onto a Direction

Once we choose direction u, projecting data point onto u gives a scalar score:

For u₃ = [1/√2, 1/√2]:

Samplex₁_cx₂_cScore = (x₁_c + x₂_c)/√2
1−1.5−2.5(−4.0)/1.414 = −2.828
2−0.5−0.5(−1.0)/1.414 = −0.707
3+0.5+0.5(+1.0)/1.414 = +0.707
4+1.5+2.5(+4.0)/1.414 = +2.828

Verifying that the variance of scores equals :

python
u3 = np.array([1/np.sqrt(2), 1/np.sqrt(2)])
scores = X_c @ u3
print("Scores:", scores.round(4))
print("Var(scores):", scores.var(ddof=1).round(4))
print("u^T C u:    ", (u3 @ C @ u3).round(4))
Scores: [-2.8284 -0.7071 0.7071 2.8284] Var(scores): 5.6667 u^T C u: 5.6667

The two quantities are identical — projecting onto u and computing score variance is exactly the same as computing .

Reconstruction from Projection

Keeping only PC1 means approximating each original point from its score:

For Sample 1 (score = −2.828):

True centered value: [−1.5, −2.5]. Reconstruction error:

For Sample 4 (score = +2.828):

True: [+1.5, +2.5]. Error = (symmetric).

python
reconstructed = np.outer(scores, u3)
errors = np.sqrt(((X_c - reconstructed)**2).sum(axis=1))
print("Reconstruction errors:", errors.round(4))
print("Mean squared reconstruction error:", (errors**2).mean().round(4))
Reconstruction errors: [0.7071 0.1768 0.1768 0.7071] Mean squared reconstruction error: 0.25

This reconstruction error is exactly the information lost by keeping only 1 PC. The optimal PC1 (from eigendecomposition) minimizes this reconstruction error — equivalently, maximizes captured variance.

Before PCA (original axes) vs After PCA (rotated axes) <!-- LEFT: Original centered data --> <text x="135" y="30" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">Original: X_c</text> <!-- Axes --> <line x1="40" y1="170" x2="240" y2="170" stroke="#94a3b8" stroke-width="1.5"/> <line x1="140" y1="30" x2="140" y2="180" stroke="#94a3b8" stroke-width="1.5"/> <text x="243" y="173" font-size="8" fill="#94a3b8">x₁</text> <text x="140" y="27" text-anchor="middle" font-size="8" fill="#94a3b8">x₂</text> <!-- 4 points: x₁_c=[-1.5,-0.5,0.5,1.5], x₂_c=[-2.5,-0.5,0.5,2.5] --> <!-- Center at (140,100). Scale: 1 unit = 25px --> <circle cx="102" cy="162" r="4" fill="#3b82f6"/> <!-- (-1.5,-2.5) --> <circle cx="127" cy="112" r="4" fill="#3b82f6"/> <!-- (-0.5,-0.5) --> <circle cx="153" cy="88" r="4" fill="#3b82f6"/> <!-- (0.5,0.5) --> <circle cx="178" cy="38" r="4" fill="#3b82f6"/> <!-- (1.5,2.5) --> <!-- Text --> <text x="70" y="182" font-size="8" fill="#64748b">Var(x₁_c)=1.667, Var(x₂_c)=4.333</text> <text x="70" y="193" font-size="8" fill="#64748b">r=0.992 — features are redundant</text> <!-- RIGHT: Projected data in PC space --> <text x="405" y="30" text-anchor="middle" font-size="9" font-weight="bold" fill="#334155">After PCA rotation</text> <!-- Axes --> <line x1="300" y1="170" x2="500" y2="170" stroke="#94a3b8" stroke-width="1.5"/> <line x1="400" y1="30" x2="400" y2="180" stroke="#94a3b8" stroke-width="1.5"/> <text x="503" y="173" font-size="8" fill="#3b82f6">PC1</text> <text x="400" y="27" text-anchor="middle" font-size="8" fill="#f59e0b">PC2</text> <!-- Scores along PC1 (horizontal): [-2.828, -0.707, 0.707, 2.828], scale 30px/unit --> <!-- Center at (400, 120). PC2 variance is tiny (~0.06 for true PC1) — use u3 approximation: PC2 scores all near 0 --> <!-- PC2 scores for u3 direction (perpendicular [-1/√2, 1/√2]): --> <!-- s2 = x_c @ [-1/√2, 1/√2] = (x₂_c - x₁_c)/√2 --> <!-- s1: (-2.5+1.5)/√2=-0.707, s2: (-0.5+0.5)/√2=0, s3: (0.5-0.5)/√2=0, s4: (2.5-1.5)/√2=0.707 --> <!-- PC1 scores ×30/2.8=10.7 px/unit. At center: (-2.828→316, -0.707→393, 0.707→407, 2.828→430) --> <circle cx="315" cy="127" r="4" fill="#3b82f6"/> <!-- (-2.828, -0.707): pc1=-2.828*10=316, pc2=-0.707*10=127 --> <circle cx="393" cy="120" r="4" fill="#3b82f6"/> <!-- (-0.707, 0): 393, 120 --> <circle cx="407" cy="120" r="4" fill="#3b82f6"/> <!-- (0.707, 0): 407, 120 --> <circle cx="490" cy="113" r="4" fill="#3b82f6"/> <!-- (2.828, 0.707): 490, 113 --> <!-- Var label --> <text x="310" y="152" font-size="8" fill="#3b82f6">Var(PC1) ≈ 5.94</text> <text x="360" y="162" font-size="8" fill="#f59e0b">Var(PC2) ≈ 0.06</text> <text x="305" y="182" font-size="8" fill="#64748b">PC1 captures 99% of variance</text> <text x="305" y="193" font-size="8" fill="#64748b">PC2 ≈ noise → can be dropped</text>

PCA is NOT the Same as Linear Regression

Both PCA and linear regression fit a line through the data, but they minimize different things:

  • PCA minimizes perpendicular (orthogonal) distance from each point to the line
  • Linear Regression minimizes vertical distance (residuals in y direction)
PCA vs Linear Regression: Different Objectives <!-- Axes --> <line x1="60" y1="190" x2="440" y2="190" stroke="#94a3b8" stroke-width="1.5"/> <line x1="60" y1="190" x2="60" y2="30" stroke="#94a3b8" stroke-width="1.5"/> <text x="445" y="194" font-size="9" fill="#94a3b8">x₁</text> <text x="55" y="28" text-anchor="end" font-size="9" fill="#94a3b8">x₂</text> <!-- 4 data points: (1,2)→(140,182), (2,4)→(220,158), (3,5)→(300,147), (4,7)→(380,82) --> <!-- Scale: x:80px/unit, y:32px/unit starting from (60,190) at (0,0)+(x) means subtract? --> <!-- Use: cx = 60+x*80, cy = 190-y*20 --> <!-- (1,2): 140, 150. (2,4): 220,110. (3,5): 300,90. (4,7): 380,50 --> <circle cx="140" cy="150" r="5" fill="#334155"/> <circle cx="220" cy="110" r="5" fill="#334155"/> <circle cx="300" cy="90" r="5" fill="#334155"/> <circle cx="380" cy="50" r="5" fill="#334155"/> <!-- PCA line (blue): minimizes perpendicular distance, slope ≈ 1.78 (eigenvector direction) --> <!-- Through center (2.5,4.5)=(260, 100). Slope steep: dy/dx≈1.78 → for 100px in x: 178px in y --> <line x1="90" y1="174" x2="430" y2="53" stroke="#3b82f6" stroke-width="2"/> <text x="432" y="52" font-size="9" fill="#3b82f6" font-weight="bold">PCA</text> <text x="432" y="63" font-size="7.5" fill="#3b82f6">(⊥ distance)</text> <!-- Regression line (red): slightly different slope, minimizes vertical residuals --> <!-- OLS slope = Cov(x1,x2)/Var(x1) = 2.667/1.667 = 1.6 → y=1.6x+0.5 → (1,2.1),(4,6.9) --> <line x1="90" y1="181" x2="430" y2="57" stroke="#ef4444" stroke-width="2" stroke-dasharray="5,3"/> <text x="432" y="75" font-size="9" fill="#ef4444" font-weight="bold">OLS</text> <text x="432" y="86" font-size="7.5" fill="#ef4444">(↕ residual)</text> <!-- Perpendicular arrows from points to PCA line (blue dotted) --> <line x1="140" y1="150" x2="115" y2="165" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <line x1="380" y1="50" x2="402" y2="44" stroke="#93c5fd" stroke-width="1" stroke-dasharray="3,2"/> <!-- Vertical arrows from points to regression line (red dotted) --> <line x1="140" y1="150" x2="140" y2="178" stroke="#fca5a5" stroke-width="1" stroke-dasharray="3,2"/> <line x1="380" y1="50" x2="380" y2="59" stroke="#fca5a5" stroke-width="1" stroke-dasharray="3,2"/> <text x="240" y="208" text-anchor="middle" font-size="8" fill="#64748b">Both lines fit the same data but minimize different errors → different slopes</text>

For this dataset the lines look similar (near-perfect correlation makes them nearly identical). But in general — when the x-axis has much more spread than the y-axis, or when neither axis is clearly "input" vs "output" — the two lines diverge significantly.

The distinction matters for interpretation:

  • Use regression when one variable is a prediction target
  • Use PCA when all variables are inputs and you want to compress the space without designating one as special

Test Your Understanding

  1. The covariance matrix for this dataset has off-diagonal entry C[0,1]=2.667. If you standardize each feature to have unit variance before computing C, what would the off-diagonal entries become? What matrix would you get, and what is it called?

  2. We compared three directions and found uᵀCu = 1.667, 4.333, 5.667. The true PC1 captured by eigendecomposition has uᵀCu ≈ 5.94. Why can't uᵀCu exceed 5.94 for any unit vector — and what is the mathematical reason that 5.94 is the maximum?

  3. Reconstruction error for sample 1 was 0.707 using u₃=[1/√2, 1/√2]. The true PC1 (eigenvector) would give a smaller reconstruction error. What would the reconstruction error be if you used the true PC1 for sample 1, given that Var(PC1)≈5.94 and Var(PC2)≈0.06? Show the calculation.

  4. PCA minimizes perpendicular distance while linear regression minimizes vertical distance. If you rotate the axes 90 degrees (swap x₁ and x₂), the regression line changes slope but the PCA line stays the same. Why? What property of PCA makes it invariant to which variable you call "input" vs "output"?

  5. The covariance matrix C is 2×2 for 2D data and 64×64 for the digits dataset. For a dataset with 10,000 genes and 200 patients (n=200, d=10,000), what is the shape of the covariance matrix? Can you invert it? What does this imply for PCA computation when n << d?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment