← View series: machine learning
~/blog
Linear Regression OLS: The Normal Equation
Gradient descent finds the optimal weights iteratively. The normal equation finds them in one algebraic step. For small datasets, this is faster and exact — no learning rate to tune, no convergence to wait for. Understanding the derivation also reveals exactly when the normal equation fails and why Ridge regression fixes it.
Setting Up the Matrix System
We want to minimize:
Expand the product:
Take the gradient with respect to and set to zero:
Rearranging gives the normal equations:
If is invertible, the unique solution is:
The geometric interpretation: projects onto the column space of . The fitted values are the closest point in the column space to .
Anchor dataset (2-feature, design matrix with intercept column):
import numpy as np
X_aug = np.array([
[1, 650, 2],
[1, 850, 2],
[1, 1100, 3],
[1, 1400, 3],
[1, 1600, 4],
[1, 1900, 4]
], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)Computing
is , is , so is :
XtX = X_aug.T @ X_aug
print(XtX)[[ 6. 7500. 18. ]
[ 7500. 10985000. 23400.]
[ 18. 23400. 62. ]]
Diagonal:
Off-diagonal:
Computing
is , is , so is :
Xty = X_aug.T @ y
print(Xty)[ 1820. 2497000. 5510. ]
These are .
Solving the Normal Equations
Two approaches: direct inversion and LU decomposition. Always prefer the latter.
# Method 1: direct inversion (numerically unstable for ill-conditioned matrices)
w_inv = np.linalg.inv(XtX) @ Xty
# Method 2: np.linalg.solve (uses LU decomposition — preferred)
w_ols = np.linalg.solve(XtX, Xty)
print(f"w₀ = {w_ols[0]:.4f}")
print(f"w₁ = {w_ols[1]:.6f}")
print(f"w₂ = {w_ols[2]:.4f}")w₀ = 58.3271
w₁ = 0.175000
w₂ = 9.6053
np.linalg.inv computes the full matrix inverse, then multiplies — two operations, twice the floating-point rounding error. np.linalg.solve uses LU decomposition to directly solve the linear system — more numerically stable, especially when is nearly singular.
Verify: Prediction and Residuals
| sq_ft | bedrooms | |||
|---|---|---|---|---|
| 650 | 2 | 58.33 + 0.175×650 + 9.61×2 = 191.1 | 180 | −11.1 |
| 850 | 2 | 58.33 + 0.175×850 + 9.61×2 = 224.6 | 220 | −4.6 |
| 1100 | 3 | 58.33 + 0.175×1100 + 9.61×3 = 280.4 | 280 | −0.4 |
| 1400 | 3 | 58.33 + 0.175×1400 + 9.61×3 = 333.2 | 340 | 6.8 |
| 1600 | 4 | 58.33 + 0.175×1600 + 9.61×4 = 376.1 | 370 | −6.1 |
| 1900 | 4 | 58.33 + 0.175×1900 + 9.61×4 = 408.5 | 430 | 21.5 |
y_pred = X_aug @ w_ols
sse = ((y - y_pred) ** 2).sum()
print(f"SSE (OLS): {sse:.4f}")SSE (OLS): 743.2191
This is larger than simple regression's SSE of 133.3 — not because multiple regression is worse, but because the near-multicollinearity ( between sqft and bedrooms) destabilizes the coefficients. The OLS solution is still mathematically optimal for these features, but the features themselves add little independent signal.
Why Exists — and When It Doesn't
exists if and only if has full column rank — no perfect linear dependence among the feature columns.
Three situations where is singular:
- More features than samples (): is but rank — underdetermined system.
- Perfect multicollinearity: feature A = feature B → the two columns are linearly dependent → singular.
- Zero-variance column: a feature with identical values for all samples has zero variance → column of constants (besides the bias) → dependent with the bias column.
Fix when singular: Ridge regression adds to :
Adding to the diagonal guarantees positive-definiteness and invertibility for any , regardless of multicollinearity.
Computational Complexity
| Operation | Complexity |
|---|---|
| Compute | |
| Invert | |
| Total |
For samples and features: operations — hours of computation. Gradient descent costs per iteration — far better for large .
Normal Equation vs Gradient Descent
| Aspect | Normal Equation | Gradient Descent |
|---|---|---|
| Learning rate | Not needed | Must be tuned |
| Iterations | One-shot (exact) | Many iterations |
| Complexity | for iterations | |
| Large (millions) | Slow ( computation) | Fast (mini-batch) |
| Large (thousands) | Very slow ( inversion) | Manageable |
| Exact solution | Yes | Approximate (converges) |
| Feature scaling required | No | Yes (for speed) |
| Multicollinearity | Fails | Converges but unstable |
| Preferred when | , | large or large |
OLS Formula Summary
- Normal equations:
- Closed form:
- Numerically stable: use
np.linalg.solve(XtX, Xty)overnp.linalg.inv(XtX) @ Xty - Geometric interpretation: projects onto the column space of
Related Concepts and Honest Limitations
The normal equation is the analytical solution to the OLS problem — it's not an approximation. For pure linear regression on small datasets (, ), it's the right choice. Beyond those thresholds, the inversion becomes the bottleneck.
The deeper limitation: the normal equation tells you the optimal weights for the MSE objective, but it can't tell you if those weights are statistically reliable. When features are nearly collinear, the coefficients are mathematically correct but practically unstable — small data changes cause large coefficient changes. That's the problem Ridge regularization addresses, not by changing the optimal solution, but by changing the objective.
Test Your Understanding
-
Compute by hand for the 1-feature anchor (, design matrix with intercept column). Verify the diagonal entries are .
-
The
np.linalg.solvefunction uses LU decomposition. Why does this avoid the numerical instability ofinv()? What does "ill-conditioned" mean for ? -
You have samples and features. What does look like, and why is undefined without Ridge?
-
Adding to makes it invertible. What happens to the OLS solution as ? As ?
-
Two datasets: Dataset A has , . Dataset B has , . For which would you use the normal equation, and for which gradient descent? Compute the approximate operation count for each.