Back to blog
← View series: machine learning

~/blog

Linear Regression OLS: The Normal Equation

Jun 25, 20265 min readBy Mohammed Vasim
Machine LearningAIData Science

Gradient descent finds the optimal weights iteratively. The normal equation finds them in one algebraic step. For small datasets, this is faster and exact — no learning rate to tune, no convergence to wait for. Understanding the derivation also reveals exactly when the normal equation fails and why Ridge regression fixes it.

Setting Up the Matrix System

We want to minimize:

Expand the product:

Take the gradient with respect to and set to zero:

Rearranging gives the normal equations:

If is invertible, the unique solution is:

The geometric interpretation: projects onto the column space of . The fitted values are the closest point in the column space to .

Anchor dataset (2-feature, design matrix with intercept column):

python
import numpy as np

X_aug = np.array([
    [1, 650,  2],
    [1, 850,  2],
    [1, 1100, 3],
    [1, 1400, 3],
    [1, 1600, 4],
    [1, 1900, 4]
], dtype=float)

y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

Computing

is , is , so is :

python
XtX = X_aug.T @ X_aug
print(XtX)
[[ 6. 7500. 18. ] [ 7500. 10985000. 23400.] [ 18. 23400. 62. ]]

Diagonal:

Off-diagonal:

Computing

is , is , so is :

python
Xty = X_aug.T @ y
print(Xty)
[ 1820. 2497000. 5510. ]

These are .

Solving the Normal Equations

Two approaches: direct inversion and LU decomposition. Always prefer the latter.

python
# Method 1: direct inversion (numerically unstable for ill-conditioned matrices)
w_inv = np.linalg.inv(XtX) @ Xty

# Method 2: np.linalg.solve (uses LU decomposition — preferred)
w_ols = np.linalg.solve(XtX, Xty)

print(f"w₀ = {w_ols[0]:.4f}")
print(f"w₁ = {w_ols[1]:.6f}")
print(f"w₂ = {w_ols[2]:.4f}")
w₀ = 58.3271 w₁ = 0.175000 w₂ = 9.6053

np.linalg.inv computes the full matrix inverse, then multiplies — two operations, twice the floating-point rounding error. np.linalg.solve uses LU decomposition to directly solve the linear system — more numerically stable, especially when is nearly singular.

Verify: Prediction and Residuals

sq_ftbedrooms
650258.33 + 0.175×650 + 9.61×2 = 191.1180−11.1
850258.33 + 0.175×850 + 9.61×2 = 224.6220−4.6
1100358.33 + 0.175×1100 + 9.61×3 = 280.4280−0.4
1400358.33 + 0.175×1400 + 9.61×3 = 333.23406.8
1600458.33 + 0.175×1600 + 9.61×4 = 376.1370−6.1
1900458.33 + 0.175×1900 + 9.61×4 = 408.543021.5
python
y_pred = X_aug @ w_ols
sse = ((y - y_pred) ** 2).sum()
print(f"SSE (OLS): {sse:.4f}")
SSE (OLS): 743.2191

This is larger than simple regression's SSE of 133.3 — not because multiple regression is worse, but because the near-multicollinearity ( between sqft and bedrooms) destabilizes the coefficients. The OLS solution is still mathematically optimal for these features, but the features themselves add little independent signal.

Why Exists — and When It Doesn't

exists if and only if has full column rank — no perfect linear dependence among the feature columns.

Three situations where is singular:

  1. More features than samples (): is but rank — underdetermined system.
  2. Perfect multicollinearity: feature A = feature B → the two columns are linearly dependent → singular.
  3. Zero-variance column: a feature with identical values for all samples has zero variance → column of constants (besides the bias) → dependent with the bias column.

Fix when singular: Ridge regression adds to :

Adding to the diagonal guarantees positive-definiteness and invertibility for any , regardless of multicollinearity.

Computational Complexity

OperationComplexity
Compute
Invert
Total

For samples and features: operations — hours of computation. Gradient descent costs per iteration — far better for large .

Normal Equation vs Gradient Descent

AspectNormal EquationGradient Descent
Learning rate Not neededMust be tuned
IterationsOne-shot (exact)Many iterations
Complexity for iterations
Large (millions)Slow ( computation)Fast (mini-batch)
Large (thousands)Very slow ( inversion)Manageable
Exact solutionYesApproximate (converges)
Feature scaling requiredNoYes (for speed)
MulticollinearityFailsConverges but unstable
Preferred when, large or large

OLS Formula Summary

  • Normal equations:
  • Closed form:
  • Numerically stable: use np.linalg.solve(XtX, Xty) over np.linalg.inv(XtX) @ Xty
  • Geometric interpretation: projects onto the column space of

The normal equation is the analytical solution to the OLS problem — it's not an approximation. For pure linear regression on small datasets (, ), it's the right choice. Beyond those thresholds, the inversion becomes the bottleneck.

The deeper limitation: the normal equation tells you the optimal weights for the MSE objective, but it can't tell you if those weights are statistically reliable. When features are nearly collinear, the coefficients are mathematically correct but practically unstable — small data changes cause large coefficient changes. That's the problem Ridge regularization addresses, not by changing the optimal solution, but by changing the objective.

Test Your Understanding

  1. Compute by hand for the 1-feature anchor (, design matrix with intercept column). Verify the diagonal entries are .

  2. The np.linalg.solve function uses LU decomposition. Why does this avoid the numerical instability of inv()? What does "ill-conditioned" mean for ?

  3. You have samples and features. What does look like, and why is undefined without Ridge?

  4. Adding to makes it invertible. What happens to the OLS solution as ? As ?

  5. Two datasets: Dataset A has , . Dataset B has , . For which would you use the normal equation, and for which gradient descent? Compute the approximate operation count for each.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment