Linear Regression OLS: The Normal Equation

Jun 25, 2026 • 5 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

Gradient descent finds the optimal weights iteratively. The normal equation finds them in one algebraic step. For small datasets, this is faster and exact — no learning rate to tune, no convergence to wait for. Understanding the derivation also reveals exactly when the normal equation fails and why Ridge regression fixes it.

Setting Up the Matrix System

We want to minimize:

$J (w) = ∥ y - X w ∥^{2} = (y - X w)^{⊤} (y - X w)$

Expand the product:

$J (w) = y^{⊤} y - 2 w^{⊤} X^{⊤} y + w^{⊤} X^{⊤} X w$

Take the gradient with respect to $w$ and set to zero:

$\frac{\partial J}{\partial w} = - 2 X^{⊤} y + 2 X^{⊤} X w = 0$
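The expansion and the gradient each rely on a step that is easy to skip. A sketch of the intermediate algebra, using only the symbols defined above (here $A$ and $b$ are placeholder names for $X^{⊤} X$ and $X^{⊤} y$):

$(y - X w)^{⊤} (y - X w) = y^{⊤} y - y^{⊤} X w - w^{⊤} X^{⊤} y + w^{⊤} X^{⊤} X w$

The two middle terms are scalars and transposes of each other, so $y^{⊤} X w = w^{⊤} X^{⊤} y$ and they combine into $- 2 w^{⊤} X^{⊤} y$. The gradient then follows from two standard identities, the second of which needs $A$ to be symmetric:

$\frac{\partial}{\partial w} (w^{⊤} b) = b , \quad \frac{\partial}{\partial w} (w^{⊤} A w) = 2 A w$

With $b = X^{⊤} y$ and $A = X^{⊤} X$ (symmetric by construction), this gives $- 2 X^{⊤} y + 2 X^{⊤} X w$.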

Rearranging gives the normal equations:

$X^{⊤} X w = X^{⊤} y$

If $X^{⊤} X$ is invertible, the unique solution is:

$w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$

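The geometric interpretation: the fitted values $X w^{*}$ are the orthogonal projection of $y$ onto the column space of $X$, which makes them the closest point in that space to $y$. The weights $w^{*}$ are the coordinates of that projection in terms of the columns of $X$.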
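The projection property can be checked numerically: at the optimum, the residual $y - X w^{*}$ should be orthogonal to every column of $X$. The sketch below uses a small made-up dataset (four points, one feature plus an intercept) rather than the post's data, and the random perturbations are only an illustration.

```python
import numpy as np

# Made-up design matrix (intercept column plus one feature) and targets.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and residual vector.
y_hat = X @ w_star
residual = y - y_hat

# At the optimum the gradient -2 X^T (y - X w) vanishes, so the residual
# is orthogonal to every column of X.
print(X.T @ residual)   # entries are zero up to rounding

# Any other weight vector gives a larger squared error: compare against
# a few random perturbations of w_star.
rng = np.random.default_rng(0)
best = residual @ residual
worse = [np.sum((y - X @ (w_star + rng.normal(size=2))) ** 2) for _ in range(5)]
print(best, min(worse))
```

Because $X$ has full column rank here, the objective is strictly convex, so every perturbed weight vector ends up with a strictly larger error than `best`.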

Anchor dataset (2-feature, design matrix with intercept column):

```python
import numpy as np

# Columns: intercept (all ones), sq_ft, bedrooms
X_aug = np.array([
    [1, 650,  2],
    [1, 850,  2],
    [1, 1100, 3],
    [1, 1400, 3],
    [1, 1600, 4],
    [1, 1900, 4]
], dtype=float)

# Target values, one per row of X_aug
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
```

Computing $X^{⊤} X$

$X^{⊤}$ is $3 \times 6$ and $X$ is $6 \times 3$, so $X^{⊤} X$ is $3 \times 3$:

```python
XtX = X_aug.T @ X_aug
print(XtX)
```

```
[[       6.     7500.       18.]
 [    7500. 10485000.    24500.]
 [      18.    24500.       58.]]
```

Diagonal: $[n = 6, \sum x_{i 1}^{2} = 10485000, \sum x_{i 2}^{2} = 58]$

Off-diagonal: $[\sum x_{i 1} = 7500, \sum x_{i 2} = 18, \sum x_{i 1} x_{i 2} = 24500]$
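As a sanity check, each entry of $X^{⊤} X$ can be rebuilt as the dot product of two columns of the design matrix. A minimal sketch, with the columns split into separate arrays (the names `sq_ft`, `bedrooms`, and `by_hand` are illustrative):

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4, 4], dtype=float)
ones = np.ones_like(sq_ft)

X_aug = np.column_stack([ones, sq_ft, bedrooms])
XtX = X_aug.T @ X_aug

# Entry (j, k) of X^T X is the dot product of column j with column k.
by_hand = np.array([
    [ones @ ones,     ones @ sq_ft,     ones @ bedrooms],
    [sq_ft @ ones,    sq_ft @ sq_ft,    sq_ft @ bedrooms],
    [bedrooms @ ones, bedrooms @ sq_ft, bedrooms @ bedrooms],
])

print(np.allclose(XtX, by_hand))   # the two constructions agree
print(np.allclose(XtX, XtX.T))     # X^T X is always symmetric
```

The symmetry is why only the diagonal and the three distinct off-diagonal sums need to be computed by hand.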

Computing $X^{⊤} y$

$X^{⊤}$ is $3 \times 6$ and $y$ is $6 \times 1$, so $X^{⊤} y$ is $3 \times 1$:

```python
Xty = X_aug.T @ y
print(Xty)
```

```
[   1820. 2497000.    5860.]
```

These are $[\sum y_{i} = 1820, \sum x_{i 1} y_{i} = 2497000, \sum x_{i 2} y_{i} = 5860]$.

Solving the Normal Equations

Two approaches: direct inversion and LU decomposition. Always prefer the latter.

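```python
# Method 1: direct inversion (numerically unstable for ill-conditioned matrices)
w_inv = np.linalg.inv(XtX) @ Xty

# Method 2: np.linalg.solve (uses LU decomposition; preferred)
w_ols = np.linalg.solve(XtX, Xty)

print(f"w₀ = {w_ols[0]:.4f}")
print(f"w₁ = {w_ols[1]:.6f}")
print(f"w₂ = {w_ols[2]:.4f}")
```

```
w₀ = 53.3333
w₁ = 0.200000
w₂ = 0.0000
```

The exact solution is $w^{*} = (160/3, 0.2, 0)$: with sq_ft in the model, bedrooms receives zero weight on this dataset. Floating-point rounding can make the last line print as -0.0000.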

np.linalg.inv computes the full matrix inverse and then multiplies it by $X^{⊤} y$, which costs more arithmetic and accumulates more rounding error than the problem requires. np.linalg.solve factors the matrix once (LU decomposition with partial pivoting) and solves the linear system by substitution. That is cheaper and typically more accurate, especially when $X^{⊤} X$ is nearly singular.
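The two calls can be compared directly on the anchor data. The sketch below also includes np.linalg.lstsq, which the post does not use: it works on $X$ itself instead of $X^{⊤} X$, and is shown here only as a reference point because forming $X^{⊤} X$ squares the condition number of $X$.

```python
import numpy as np

X_aug = np.array([
    [1, 650,  2],
    [1, 850,  2],
    [1, 1100, 3],
    [1, 1400, 3],
    [1, 1600, 4],
    [1, 1900, 4],
], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

XtX = X_aug.T @ X_aug
Xty = X_aug.T @ y

w_inv = np.linalg.inv(XtX) @ Xty                      # explicit inverse, then multiply
w_solve = np.linalg.solve(XtX, Xty)                   # LU factorization of X^T X
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # least squares on X directly

# Condition numbers: cond(X^T X) is the square of cond(X) in the 2-norm.
print(f"cond(X)     = {np.linalg.cond(X_aug):.3e}")
print(f"cond(X^T X) = {np.linalg.cond(XtX):.3e}")

# How well does each answer satisfy the normal equations?
for name, w in [("inv", w_inv), ("solve", w_solve), ("lstsq", w_lstsq)]:
    err = np.linalg.norm(XtX @ w - Xty)
    print(f"{name:>5}: w = {w}, normal-equation residual = {err:.3e}")
```

On a dataset this small all three agree to many digits. The gap between them only becomes visible as the condition number grows.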

Verify: Prediction and Residuals

| sq_ft | bedrooms | $\hat{y}_{i}$ | $y_{i}$ | $ε_{i} = y_{i} - \hat{y}_{i}$ |
|---|---|---|---|---|
| 650 | 2 | 53.33 + 0.2×650 + 0×2 = 183.33 | 180 | −3.33 |
| 850 | 2 | 53.33 + 0.2×850 + 0×2 = 223.33 | 220 | −3.33 |
| 1100 | 3 | 53.33 + 0.2×1100 + 0×3 = 273.33 | 280 | 6.67 |
| 1400 | 3 | 53.33 + 0.2×1400 + 0×3 = 333.33 | 340 | 6.67 |
| 1600 | 4 | 53.33 + 0.2×1600 + 0×4 = 373.33 | 370 | −3.33 |
| 1900 | 4 | 53.33 + 0.2×1900 + 0×4 = 433.33 | 430 | −3.33 |

```python
y_pred = X_aug @ w_ols
sse = ((y - y_pred) ** 2).sum()
print(f"SSE (OLS): {sse:.4f}")
```

```
SSE (OLS): 133.3333
```

This matches simple regression's SSE of 133.3. Adding a feature can never increase the training SSE of an OLS fit, and here it does not reduce it either: bedrooms gets a weight of zero because it explains nothing that sq_ft has not already explained. The two features are strongly correlated ($r \approx 0.949$ between sq_ft and bedrooms), so the coefficients are also sensitive to small changes in the data. The OLS solution is still mathematically optimal for these features, but the second feature adds no independent signal.
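Adding a column to an OLS fit cannot raise the training SSE, because the smaller model is a special case of the larger one with the new weight set to zero. The sketch below checks this on the anchor data and computes the correlation between the two features. The helper `ols_sse` is a hypothetical name, and np.linalg.lstsq is used here instead of the normal equations purely for brevity.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def ols_sse(X, y):
    """Fit OLS by least squares and return (weights, training SSE)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, float(np.sum((y - X @ w) ** 2))

ones = np.ones_like(sq_ft)
X_simple = np.column_stack([ones, sq_ft])            # intercept + sq_ft
X_multi = np.column_stack([ones, sq_ft, bedrooms])   # intercept + both features

w_simple, sse_simple = ols_sse(X_simple, y)
w_multi, sse_multi = ols_sse(X_multi, y)

# Pearson correlation between the two features.
r = np.corrcoef(sq_ft, bedrooms)[0, 1]

print(f"r(sq_ft, bedrooms) = {r:.3f}")
print(f"SSE simple   = {sse_simple:.4f}")
print(f"SSE multiple = {sse_multi:.4f}")
```

If the two SSE values come out equal, the extra feature contributed nothing beyond what the first feature already captured.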

Why $(X^{⊤} X)^{- 1}$ Exists — and When It Doesn't

$(X^{⊤} X)^{- 1}$ exists if and only if $X$ has full column rank — no perfect linear dependence among the feature columns.

Three situations where $X^{⊤} X$ is singular:

1. More features than samples ($p > n$): $X^{⊤} X$ is $p \times p$ but its rank is $\leq n < p$, so the system is underdetermined.
2. Perfect multicollinearity: feature A = $2 \times$ feature B, so the two columns are linearly dependent and $X^{⊤} X$ is singular.
3. Zero-variance column: a feature with identical values for all samples is a column of constants, which is linearly dependent with the bias column.

Fix when singular: Ridge regression adds $λ I$ to $X^{⊤} X$ :

$w^{*} = (X^{⊤} X + λ I)^{- 1} X^{⊤} y$

Because $X^{⊤} X$ is positive semi-definite, adding $λ I$ raises every eigenvalue by $λ$. For any $λ > 0$ the matrix is therefore positive definite and invertible, regardless of multicollinearity.
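A small sketch of the singular case and the Ridge fix, using made-up data in which one column is exactly twice another. The value `lam = 0.1` is arbitrary, and, as in the formula above, the penalty is applied to every weight including the intercept.

```python
import numpy as np

# Hypothetical data with perfect multicollinearity: the third column is
# exactly 2x the second, so X does not have full column rank.
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 4.0],
    [1.0, 3.0, 6.0],
    [1.0, 4.0, 8.0],
])
y = np.array([3.0, 5.0, 7.0, 9.0])

XtX = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(XtX))   # 2, not 3

lam = 0.1  # ridge strength (arbitrary illustrative value)
A = XtX + lam * np.eye(XtX.shape[0])

# Every eigenvalue of A is at least lam, so A is positive definite.
print("smallest eigenvalue:", np.linalg.eigvalsh(A).min())

# The regularized system has a unique solution.
w_ridge = np.linalg.solve(A, X.T @ y)
print("ridge weights:", w_ridge)
```

Calling np.linalg.solve on the unregularized `XtX` here would either raise a LinAlgError or return meaningless values, depending on how rounding falls.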

Computational Complexity

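| Operation | Complexity |
|---|---|
| Compute $X^{⊤} X$ | $O (n p^{2})$ |
| Invert $X^{⊤} X$ | $O (p^{3})$ |
| Total | $O (n p^{2} + p^{3})$ |

For $n = 1{,}000{,}000$ samples and $p = 1{,}000$ features, forming $X^{⊤} X$ takes about $n p^{2} = 10^{12}$ operations and solving the system about $p^{3} = 10^{9}$, so the matrix product dominates. Both terms grow quickly with $p$: at $p = 100{,}000$ the solve alone is $10^{15}$ operations and $X^{⊤} X$ has $10^{10}$ entries. Gradient descent costs $O (n p)$ per iteration, which scales far better for large $p$.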
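The arithmetic behind these estimates is easy to reproduce. A rough sketch with constant factors dropped; the function names and the choice of `k = 100` gradient steps are illustrative assumptions, not measurements:

```python
def normal_equation_ops(n: int, p: int) -> dict:
    """Leading-order operation counts (constants dropped) for the normal equation."""
    form_xtx = n * p**2   # build X^T X
    solve = p**3          # factor and solve the p x p system
    return {"form_xtx": form_xtx, "solve": solve, "total": form_xtx + solve}

def gradient_descent_ops(n: int, p: int, k: int) -> int:
    """Leading-order cost of k full-batch gradient steps at O(n p) each."""
    return n * p * k

n, p = 1_000_000, 1_000
ne = normal_equation_ops(n, p)
print(f"form X^T X : {ne['form_xtx']:.1e}")
print(f"solve      : {ne['solve']:.1e}")
print(f"GD, k=100  : {gradient_descent_ops(n, p, 100):.1e}")
```

These are counts of scalar operations only. Real running time also depends on memory traffic and on how well the linear algebra library is optimized.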

Normal Equation vs Gradient Descent

| Aspect | Normal Equation | Gradient Descent |
|---|---|---|
| Learning rate $α$ | Not needed | Must be tuned |
| Iterations | One-shot (exact) | Many iterations |
| Complexity | $O (n p^{2} + p^{3})$ | $O (n p k)$ for $k$ iterations |
| Large $n$ (millions) | Slow ($X^{⊤} X$ computation) | Fast (mini-batch) |
| Large $p$ (thousands) | Very slow ($p^{3}$ inversion) | Manageable |
| Exact solution | Yes | Approximate (converges) |
| Feature scaling required | No | Yes (for speed) |
| Multicollinearity | Fails when perfect (singular $X^{⊤} X$); unstable when near-perfect | Converges, but coefficients are unstable |
| Preferred when | $p < 1000$, $n < 100{,}000$ | $p$ large or $n$ large |
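A minimal sketch comparing the two approaches on the anchor data. The standardization step, the learning rate of 0.25, and the 5000 iterations are illustrative choices rather than values from the post. Both methods are run on the same standardized features, so their weights are directly comparable.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4, 4], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Standardize the features so gradient descent converges with a simple step size.
F = np.column_stack([sq_ft, bedrooms])
F_std = (F - F.mean(axis=0)) / F.std(axis=0)
X = np.column_stack([np.ones(len(y)), F_std])
n = len(y)

# Normal equation: a single linear solve.
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error.
alpha = 0.25                       # learning rate (must be tuned in general)
w_gd = np.zeros(X.shape[1])
for _ in range(5000):
    grad = (2.0 / n) * X.T @ (X @ w_gd - y)
    w_gd -= alpha * grad

print("normal equation :", w_ne)
print("gradient descent:", w_gd)
print("max difference  :", np.max(np.abs(w_ne - w_gd)))
```

Without the standardization step, the same learning rate would diverge on the raw sq_ft column, which is the "feature scaling" row of the table in practice.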

OLS Formula Summary

- Normal equations: $X^{⊤} X w = X^{⊤} y$
- Closed form: $w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$
- Numerically stable: use np.linalg.solve(XtX, Xty) over np.linalg.inv(XtX) @ Xty
- Geometric interpretation: $X w^{*}$ is the orthogonal projection of $y$ onto the column space of $X$

The normal equation is the analytical solution to the OLS problem, not an approximation. For pure linear regression on small datasets ($n < 100{,}000$, $p < 1000$), it's the right choice. Beyond those thresholds, forming $X^{⊤} X$ (cost $n p^{2}$) and solving the system (cost $p^{3}$) become the bottleneck.

The deeper limitation: the normal equation tells you the optimal weights for the MSE objective, but it can't tell you if those weights are statistically reliable. When features are nearly collinear, the coefficients are mathematically correct but practically unstable: small data changes cause large coefficient changes. Ridge regularization addresses that problem by changing the objective itself. It adds a penalty $λ ∥ w ∥^{2}$ to the squared error, so the minimizer is a different, more stable set of weights.
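The instability can be made concrete by nudging one target value and measuring how far the weights move. In this sketch the perturbation (+5 on one observation), the value `lam = 1.0`, and the helper name `fit` are all illustrative assumptions. The ridge system penalizes every weight, including the intercept, to match the closed form given earlier.

```python
import numpy as np

X_aug = np.array([
    [1, 650,  2],
    [1, 850,  2],
    [1, 1100, 3],
    [1, 1400, 3],
    [1, 1600, 4],
    [1, 1900, 4],
], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def fit(X, y, lam=0.0):
    """Solve (X^T X + lam * I) w = X^T y; lam = 0 gives plain OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Nudge a single target value and see how far the weights move.
y_perturbed = y.copy()
y_perturbed[3] += 5.0   # arbitrary small change to one observation

lam = 1.0   # illustrative ridge strength, not a tuned value
shift_ols = fit(X_aug, y_perturbed) - fit(X_aug, y)
shift_ridge = fit(X_aug, y_perturbed, lam) - fit(X_aug, y, lam)

print("OLS weight shift  :", shift_ols)
print("ridge weight shift:", shift_ridge)
print("norms:", np.linalg.norm(shift_ols), np.linalg.norm(shift_ridge))
```

The ridge shift is always the smaller of the two in Euclidean norm, because adding $λ$ to each eigenvalue of $X^{⊤} X$ shrinks every component of the response to a perturbation. In practice the features would usually be standardized before choosing $λ$.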

Test Your Understanding

1. Compute $X^{⊤} X$ by hand for the 1-feature anchor ($n = 6$, design matrix with intercept column). Verify the diagonal entries are $[n, \sum x_{i}^{2}]$.
2. The np.linalg.solve function uses LU decomposition. Why does this avoid the numerical instability of inv()? What does "ill-conditioned" mean for $X^{⊤} X$?
3. You have $n = 500$ samples and $p = 600$ features. What does $X^{⊤} X$ look like, and why is $(X^{⊤} X)^{- 1}$ undefined without Ridge?
4. Adding $λ I$ to $X^{⊤} X$ makes it invertible. What happens to the OLS solution as $λ \to \infty$? As $λ \to 0$?
5. Two datasets: Dataset A has $n = 1000$, $p = 5$. Dataset B has $n = 100000$, $p = 50$. For which would you use the normal equation, and for which gradient descent? Compute the approximate operation count for each.
