Equation of a Line, 3D Plane, and Hyperplane

Jun 25, 2026 • 7 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

Linear regression is a machine that draws a flat surface through data. Before training weights, you need a geometric grip on what "flat surface" means across 1, 2, and arbitrarily many features — because every linear model, from logistic regression to the linear layer in a transformer, is doing the same thing in higher-dimensional space.

The Equation of a Line (2D)

With one feature — say, square footage — the prediction is a line:

$\hat{y} = w_{0} + w_{1} x$

$w_{0}$ is the intercept: the value of $\hat{y}$ when $x = 0$. $w_{1}$ is the slope: how much $\hat{y}$ changes for every one-unit increase in $x$.

For predicting house price (in thousands of dollars) from square footage, assume $w_{0} = 50$ and $w_{1} = 0.20$. Every extra square foot adds \$0.20k (i.e., \$200) to the predicted price.

$x_{i}$ (sq_ft)	$w_{0} + w_{1} x_{i}$	$\hat{y}_{i}$	$y_{i}$	residual $ε_{i} = y_{i} - \hat{y}_{i}$
650	50 + 0.20×650	180.0	180	0.0
850	50 + 0.20×850	220.0	220	0.0
1100	50 + 0.20×1100	270.0	280	10.0
1400	50 + 0.20×1400	330.0	340	10.0
1600	50 + 0.20×1600	370.0	370	0.0
1900	50 + 0.20×1900	430.0	430	0.0

The two non-zero residuals (at 1100 and 1400 sq ft) tell us the weights aren't quite optimal — but they're close. The goal of training is to find $w_{0}$ and $w_{1}$ that minimize the total squared residual.
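The table's arithmetic can be reproduced in a few lines. The following is a minimal sketch using NumPy (the library choice and all variable names are assumptions for illustration, not part of the original post). It recomputes the predictions and residuals, then sums the squared residuals that training would try to shrink.

```python
import numpy as np

# Data from the table above: square footage and observed price (in $k).
sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
price = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Illustrative weights from the text, not fitted values.
w0, w1 = 50.0, 0.20

y_hat = w0 + w1 * sq_ft        # one prediction per house
residuals = price - y_hat      # observed minus predicted
sse = np.sum(residuals ** 2)   # total squared residual

print(np.round(y_hat, 1))
print(np.round(residuals, 1))
print(round(float(sse), 1))
```

With these weights only the two 10-unit residuals contribute, so the total squared residual is 10² + 10² = 200.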

[Figure: price (\$k) plotted against sq_ft. The fitted line has intercept w₀ = 50 at x = 0 and slope 0.20 (rise = 20 per run = 100); dashed segments mark the two residuals of ε = 10.]

The slope's sign tells you the direction: $w_{1} > 0$ means larger houses cost more. $w_{1} < 0$ would mean the opposite. $w_{1} = 0$ means the line is horizontal — a feature with no predictive power.

What Changes at 3D: The Equation of a Plane

Add a second feature — number of bedrooms — and the model becomes:

$\hat{y} = w_{0} + w_{1} \times \text{sqft} + w_{2} \times \text{bedrooms}$

With two features, a single prediction now requires values on two axes, and the model surface is a plane floating in 3D. Assume $w_{0} = 30$, $w_{1} = 0.17$, $w_{2} = 15$:

sq_ft	bedrooms	$w_{0} + w_{1} \cdot \text{sqft} + w_{2} \cdot \text{beds}$	$\hat{y}$	$y$	$ε$
650	2	30 + 110.5 + 30	170.5	180	9.5
850	2	30 + 144.5 + 30	204.5	220	15.5
1100	3	30 + 187.0 + 45	262.0	280	18.0
1400	3	30 + 238.0 + 45	313.0	340	27.0
1600	4	30 + 272.0 + 60	362.0	370	8.0
1900	4	30 + 323.0 + 60	413.0	430	17.0

The residuals are larger than in the single-feature case; these particular weights ($w_{0} = 30, w_{1} = 0.17, w_{2} = 15$) are illustrative, not optimal. Training will find better values.
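The two-feature table can be checked the same way. This is a small NumPy sketch under the same assumptions as the text (the variable names are illustrative); it evaluates the plane at each house and reports the residuals.

```python
import numpy as np

# Data from the two-feature table: size, bedrooms, observed price (in $k).
sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
beds = np.array([2, 2, 3, 3, 4, 4], dtype=float)
price = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Illustrative (non-optimal) weights from the text.
w0, w1, w2 = 30.0, 0.17, 15.0

y_hat = w0 + w1 * sq_ft + w2 * beds   # height of the plane at each house
residuals = price - y_hat             # vertical gap from plane to data point

print(np.round(y_hat, 1))
print(np.round(residuals, 1))
```

Every residual comes out positive, meaning the plane sits below all six data points. That is one visible sign that these weights have room to improve.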

[Figure: 3D plot with axes sq_ft, beds, and price (\$k), showing the fitted plane ŷ = 30 + 0.17·sqft + 15·beds and dashed residual sticks running from each data point to the plane.]

Generalizing to p Features: The Hyperplane

With $p$ features the model is:

$\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \dots + w_{p} x_{p}$

In compact dot-product form, prepend a 1 to each input vector and absorb the intercept into the weight vector:

$\hat{y} = w \cdot x \quad \text{where} \quad w = [w_{0}, w_{1}, \dots, w_{p}], \quad x = [1, x_{1}, \dots, x_{p}]$

A hyperplane in $p + 1$ dimensions is still a flat surface; it just can't be visualized beyond 3D. The prefix "hyper" signals extra dimensions, not extra complexity: the model is still linear in the parameters.
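To make the dot-product form concrete, here is a minimal NumPy sketch for a single house, reusing the illustrative two-feature weights from earlier (the variable names are assumptions). It prepends the constant 1 and confirms that the dot product equals the expanded sum.

```python
import numpy as np

w = np.array([30.0, 0.17, 15.0])        # [w0, w1, w2], illustrative weights
features = np.array([1100.0, 3.0])      # [sq_ft, bedrooms] for one house

# Prepend the constant 1 so that w0 is absorbed into the dot product.
x = np.concatenate(([1.0], features))

y_hat = float(np.dot(w, x))
expanded = w[0] + w[1] * features[0] + w[2] * features[1]

print(round(y_hat, 1), round(float(expanded), 1))
```

Both expressions evaluate the same point on the plane; the dot product is just the more compact way to write it.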

The Intercept Trick

Without $w_{0}$, the hyperplane is forced to pass through the origin. Most real data doesn't fit that constraint: in the running example, the line's value at zero square footage is $w_{0} = 50$, not zero. The standard fix is to prepend a column of ones to the feature matrix, so that $w_{0}$ multiplies a constant 1 in every row.
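The cost of dropping the intercept can be seen numerically. The sketch below is an illustration, not the post's own method: it assumes NumPy and uses `np.linalg.lstsq` as an off-the-shelf least-squares solver to fit the six houses twice, once through the origin and once with a column of ones.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
price = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Feature matrix without and with a leading column of ones.
X_no_intercept = sq_ft.reshape(-1, 1)
X_with_intercept = np.column_stack([np.ones_like(sq_ft), sq_ft])

# Least-squares fit for each design (lstsq returns a tuple; keep the weights).
w_no, *_ = np.linalg.lstsq(X_no_intercept, price, rcond=None)
w_with, *_ = np.linalg.lstsq(X_with_intercept, price, rcond=None)

sse_no = np.sum((price - X_no_intercept @ w_no) ** 2)
sse_with = np.sum((price - X_with_intercept @ w_with) ** 2)

print("through origin:", np.round(w_no, 4), "SSE:", round(float(sse_no), 2))
print("with intercept:", np.round(w_with, 4), "SSE:", round(float(sse_with), 2))
```

The line forced through the origin has to tilt more steeply to compensate for the missing offset, and its squared error is larger than that of the line that is free to shift up or down.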

For the 1-feature anchor, the design matrix $X$ with an intercept column is:

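$X = \begin{bmatrix} 1 & 650 \\ 1 & 850 \\ 1 & 1100 \\ 1 & 1400 \\ 1 & 1600 \\ 1 & 1900 \end{bmatrix}, \quad w = \begin{bmatrix} 50 \\ 0.20 \end{bmatrix}$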

The matrix product $X w$ gives predictions for all six samples at once:

$X w = \begin{bmatrix} 1 \times 50 + 650 \times 0.20 \\ 1 \times 50 + 850 \times 0.20 \\ 1 \times 50 + 1100 \times 0.20 \\ 1 \times 50 + 1400 \times 0.20 \\ 1 \times 50 + 1600 \times 0.20 \\ 1 \times 50 + 1900 \times 0.20 \end{bmatrix} = \begin{bmatrix} 180 \\ 220 \\ 270 \\ 330 \\ 370 \\ 430 \end{bmatrix}$

For the 2-feature anchor, $X$ expands to 6×3:

$X = \begin{bmatrix} 1 & 650 & 2 \\ 1 & 850 & 2 \\ 1 & 1100 & 3 \\ 1 & 1400 & 3 \\ 1 & 1600 & 4 \\ 1 & 1900 & 4 \end{bmatrix}$

The model $\hat{y} = X w$ now holds for all samples simultaneously. This matrix form is how every linear model is implemented at scale — no loops over samples.
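As a sketch of what "no loops over samples" looks like in practice, the snippet below builds both design matrices with NumPy and computes every prediction with a single matrix-vector product. The use of `np.column_stack` and the variable names are assumptions for illustration.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
beds = np.array([2, 2, 3, 3, 4, 4], dtype=float)
ones = np.ones_like(sq_ft)

# 1-feature design matrix (6x2) and its illustrative weights.
X1 = np.column_stack([ones, sq_ft])
w_1feat = np.array([50.0, 0.20])

# 2-feature design matrix (6x3) and its illustrative weights.
X2 = np.column_stack([ones, sq_ft, beds])
w_2feat = np.array([30.0, 0.17, 15.0])

# One matrix-vector product yields all six predictions at once.
pred1 = X1 @ w_1feat
pred2 = X2 @ w_2feat

print(X1.shape, np.round(pred1, 1))
print(X2.shape, np.round(pred2, 1))
```

Adding a feature only adds a column to the matrix and an entry to the weight vector; the prediction code itself does not change.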

Why This Matters for ML

Every linear model is a hyperplane. Logistic regression uses a hyperplane as a decision boundary: points on one side are class 1, points on the other are class 0. SVMs find the hyperplane with maximum margin. Each linear layer in a neural network applies the same matrix multiplication. Understanding the geometry now means that each subsequent linear model is mostly a variation on how the weights $w$ are found.
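The decision-boundary idea can be sketched in a few lines. The weights and points below are hypothetical values chosen only for illustration (they do not come from the post); the code assigns class 1 to points with a positive score $w \cdot x$ and shows that thresholding the logistic function at 0.5 yields the same labels.

```python
import numpy as np

# Hypothetical weights [w0, w1, w2] for a 2-feature classifier.
w = np.array([-1.0, 2.0, 1.0])

# Hypothetical 2D points to classify.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.5], [2.0, 3.0]])
X = np.column_stack([np.ones(len(points)), points])

scores = X @ w                          # signed score relative to the hyperplane
labels = (scores > 0).astype(int)       # which side of the hyperplane

# Logistic regression squashes the same score into a probability.
probs = 1.0 / (1.0 + np.exp(-scores))
labels_from_probs = (probs > 0.5).astype(int)

print(labels, labels_from_probs)
```

The hyperplane $w \cdot x = 0$ is exactly where the logistic output equals 0.5, which is why the two labelings agree.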

The next question is: which $w$ is best? That requires a loss function.

Geometry Summary

Dimensions	Equation	Geometric Object	Visualizable?
1 feature	$\hat{y} = w_{0} + w_{1} x$	Line (2D)	Yes
2 features	$\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2}$	Plane (3D)	Yes
3 features	$\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2} + w_{3} x_{3}$	Hyperplane (4D)	No
$p$ features	$\hat{y} = w \cdot x$	Hyperplane ($p + 1$ dimensions)	No

The design matrix $X$ with a leading column of ones is the same representation used to derive the OLS closed-form solution $w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$ . Understanding why $X^{⊤} X$ appears there requires exactly the matrix form developed here.
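As a preview of that closed form, here is a minimal NumPy sketch applied to the one-feature data. It assumes $X^{⊤} X$ is invertible and solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical preference and not something the post prescribes.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
price = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Design matrix with a leading column of ones.
X = np.column_stack([np.ones_like(sq_ft), sq_ft])

# Normal equation: solve (X^T X) w = X^T y for w.
XtX = X.T @ X
Xty = X.T @ price
w_star = np.linalg.solve(XtX, Xty)

sse_star = np.sum((price - X @ w_star) ** 2)
print(np.round(w_star, 4), round(float(sse_star), 2))
```

On this data the solution keeps the slope at 0.20 but moves the intercept to about 53.33, which lowers the total squared residual from 200 to about 133.33.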

The limitation is linearity itself. If the true relationship between square footage and price curves — prices rise steeply at first, then plateau — a hyperplane can only approximate it. Polynomial regression adds $x^{2}, x^{3}, \dots$ columns to $X$ to handle this, but the model remains linear in the parameters. Genuinely non-linear relationships (e.g., exponential growth, tree-structured decision rules) require a different model class.
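The polynomial idea amounts to adding columns to $X$. The sketch below is illustrative only: rescaling square footage to thousands and fitting with `np.linalg.lstsq` are assumptions made to keep the numbers well behaved, not steps taken from the post.

```python
import numpy as np

sq_ft = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
price = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Rescale so that x and x^2 have comparable magnitudes (an assumption).
x = sq_ft / 1000.0

X_lin = np.column_stack([np.ones_like(x), x])           # columns: 1, x
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])  # columns: 1, x, x^2

# Same solver for both: the model is still linear in the weights.
w_lin, *_ = np.linalg.lstsq(X_lin, price, rcond=None)
w_poly, *_ = np.linalg.lstsq(X_poly, price, rcond=None)

sse_lin = np.sum((price - X_lin @ w_lin) ** 2)
sse_poly = np.sum((price - X_poly @ w_poly) ** 2)

print(X_poly.shape, round(float(sse_lin), 2), round(float(sse_poly), 2))
```

The prediction is still $X w$; only the columns of $X$ changed. Because the quadratic design contains the linear one, its training error can never be higher, which is also why extra columns invite overfitting.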

Test Your Understanding

With $w_{0} = 50$ and $w_{1} = 0.20$, what is $\hat{y}$ for a house of 1250 sq ft? What is the residual if the true price is \$290k?
Why does prepending a column of ones to $X$ allow the model to learn a non-zero intercept? What would happen geometrically if you left it out and the true intercept was \$50k?
A colleague proposes fitting two separate lines — one for small houses and one for large houses — instead of a single hyperplane. When would this be better, and what model class formalizes that idea?
For the 2-feature case, the coefficient $w_{2} = 15$ means each bedroom adds \$15k to price, holding sq_ft fixed. How would you confirm this interpretation from the trace table?
If you have $p = 1000$ features and $n = 500$ samples, what does the design matrix $X$ look like, and why does this cause problems for the OLS formula $w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$?
