Equation of a Line, 3D Plane, and Hyperplane
Linear regression is a machine that draws a flat surface through data. Before training weights, you need a geometric grip on what "flat surface" means across 1, 2, and arbitrarily many features — because every linear model, from logistic regression to the linear layer in a transformer, is doing the same thing in higher-dimensional space.
The Equation of a Line (2D)
With one feature (say, square footage) the prediction is a line:
$$\hat{y} = w_0 + w_1 x$$
$w_0$ is the intercept: the value of $\hat{y}$ when $x = 0$. $w_1$ is the slope: how much $\hat{y}$ changes for every one-unit increase in $x$.
For predicting house price (in \$k) from square footage, assume $w_0 = 50$ and $w_1 = 0.20$. Every extra square foot adds \$0.20k (that is, \$200) to the predicted price.
| $x$ (sq_ft) | $w_0 + w_1 x$ | $\hat{y}$ (\$k) | $y$ (\$k) | residual $y - \hat{y}$ |
|---|---|---|---|---|
| 650 | 50 + 0.20×650 | 180.0 | 180 | 0.0 |
| 850 | 50 + 0.20×850 | 220.0 | 220 | 0.0 |
| 1100 | 50 + 0.20×1100 | 270.0 | 280 | 10.0 |
| 1400 | 50 + 0.20×1400 | 330.0 | 340 | 10.0 |
| 1600 | 50 + 0.20×1600 | 370.0 | 370 | 0.0 |
| 1900 | 50 + 0.20×1900 | 430.0 | 430 | 0.0 |
The two non-zero residuals (at 1100 and 1400 sq ft) tell us the weights aren't quite optimal, but they're close. The goal of training is to find the $w_0$ and $w_1$ that minimize the total squared residual, $\sum_i (y_i - \hat{y}_i)^2$.
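To make the trace table reproducible, here is a minimal sketch in plain Python. The data and weights are copied from the table; the names `predict`, `residuals`, and `sse` are my own choices for illustration, not part of any library.

```python
# Anchor data copied from the table above: square footage and true price in $k.
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]

w0, w1 = 50.0, 0.20  # illustrative weights from the text, not fitted values


def predict(x, w0, w1):
    """Evaluate the line y_hat = w0 + w1 * x for a single input."""
    return w0 + w1 * x


y_hat = [predict(x, w0, w1) for x in sq_ft]
residuals = [y - p for y, p in zip(price, y_hat)]
sse = sum(r ** 2 for r in residuals)  # total squared residual

for x, p, y, r in zip(sq_ft, y_hat, price, residuals):
    print(f"{x:>5} sq_ft  y_hat={p:6.1f}  y={y}  residual={r:5.1f}")
print(f"sum of squared residuals: {sse:.1f}")
```

The last line is the quantity training tries to shrink: two residuals of 10 give a total squared residual of 200 for these weights.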
[Figure: price ($k) plotted against sq_ft. The fitted line has intercept w₀ = 50 and slope 0.20 (rise 20 per run of 100); the two non-zero residuals (ε = 10) are drawn as dashed sticks.]
The slope's sign tells you the direction: $w_1 > 0$ means larger houses cost more, and $w_1 < 0$ would mean the opposite. $w_1 = 0$ means the line is horizontal: the feature has no predictive power.
What Changes at 3D: The Equation of a Plane
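Add a second feature, number of bedrooms, and the model becomes:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$$
With two features, a single prediction now requires values on two input axes ($x_1$ for sq_ft, $x_2$ for bedrooms), and the model surface is a plane floating in 3D. Assume $w_0 = 30$, $w_1 = 0.17$, $w_2 = 15$: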
| sq_ft | bedrooms | $w_0 + w_1 x_1 + w_2 x_2$ | $\hat{y}$ (\$k) | $y$ (\$k) | residual $y - \hat{y}$ |
|---|---|---|---|---|---|
| 650 | 2 | 30 + 110.5 + 30 | 170.5 | 180 | 9.5 |
| 850 | 2 | 30 + 144.5 + 30 | 204.5 | 220 | 15.5 |
| 1100 | 3 | 30 + 187.0 + 45 | 262.0 | 280 | 18.0 |
| 1400 | 3 | 30 + 238.0 + 45 | 313.0 | 340 | 27.0 |
| 1600 | 4 | 30 + 272.0 + 60 | 362.0 | 370 | 8.0 |
| 1900 | 4 | 30 + 323.0 + 60 | 413.0 | 430 | 17.0 |
The residuals are larger than in the single-feature case: these particular weights ($w_0 = 30$, $w_1 = 0.17$, $w_2 = 15$) are illustrative, not optimal. Training will find better values.
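The same check works for the plane. This is a small sketch under the same assumptions as the table: the rows and weights are copied from it, and `predict_plane` is a hypothetical helper name.

```python
# Rows copied from the table above: (sq_ft, bedrooms, true price in $k).
houses = [
    (650, 2, 180), (850, 2, 220), (1100, 3, 280),
    (1400, 3, 340), (1600, 4, 370), (1900, 4, 430),
]
w0, w1, w2 = 30.0, 0.17, 15.0  # illustrative weights from the text


def predict_plane(sq_ft, beds):
    """Evaluate the plane y_hat = w0 + w1 * sq_ft + w2 * beds."""
    return w0 + w1 * sq_ft + w2 * beds


residuals = [price - predict_plane(sq_ft, beds) for sq_ft, beds, price in houses]
sse = sum(r ** 2 for r in residuals)  # total squared residual

print([round(r, 1) for r in residuals])
print(round(sse, 2))
```

Every residual is positive, meaning this plane sits below all six points, which is one quick way to see that the weights are not yet fitted.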
[Figure: 3D plot with axes sq_ft, beds, and price ($k). The fitted plane ŷ = 30 + 0.17·sqft + 15·beds is shaded, and dashed residual sticks connect each data point to the plane.]
Generalizing to p Features: The Hyperplane
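With $p$ features the model is:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$$
In compact dot-product form, prepend a 1 to each input vector and absorb the intercept into the weight vector:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}, \qquad \mathbf{x} = [1, x_1, \dots, x_p]^\top, \quad \mathbf{w} = [w_0, w_1, \dots, w_p]^\top$$
A hyperplane in $p + 1$ dimensions is still a flat surface; it just can't be visualized beyond 3D. The "hyper" refers to the number of dimensions, not to added complexity: the relationship is still linear in the parameters.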
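One way to see that nothing changes as the feature count grows is to write the prediction once and reuse it. A minimal sketch, assuming the weight list stores the intercept first; the function name and the 4-feature weights are hypothetical, chosen only for illustration.

```python
def predict(w, features):
    """Return w . [1, x_1, ..., x_p]; w[0] plays the role of the intercept w_0."""
    x = [1.0] + list(features)  # prepend the constant 1
    if len(w) != len(x):
        raise ValueError("expected one weight per feature plus an intercept")
    return sum(wi * xi for wi, xi in zip(w, x))


# p = 1: the line from the first example.
print(predict([50.0, 0.20], [650]))
# p = 2: the plane from the second example.
print(predict([30.0, 0.17, 15.0], [650, 2]))
# p = 4: same function, hypothetical weights and inputs, nothing left to draw.
print(predict([1.0, 2.0, 0.0, -1.0, 0.5], [3.0, 7.0, 2.0, 4.0]))
```

The function body never mentions the number of features, which is the practical meaning of "still linear in the parameters".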
The Intercept Trick
Without $w_0$, the hyperplane is forced to pass through the origin. Most real data doesn't: under the one-feature weights above, a house with zero square footage is still predicted at $w_0 = 50$, not zero. The standard fix: prepend a column of ones to the feature matrix.
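For the 1-feature anchor, the design matrix $\mathbf{X}$ with an intercept column is:
$$\mathbf{X} = \begin{bmatrix} 1 & 650 \\ 1 & 850 \\ 1 & 1100 \\ 1 & 1400 \\ 1 & 1600 \\ 1 & 1900 \end{bmatrix}$$
The matrix product $\mathbf{X}\mathbf{w}$ gives predictions for all six samples at once:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} = \begin{bmatrix} 1 & 650 \\ 1 & 850 \\ 1 & 1100 \\ 1 & 1400 \\ 1 & 1600 \\ 1 & 1900 \end{bmatrix} \begin{bmatrix} 50 \\ 0.20 \end{bmatrix} = \begin{bmatrix} 180 \\ 220 \\ 270 \\ 330 \\ 370 \\ 430 \end{bmatrix}$$
For the 2-feature anchor, $\mathbf{X}$ expands to 6×3:
$$\mathbf{X} = \begin{bmatrix} 1 & 650 & 2 \\ 1 & 850 & 2 \\ 1 & 1100 & 3 \\ 1 & 1400 & 3 \\ 1 & 1600 & 4 \\ 1 & 1900 & 4 \end{bmatrix}$$
The model $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$ now holds for all samples simultaneously. This matrix form is how linear models are typically implemented at scale: one matrix product instead of a loop over samples.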
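The matrix form can be sketched as follows. It is written in plain Python so it runs without dependencies, which means `matvec` still loops internally; `design_matrix` and `matvec` are hypothetical helper names, and the data and weights are the two anchors from earlier.

```python
def design_matrix(rows):
    """Prepend a 1 to every feature row so w[0] acts as the intercept."""
    return [[1.0] + list(row) for row in rows]


def matvec(X, w):
    """Compute the product X w: one dot product per row of X."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


# 1-feature anchor: 6 x 2 design matrix.
X1 = design_matrix([[650], [850], [1100], [1400], [1600], [1900]])
# 2-feature anchor: 6 x 3 design matrix.
X2 = design_matrix([[650, 2], [850, 2], [1100, 3],
                    [1400, 3], [1600, 4], [1900, 4]])

y_hat_1 = matvec(X1, [50.0, 0.20])
y_hat_2 = matvec(X2, [30.0, 0.17, 15.0])
print(y_hat_1)
print(y_hat_2)
```

With an array library such as NumPy, `matvec(X, w)` would collapse to the single expression `X @ w`, which is where the "no loops" benefit actually comes from.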
Why This Matters for ML
Every linear model is a hyperplane. Logistic regression uses a hyperplane as a decision boundary — points on one side are class 1, the other class 0. SVMs find the hyperplane with maximum margin. The linear layer in a neural network applies this multiplication at each layer. Understanding the geometry now means every subsequent algorithm is just a variation on how the weights are found.
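To make the decision-boundary idea concrete, here is a minimal sketch. The weights are hypothetical, chosen so that the boundary is the line $x_1 + x_2 = 3$ in a 2-feature space, and `side_of_hyperplane` is my own name for the helper. A real logistic regression would also pass the score through a sigmoid; thresholding that probability at 0.5 is the same as thresholding the score at 0.

```python
def side_of_hyperplane(w, features):
    """Return 1 if w_0 + w_1*x_1 + ... + w_p*x_p >= 0, else 0."""
    score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))
    return 1 if score >= 0 else 0


# Hypothetical boundary x_1 + x_2 = 3, written as w = [w_0, w_1, w_2].
w = [-3.0, 1.0, 1.0]

print(side_of_hyperplane(w, [2.5, 2.0]))  # score is 1.5, above the boundary
print(side_of_hyperplane(w, [0.5, 1.0]))  # score is -1.5, below the boundary
```

The classifier and the regressor share the same expression $\mathbf{w}^\top \mathbf{x}$; only what is done with the resulting number differs.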
The next question is: which $\mathbf{w}$ is best? That requires a loss function.
Geometry Summary
| Dimensions | Equation | Geometric Object | Visualizable? |
|---|---|---|---|
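| 1 feature | $\hat{y} = w_0 + w_1 x$ | Line (2D) | Yes |
| 2 features | $\hat{y} = w_0 + w_1 x_1 + w_2 x_2$ | Plane (3D) | Yes |
| 3 features | $\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$ | Hyperplane (4D) | No |
| $p$ features | $\hat{y} = \mathbf{w}^\top \mathbf{x}$ | Hyperplane ($(p+1)$D) | No |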
Related Concepts and Honest Limitations
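The design matrix $\mathbf{X}$ with a leading column of ones is the same representation used to derive the OLS closed-form solution $\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. Understanding why $\mathbf{X}^\top \mathbf{X}$ appears there requires exactly the matrix form developed here.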
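As a preview of that closed form, here is a minimal sketch that solves the normal equations $(\mathbf{X}^\top \mathbf{X})\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}$ for the one-feature anchor. It assumes the six data points from the first table and uses the explicit 2×2 inverse, which only works for this small case; a general solver would be needed for more features.

```python
sq_ft = [650, 850, 1100, 1400, 1600, 1900]
price = [180, 220, 280, 340, 370, 430]  # $k

X = [[1.0, x] for x in sq_ft]  # design matrix with a leading column of ones

# For one feature, X^T X is 2x2 and X^T y has length 2.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
xty = [sum(row[i] * y for row, y in zip(X, price)) for i in range(2)]

# Solve (X^T X) w = X^T y using the explicit inverse of a 2x2 matrix.
(a, b), (c, d) = xtx
det = a * d - b * c
w0 = (d * xty[0] - b * xty[1]) / det
w1 = (a * xty[1] - c * xty[0]) / det

sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(sq_ft, price))
print(f"w0={w0:.2f}  w1={w1:.4f}  SSE={sse:.2f}")
```

On this data the slope comes out at 0.20 and the intercept at about 53.3, with a total squared residual of about 133.3 against 200 for the illustrative weights. That matches the earlier remark that $w_0 = 50$, $w_1 = 0.20$ is close but not optimal.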
The limitation is linearity itself. If the true relationship between square footage and price curves (prices rise steeply at first, then plateau), a hyperplane can only approximate it. Polynomial regression adds columns such as $x^2$ to $\mathbf{X}$ to handle this, but the model remains linear in the parameters. Genuinely non-linear relationships (e.g., exponential growth, tree-structured decision rules) require a different model class.
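The polynomial case is worth a quick sketch, because "curved in $x$ but linear in the parameters" is easy to misread. The inputs and weights below are hypothetical, and `poly_design_matrix` is an illustrative helper, not a library function.

```python
def poly_design_matrix(xs, degree):
    """Rows of [1, x, x^2, ..., x^degree] for each scalar input x."""
    return [[x ** k for k in range(degree + 1)] for x in xs]


def matvec(X, w):
    """Compute the product X w: one dot product per row of X."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


xs = [0.0, 1.0, 2.0, 3.0]
X = poly_design_matrix(xs, degree=2)  # columns: 1, x, x^2
w = [1.0, 2.0, -0.5]                  # hypothetical weights

y_hat = matvec(X, w)                  # rises, then flattens: curved in x
print(y_hat)

# Linear in the parameters: doubling w doubles every prediction.
doubled = matvec(X, [2 * wi for wi in w])
print(doubled)
```

The prediction step is the same matrix product as before; only the columns of the design matrix changed. That is why the fitting machinery for a hyperplane carries over unchanged.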
Test Your Understanding
- With $w_0 = 50$ and $w_1 = 0.20$, what is $\hat{y}$ for a house of 1250 sq ft? What is the residual if the true price is \$290k?
- Why does prepending a column of ones to $\mathbf{X}$ allow the model to learn a non-zero intercept? What would happen geometrically if you left it out and the true intercept was \$50k?
- A colleague proposes fitting two separate lines, one for small houses and one for large houses, instead of a single hyperplane. When would this be better, and what model class formalizes that idea?
- For the 2-feature case, the coefficient $w_2 = 15$ means each bedroom adds \$15k to price holding sq_ft fixed. How would you confirm this interpretation from the trace table?
- If you have $p$ features and $n$ samples with $p > n$, what does the design matrix $\mathbf{X}$ look like, and why does this cause problems for the OLS formula $\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$?