Simple Linear Regression
Simple linear regression is the most constrained version of the problem: one input, one output, find the best line. The derivation is short enough to do by hand, and doing it by hand is how you internalize why the slope formula involves covariance and variance — not just what the formula is.
The Model
$$\hat{y} = w_0 + w_1 x$$
$w_0$ is the intercept, $w_1$ is the slope. "Simple" means one predictor. For every additional square foot, the predicted price increases by $w_1$ thousand dollars.
Anchor dataset:
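```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature: living area in square feet, shaped (n_samples, 1) as scikit-learn expects
X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
# Target: price in thousands of dollars
y = np.array([180, 220, 280, 340, 370, 430])
```
The Loss: Sum of Squared Errors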
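The residual for sample $i$ is $e_i = y_i - \hat{y}_i = y_i - w_0 - w_1 x_i$.
The loss is the sum of squared errors:
$$\text{SSE}(w_0, w_1) = \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$$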
Why squared? It penalizes large errors more than small ones, treats over- and under-prediction symmetrically, and is differentiable everywhere — properties that allow a closed-form solution.
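As a baseline, try a deliberately naive model: $w_0 = w_1 = 0$ (always predict zero):
$$\text{SSE}_{\text{naive}} = \sum_{i} y_i^2 = 180^2 + 220^2 + 280^2 + 340^2 + 370^2 + 430^2 = 596600$$
Our goal is to find $w_0$ and $w_1$ that drive SSE far below this.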
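A minimal sketch of the loss as code, assuming NumPy and the anchor data flattened to 1-D arrays. The helper name `sse` is introduced here for illustration and is not part of any library:

```python
import numpy as np

# Anchor data, repeated so the snippet runs on its own
x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)  # sq_ft
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)      # price in $k

def sse(w0, w1, x, y):
    """Sum of squared errors for the line y_hat = w0 + w1 * x (illustrative helper)."""
    residuals = y - (w0 + w1 * x)
    return float(np.sum(residuals ** 2))

# Naive baseline: always predict zero
naive_sse = sse(0.0, 0.0, x, y)
print(naive_sse)  # 596600.0
```

Any candidate pair of weights can be scored the same way, which makes it easy to compare a guess against the baseline.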
Deriving the OLS Formulas
Set the partial derivatives to zero:
$$\frac{\partial \text{SSE}}{\partial w_0} = -2 \sum (y_i - w_0 - w_1 x_i) = 0 \implies n w_0 + w_1 \sum x_i = \sum y_i \tag{1}$$
$$\frac{\partial \text{SSE}}{\partial w_1} = -2 \sum x_i(y_i - w_0 - w_1 x_i) = 0 \implies w_0 \sum x_i + w_1 \sum x_i^2 = \sum x_i y_i \tag{2}$$
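Equations (1) and (2) are the normal equations. Solving them yields the OLS estimates:
$$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
The slope is the covariance between $x$ and $y$ divided by the variance of $x$: how much $y$ moves per unit movement in $x$, adjusted for $x$'s own spread.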
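For readers who want the algebra between the system (1), (2) and the slope formula, here is one way to fill it in, writing $\bar{x}$ and $\bar{y}$ for the sample means. Dividing (1) by $n$ gives $w_0 = \bar{y} - w_1 \bar{x}$. Substituting that into (2) and using $\sum_i x_i = n\bar{x}$:

$$
\begin{aligned}
(\bar{y} - w_1 \bar{x})\, n\bar{x} + w_1 \sum_i x_i^2 &= \sum_i x_i y_i \\
w_1 \Big( \sum_i x_i^2 - n\bar{x}^2 \Big) &= \sum_i x_i y_i - n\bar{x}\bar{y} \\
w_1 &= \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{aligned}
$$

The last equality uses the identities $\sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - n\bar{x}^2$ and $\sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - n\bar{x}\bar{y}$.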
Computing OLS by Hand on the Anchor
Step 1: Summary statistics
$$n = 6, \qquad \bar{x} = \frac{7500}{6} = 1250, \qquad \bar{y} = \frac{1820}{6} \approx 303.33$$
Step 2: Per-sample products
| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})^2$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|---|
| 650 | 180 | −600 | −123.33 | 360000 | 74000 |
| 850 | 220 | −400 | −83.33 | 160000 | 33333 |
| 1100 | 280 | −150 | −23.33 | 22500 | 3500 |
| 1400 | 340 | 150 | 36.67 | 22500 | 5500 |
| 1600 | 370 | 350 | 66.67 | 122500 | 23333 |
| 1900 | 430 | 650 | 126.67 | 422500 | 82333 |
| Σ | | | | 1110000 | 222000 |
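Step 3: Compute weights
$$w_1 = \frac{222000}{1110000} = 0.20, \qquad w_0 = \bar{y} - w_1 \bar{x} = 303.33 - 0.20 \times 1250 = 53.33$$
Final model:
$$\hat{y} = 53.33 + 0.20\,x$$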
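The three steps above can be sketched in NumPy as a check on the hand arithmetic. The variable names (`s_xx`, `s_xy`, and so on) are chosen here for illustration:

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)  # sq_ft
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)      # price in $k

# Step 1: summary statistics
x_bar, y_bar = x.mean(), y.mean()

# Step 2: per-sample deviations and their products
dx = x - x_bar
dy = y - y_bar
s_xx = np.sum(dx ** 2)   # sum of squared x-deviations
s_xy = np.sum(dx * dy)   # sum of cross-products

# Step 3: weights
w1 = s_xy / s_xx
w0 = y_bar - w1 * x_bar

print(f"x_bar={x_bar:.2f}, y_bar={y_bar:.2f}")
print(f"S_xx={s_xx:.0f}, S_xy={s_xy:.0f}")
print(f"w0={w0:.2f}, w1={w1:.4f}")
```

The printed totals should match the Σ row of the table, and the weights should match the final model.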
Predictions and Residuals
| $x_i$ | $y_i$ | $\hat{y}_i$ | $e_i = y_i - \hat{y}_i$ | $e_i^2$ |
|---|---|---|---|---|
| 650 | 180 | 183.33 | −3.33 | 11.11 |
| 850 | 220 | 223.33 | −3.33 | 11.11 |
| 1100 | 280 | 273.33 | 6.67 | 44.44 |
| 1400 | 340 | 333.33 | 6.67 | 44.44 |
| 1600 | 370 | 373.33 | −3.33 | 11.11 |
| 1900 | 430 | 433.33 | −3.33 | 11.11 |
| SSE | | | | 133.33 |
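SSE dropped from 596600 (naive model) to 133.33, roughly a 4500× reduction. No other line can achieve a lower SSE on this data, because OLS minimizes SSE by construction. The Gauss-Markov theorem is a separate statement: when the errors have zero mean, constant variance, and are uncorrelated, OLS has the lowest variance among linear unbiased estimators.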
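A small numerical check of the claim that no nearby line does better, assuming NumPy. The `sse` helper and the perturbation sizes are arbitrary choices made for this sketch:

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)  # sq_ft
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)      # price in $k

# OLS weights from the hand computation (w0 = 160/3, about 53.33; w1 = 0.20)
w0, w1 = 160 / 3, 0.20

def sse(a, b):
    """SSE of the line y_hat = a + b * x on the anchor data (illustrative helper)."""
    return float(np.sum((y - (a + b * x)) ** 2))

residuals = y - (w0 + w1 * x)
best = sse(w0, w1)
print(np.round(residuals, 2))
print(round(best, 2))

# Nudge each weight in both directions; every perturbed line should score worse
for d0, d1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.001), (0.0, -0.001)]:
    assert sse(w0 + d0, w1 + d1) > best
```

Four perturbations are not a proof, of course; the proof is that SSE is a convex quadratic in the weights and the normal equations pick out its single minimum.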
[Figure: scatter plot of the six anchor points, price (in thousands of dollars) against sq_ft, with the fitted line ŷ = 53.33 + 0.20·sqft, the intercept w₀ = 53.33 marked, a rise/run = 0.20 annotation, and a dashed residual segment at each point (−3.3, −3.3, +6.7, +6.7, −3.3, −3.3).]
Code Implementation
```python
# Reuses X, y and the imports from the anchor dataset block
model = LinearRegression()
model.fit(X, y)
print(f"w₀ (intercept): {model.intercept_:.2f}")
print(f"w₁ (slope): {model.coef_[0]:.4f}")
```
```
w₀ (intercept): 53.33
w₁ (slope): 0.2000
```
```python
# predict expects a 2-D array of shape (n_samples, n_features)
new_house = np.array([[1000]])
print(f"Predicted price for 1000 sq_ft: ${model.predict(new_house)[0]:.1f}k")
```
```
Predicted price for 1000 sq_ft: $253.3k
```
Manual check: $53.33 + 0.20 \times 1000 = 253.33$, i.e. about \$253.3k ✓
Assumptions
Four conditions are required for OLS to behave well:
- Linearity: the true relationship is $y = w_0 + w_1 x + \varepsilon$, where $\varepsilon$ is a zero-mean error term.
- Independence: residuals are independent across samples.
- Homoscedasticity: the variance of residuals is constant (doesn't grow with $x$).
- Normality: residuals follow a normal distribution (needed for inference, not prediction).
Violation check: plot residuals vs fitted values. Random scatter around zero is consistent with the assumptions. A fan shape suggests heteroscedasticity. A curved pattern suggests the linearity assumption is wrong.
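A minimal sketch of the numbers behind that plot, assuming NumPy only. `np.polyfit` is used here as a stand-in for the fit, and the plotting itself is left out (the two arrays can be handed to any scatter-plot routine):

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)  # sq_ft
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)      # price in $k

# Degree-1 least-squares fit; polyfit returns coefficients highest power first
w1, w0 = np.polyfit(x, y, deg=1)

fitted = w0 + w1 * x
residuals = y - fitted

# These pairs are what a residual-vs-fitted scatter plot would show
for f, r in zip(fitted, residuals):
    print(f"fitted={f:7.2f}  residual={r:+6.2f}")
```

With only six points, any apparent pattern is weak evidence either way; the check becomes more informative on larger samples.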
Key Formulas
| Formula | Expression |
|---|---|
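| Slope | $w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ |
| Intercept | $w_0 = \bar{y} - w_1 \bar{x}$ |
| Prediction | $\hat{y} = w_0 + w_1 x$ |
| SSE | $\sum_i (y_i - \hat{y}_i)^2$ |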
Related Concepts and Honest Limitations
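The OLS formula is a closed-form solution, and it exists because the squared-error loss is quadratic in the weights. For logistic regression or neural networks, the loss is no longer quadratic in the weights, so in general no closed form exists and an iterative optimizer such as gradient descent is needed. Multiple linear regression extends this to $p$ features using the matrix form $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$, where $X$ includes a column of ones for the intercept; it reduces to the formulas here when $p = 1$.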
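As a sketch of the matrix form on the anchor data, assuming NumPy and a design matrix with a leading column of ones for the intercept (the name `X_design` is chosen here for illustration):

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)  # sq_ft
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)      # price in $k

# Design matrix: a column of ones (intercept) next to the single feature
X_design = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) w = X^T y rather than forming an explicit inverse
w = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
w0, w1 = w
print(f"w0={w0:.2f}, w1={w1:.4f}")
```

Solving the linear system avoids computing the inverse explicitly; for ill-conditioned problems, `np.linalg.lstsq` is typically the more robust choice.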
The limitation of simple linear regression is its linearity. If the true relationship between sq_ft and price is non-linear — plateauing at large houses, for example — the best-fit line will have systematic residual patterns (curvature in the residual-vs-fitted plot). That's the signal to add polynomial features or switch model class.
Test Your Understanding
- For our 6-sample anchor, verify that the sum of residuals $\sum_i e_i = 0$. Is this always true for OLS? Why?
- If you doubled all $y$ values (every price twice as large), how would $w_0$ and $w_1$ change? Use the OLS formulas to derive this, not trial and error.
- The SSE dropped from 596600 to 133.33 with OLS weights. Could you find weights that give SSE = 0? If not, why not? If yes, what would that mean about the data?
- What is the residual for a house with sq_ft = 1250? Is the model over- or under-predicting?
- A colleague argues that maximizing $R^2$ is the same as minimizing SSE. Are they correct? Under what conditions would these objectives give different results?