Simple Linear Regression


Simple linear regression is the most constrained version of the problem: one input, one output, find the best line. The derivation is short enough to do by hand, and doing it by hand is how you internalize why the slope formula involves covariance and variance — not just what the formula is.

The Model

$\hat{y} = w_{0} + w_{1} x$

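$w_{0}$ is the intercept, $w_{1}$ is the slope. "Simple" means one predictor. In the running example, $x$ is a house's square footage and $y$ is its price in thousands of dollars, so for every additional square foot the predicted price increases by $w_{1}$ thousand dollars.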

Anchor dataset:

python

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
y = np.array([180, 220, 280, 340, 370, 430])

The Loss: Sum of Squared Errors

The residual for sample $i$ is $ε_{i} = y_{i} - \hat{y}_{i} = y_{i} - (w_{0} + w_{1} x_{i})$.

The loss is the sum of squared errors:

$SSE = \sum_{i = 1}^{n} ε_{i}^{2} = \sum_{i = 1}^{n} (y_{i} - w_{0} - w_{1} x_{i})^{2}$

Why squared? It penalizes large errors more than small ones, treats over- and under-prediction symmetrically, and is differentiable everywhere — properties that allow a closed-form solution.

As a baseline, try a deliberately naive model: $w_{0} = w_{1} = 0$ (always predict zero):

$SSE_{naive} = 180^{2} + 220^{2} + 280^{2} + 340^{2} + 370^{2} + 430^{2} = 32400 + 48400 + 78400 + 115600 + 136900 + 184900 = 596600$

Our goal is to find $w_{0}$ and $w_{1}$ that drive SSE far below this.
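The baseline figure can be reproduced in a few lines. This is a minimal sketch on the anchor arrays; the helper name `sse` is an illustrative choice, not something defined elsewhere in this article.

```python
import numpy as np

# Anchor dataset: square footage and price in thousands of dollars.
x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def sse(w0, w1, x, y):
    """Sum of squared errors for the line y_hat = w0 + w1 * x (hypothetical helper)."""
    residuals = y - (w0 + w1 * x)
    return float(np.sum(residuals ** 2))

# Naive model: always predict zero.
sse_naive = sse(0.0, 0.0, x, y)
print(sse_naive)  # 596600.0
```

Any candidate pair $(w_{0}, w_{1})$ can be passed to the same helper to see how far below the baseline it lands.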

Deriving the OLS Formulas

Set the partial derivatives to zero:

$\frac{\partial \text{SSE}}{\partial w_0} = -2 \sum (y_i - w_0 - w_1 x_i) = 0 \implies n w_0 + w_1 \sum x_i = \sum y_i \tag{1}$

$\frac{\partial \text{SSE}}{\partial w_1} = -2 \sum x_i(y_i - w_0 - w_1 x_i) = 0 \implies w_0 \sum x_i + w_1 \sum x_i^2 = \sum x_i y_i \tag{2}$

Equations (1) and (2) are the OLS normal equations. Solving this $2 \times 2$ system for $w_{0}$ and $w_{1}$ gives:

$w_{1} = \frac{\sum x_{i} y_{i} - n \bar{x} \bar{y}}{\sum x_{i}^{2} - n \bar{x}^{2}} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$

$w_{0} = \bar{y} - w_{1} \bar{x}$

The slope is the covariance between $x$ and $y$ divided by the variance of $x$: how much $y$ moves per unit movement in $x$, adjusted for $x$'s own spread.
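As a quick numerical cross-check, the normal equations (1) and (2) can be solved directly as a $2 \times 2$ linear system and compared with the covariance-over-variance form. This is a sketch using NumPy on the anchor data; the variable names are illustrative, and both covariance and variance use the same $1/n$ normalization so the factor cancels.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)
n = len(x)

# Normal equations (1) and (2) written as A @ [w0, w1] = b.
A = np.array([[n,       x.sum()],
              [x.sum(), np.sum(x ** 2)]])
b = np.array([y.sum(), np.sum(x * y)])
w0_sys, w1_sys = np.linalg.solve(A, b)

# Closed form: slope = Cov(x, y) / Var(x), intercept from the means.
# bias=True and np.var's default ddof=0 both divide by n.
cov_xy = np.cov(x, y, bias=True)[0, 1]
var_x = np.var(x)
w1_cf = cov_xy / var_x
w0_cf = y.mean() - w1_cf * x.mean()

print(f"system:      w0={w0_sys:.2f}, w1={w1_sys:.4f}")
print(f"closed form: w0={w0_cf:.2f}, w1={w1_cf:.4f}")
```

Both routes should agree up to floating-point error, since the closed form is just the algebraic solution of the same system.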

Computing OLS by Hand on the Anchor

Step 1: Summary statistics

$\bar{x} = \frac{650 + 850 + 1100 + 1400 + 1600 + 1900}{6} = \frac{7500}{6} = 1250$

$\bar{y} = \frac{180 + 220 + 280 + 340 + 370 + 430}{6} = \frac{1820}{6} = 303.33$

Step 2: Per-sample products

$x_{i}$	$y_{i}$	$x_{i} - \bar{x}$	$y_{i} - \bar{y}$	$(x_{i} - \bar{x})^{2}$	$(x_{i} - \bar{x}) (y_{i} - \bar{y})$
650	180	−600	−123.33	360000	74000
850	220	−400	−83.33	160000	33333
1100	280	−150	−23.33	22500	3500
1400	340	150	36.67	22500	5500
1600	370	350	66.67	122500	23333
1900	430	650	126.67	422500	82333
Σ				1110000	222000

Step 3: Compute weights

$w_{1} = \frac{222000}{1110000} = 0.2000$

$w_{0} = 303.33 - 0.2000 \times 1250 = 303.33 - 250 = 53.33$

Final model: $\hat{y} = 53.33 + 0.20 \times \text{sq\_ft}$
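The three hand-computation steps can be mirrored in NumPy to check the table sums. This is a minimal sketch; the names `dx`, `dy`, `sxx`, and `sxy` are shorthand introduced here for the deviation columns and their totals.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Step 1: summary statistics.
x_bar, y_bar = x.mean(), y.mean()

# Step 2: per-sample deviations and products (the table columns).
dx = x - x_bar
dy = y - y_bar
for xi, yi, dxi, dyi in zip(x, y, dx, dy):
    print(f"{xi:6.0f} {yi:5.0f} {dxi:7.0f} {dyi:8.2f} {dxi**2:9.0f} {dxi*dyi:9.0f}")

sxx = np.sum(dx ** 2)   # sum of squared x-deviations
sxy = np.sum(dx * dy)   # sum of cross-products
print(f"sums: {sxx:.0f} {sxy:.0f}")

# Step 3: weights.
w1 = sxy / sxx
w0 = y_bar - w1 * x_bar
print(f"w1={w1:.4f}, w0={w0:.2f}")
```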

Predictions and Residuals

$x_{i}$	$y_{i}$	$\hat{y}_{i} = 53.33 + 0.20 x_{i}$	$ε_{i} = y_{i} - \hat{y}_{i}$	$ε_{i}^{2}$
650	180	183.33	−3.33	11.11
850	220	223.33	−3.33	11.11
1100	280	273.33	6.67	44.44
1400	340	333.33	6.67	44.44
1600	370	373.33	−3.33	11.11
1900	430	433.33	−3.33	11.11
			SSE	133.33

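SSE dropped from 596600 (naive model) to 133.33, roughly a 4,500× reduction (596600 / 133.33 ≈ 4474). No other line can achieve a lower SSE on this data, because OLS minimizes SSE by construction. Separately, the Gauss-Markov theorem states that when the errors have zero mean, constant variance, and are uncorrelated, OLS has the lowest variance among all linear unbiased estimators.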
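The residual table and the size of the improvement can be verified numerically. This sketch refits the line from the anchor data rather than using the rounded weights, so the squared residuals are computed at full precision; variable names are illustrative.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# OLS weights from the closed-form expressions.
dx = x - x.mean()
w1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
w0 = y.mean() - w1 * x.mean()

y_hat = w0 + w1 * x
residuals = y - y_hat

sse_ols = float(np.sum(residuals ** 2))
sse_naive = float(np.sum(y ** 2))   # always-predict-zero baseline
ratio = sse_naive / sse_ols

print(np.round(residuals, 2))       # per-sample residuals
print(f"SSE (OLS):   {sse_ols:.2f}")
print(f"SSE (naive): {sse_naive:.0f}")
print(f"reduction:   {ratio:.1f}x")
```

Perturbing `w0` or `w1` slightly and recomputing the SSE is a simple way to see that any move away from the OLS weights increases it.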

[Figure: scatter plot of price (thousands of dollars) against sq_ft for the six anchor points, with the fitted line ŷ = 53.33 + 0.20·sq_ft (intercept w₀ = 53.33, rise/run = 0.20) and dashed segments marking the residuals: −3.3, −3.3, +6.7, +6.7, −3.3, −3.3.]

Code Implementation

python

model = LinearRegression()
model.fit(X, y)

print(f"w₀ (intercept): {model.intercept_:.2f}")
print(f"w₁ (slope):     {model.coef_[0]:.4f}")

w₀ (intercept): 53.33
w₁ (slope):     0.2000

python

new_house = np.array([[1000]])
print(f"Predicted price for 1000 sq_ft: ${model.predict(new_house)[0]:.1f}k")

Predicted price for 1000 sq_ft: $253.3k

Manual check: $53.33 + 0.20 \times 1000 = 253.33$, i.e. \$253.3k ✓

Assumptions

Four conditions are required for OLS to behave well:

Linearity — the true relationship is $E[y \mid x] = w_{0} + w_{1} x$.
Independence — residuals are independent across samples.
Homoscedasticity — the variance of residuals is constant (doesn't grow with $x$ ).
Normality — residuals follow a normal distribution (needed for inference, not prediction).

Violation check: plot residuals vs fitted values. Random scatter around zero is consistent with linearity and homoscedasticity, though it does not by itself verify independence or normality. A fan shape suggests heteroscedasticity. A curved pattern suggests the linearity assumption is wrong.
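When a plot is not at hand, the same residual-vs-fitted information can be inspected as numbers. The sketch below lists the pairs in order of fitted value and compares residual spread between the lower and upper halves. That comparison is a crude heuristic introduced here for illustration, not a formal test, and six points are far too few to draw conclusions from.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# np.polyfit returns coefficients highest degree first: [slope, intercept].
w1, w0 = np.polyfit(x, y, deg=1)
fitted = w0 + w1 * x
residuals = y - fitted

# Residuals listed in order of fitted value (what the plot would show).
order = np.argsort(fitted)
for f, r in zip(fitted[order], residuals[order]):
    print(f"fitted={f:7.2f}  residual={r:+6.2f}")

# Crude fan-shape heuristic: does the spread grow with the fitted value?
half = len(x) // 2
spread_low = np.std(residuals[order[:half]])
spread_high = np.std(residuals[order[half:]])
print(f"spread (low half):  {spread_low:.2f}")
print(f"spread (high half): {spread_high:.2f}")
```

A much larger spread in the high half would hint at a fan shape, while a run of same-signed residuals in the middle of the listing would hint at curvature.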

Key Formulas

Formula	Expression
Slope	$w_{1} = \mathrm{Cov}(x, y) / \mathrm{Var}(x) = \sum (x_{i} - \bar{x}) (y_{i} - \bar{y}) / \sum (x_{i} - \bar{x})^{2}$
Intercept	$w_{0} = \bar{y} - w_{1} \bar{x}$
Prediction	$\hat{y} = w_{0} + w_{1} x$
SSE	$\sum (y_{i} - \hat{y}_{i})^{2}$

The OLS formula is a closed-form solution, available because the squared-error loss is quadratic in the weights. For logistic regression or neural networks, the loss is no longer quadratic in the weights, so no closed form exists in general and an iterative optimizer such as gradient descent is used instead. Multiple linear regression extends this to $p$ features using the matrix form $w^{*} = (X^{\top} X)^{-1} X^{\top} y$, where $X$ includes a column of ones for the intercept; it reduces to the formulas here when $p = 1$.
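To see the matrix form reduce to the simple-regression weights, the sketch below builds a design matrix with a column of ones and solves the normal equations on the anchor data. It assumes the intercept is handled through that extra column, and it uses `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Design matrix: a column of ones (intercept) next to the single feature.
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) w = X^T y instead of inverting X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver as a cross-check.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"normal equations: w0={w[0]:.2f}, w1={w[1]:.4f}")
print(f"lstsq:            w0={w_lstsq[0]:.2f}, w1={w_lstsq[1]:.4f}")
```

With more features, the only change is additional columns in `X`; the solve step stays the same.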

The limitation of simple linear regression is its linearity. If the true relationship between sq_ft and price is non-linear — plateauing at large houses, for example — the best-fit line will have systematic residual patterns (curvature in the residual-vs-fitted plot). That's the signal to add polynomial features or switch model class.

Test Your Understanding

For our 6-sample anchor, verify that the sum of residuals $\sum ε_{i} = 0$ . Is this always true for OLS? Why?
If you doubled all $y$ values (prices in \$2k increments instead of \$k), how would $w_{0}$ and $w_{1}$ change? Use the OLS formulas to derive this, not trial and error.
The SSE dropped from 596600 to 133.33 with OLS weights. Could you find weights that give SSE = 0? If not, why not? If yes, what would that mean about the data?
What is the residual for a house with sq_ft = 1250? Is the model over- or under-predicting?
A colleague argues that maximizing $R^{2}$ is the same as minimizing SSE. Are they correct? Under what conditions would these objectives give different results?
