Simple Linear Regression

Jun 25, 2026 · 6 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Simple linear regression is the most constrained version of the problem: one input, one output, find the best line. The derivation is short enough to do by hand, and doing it by hand is how you internalize why the slope formula involves covariance and variance — not just what the formula is.

The Model

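$$\hat{y} = w_0 + w_1 x$$

$w_0$ is the intercept, $w_1$ is the slope. "Simple" means one predictor. With $x$ in square feet and $\hat{y}$ in thousands of dollars, every additional square foot increases the predicted price by $w_1$ thousand dollars.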

Anchor dataset:

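```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Square footage: one feature, shaped (n_samples, 1) as scikit-learn expects
X = np.array([650, 850, 1100, 1400, 1600, 1900]).reshape(-1, 1)
# Sale price in thousands of dollars
y = np.array([180, 220, 280, 340, 370, 430])
```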

The Loss: Sum of Squared Errors

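The residual for sample $i$ is $e_i = y_i - \hat{y}_i = y_i - (w_0 + w_1 x_i)$.

The loss is the sum of squared errors:

$$\text{SSE}(w_0, w_1) = \sum e_i^2 = \sum (y_i - w_0 - w_1 x_i)^2$$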

Why squared? It penalizes large errors more than small ones, treats over- and under-prediction symmetrically, and is differentiable everywhere — properties that allow a closed-form solution.

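As a baseline, try a deliberately naive model: $w_0 = 0$, $w_1 = 0$ (always predict zero):

$$\text{SSE}(0, 0) = \sum y_i^2 = 180^2 + 220^2 + 280^2 + 340^2 + 370^2 + 430^2 = 596600$$

Our goal is to find $w_0$ and $w_1$ that drive SSE far below this.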
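The baseline figure is easy to reproduce. The following is a minimal sketch; the helper name `sse` is ours for illustration and does not come from any library.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def sse(w0, w1, x, y):
    """Sum of squared errors for the line y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return float(np.sum(residuals ** 2))

# Naive model: both weights zero, so every prediction is zero
baseline = sse(0.0, 0.0, x, y)
print(baseline)  # 596600.0
```

Any candidate pair of weights can be scored the same way, which makes it simple to compare guesses before deriving the optimum.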

Deriving the OLS Formulas

Set the partial derivatives to zero:

$$\frac{\partial \text{SSE}}{\partial w_0} = -2 \sum (y_i - w_0 - w_1 x_i) = 0 \implies n w_0 + w_1 \sum x_i = \sum y_i \tag{1}$$

$$\frac{\partial \text{SSE}}{\partial w_1} = -2 \sum x_i(y_i - w_0 - w_1 x_i) = 0 \implies w_0 \sum x_i + w_1 \sum x_i^2 = \sum x_i y_i \tag{2}$$
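Before solving, the two partial derivatives can be sanity-checked against central finite differences. This is only a sketch: the test point `(10.0, 0.1)` and the step `h` are arbitrary choices of ours, and the function names are illustrative.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def sse(w0, w1):
    return np.sum((y - w0 - w1 * x) ** 2)

def grad(w0, w1):
    """Analytic partial derivatives of SSE with respect to w0 and w1."""
    r = y - w0 - w1 * x
    return np.array([-2 * np.sum(r), -2 * np.sum(x * r)])

# Central finite differences at an arbitrary test point
w0, w1, h = 10.0, 0.1, 1e-4
numeric = np.array([
    (sse(w0 + h, w1) - sse(w0 - h, w1)) / (2 * h),
    (sse(w0, w1 + h) - sse(w0, w1 - h)) / (2 * h),
])

print(grad(w0, w1))  # analytic
print(numeric)       # numeric, should agree closely
```

Because SSE is quadratic in the weights, central differences match the analytic gradient up to floating-point rounding.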

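Equations (1) and (2) are the normal equations. Solving this system yields the OLS estimates:

$$w_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$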

The slope is the covariance between $x$ and $y$ divided by the variance of $x$: how much $y$ moves per unit movement in $x$, adjusted for $x$'s own spread.
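The covariance-over-variance reading can be checked directly with NumPy's statistics functions. A small sketch, assuming the same six data points; the only subtlety is that both quantities must use the same normalization so that the factor cancels in the ratio.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# np.cov returns the 2x2 covariance matrix (normalized by n - 1 by default);
# entry [0, 1] is cov(x, y)
cov_xy = np.cov(x, y)[0, 1]
# ddof=1 makes np.var use the same n - 1 normalization as np.cov
var_x = np.var(x, ddof=1)

w1 = cov_xy / var_x
w0 = y.mean() - w1 * x.mean()
print(f"w1 = {w1:.4f}, w0 = {w0:.2f}")
```

Mixing `np.cov` (default `n - 1`) with `np.var` at its default `ddof=0` would scale the slope by `n / (n - 1)`, a common source of small discrepancies.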

Computing OLS by Hand on the Anchor

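Step 1: Summary statistics

$$n = 6, \quad \sum x_i = 7500, \quad \sum y_i = 1820, \quad \bar{x} = \frac{7500}{6} = 1250, \quad \bar{y} = \frac{1820}{6} \approx 303.33$$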

Step 2: Per-sample products

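| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})^2$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|---|
| 650 | 180 | −600 | −123.33 | 360000 | 74000 |
| 850 | 220 | −400 | −83.33 | 160000 | 33333 |
| 1100 | 280 | −150 | −23.33 | 22500 | 3500 |
| 1400 | 340 | 150 | 36.67 | 22500 | 5500 |
| 1600 | 370 | 350 | 66.67 | 122500 | 23333 |
| 1900 | 430 | 650 | 126.67 | 422500 | 82333 |
| Σ | | | | 1110000 | 222000 |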

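Step 3: Compute weights

$$w_1 = \frac{222000}{1110000} = 0.20, \qquad w_0 = \bar{y} - w_1 \bar{x} = 303.33 - 0.20 \times 1250 = 53.33$$

Final model:

$$\hat{y} = 53.33 + 0.20 \cdot \text{sq\_ft}$$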
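Steps 1 to 3 can be replayed in NumPy to confirm the hand arithmetic. This is a sketch that mirrors the hand calculation; the names `sxx` and `sxy` are our shorthand for the two column sums.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Step 1: summary statistics
x_bar, y_bar = x.mean(), y.mean()

# Step 2: per-sample deviations and their column sums
dx = x - x_bar
dy = y - y_bar
sxx = np.sum(dx ** 2)   # sum of squared x deviations
sxy = np.sum(dx * dy)   # sum of cross products

# Step 3: weights
w1 = sxy / sxx
w0 = y_bar - w1 * x_bar

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.2f}")
print(f"Sxx = {sxx:.0f}, Sxy = {sxy:.0f}")
print(f"w1 = {w1:.4f}, w0 = {w0:.2f}")
```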

Predictions and Residuals

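| $x_i$ | $y_i$ | $\hat{y}_i$ | $e_i = y_i - \hat{y}_i$ | $e_i^2$ |
|---|---|---|---|---|
| 650 | 180 | 183.33 | −3.33 | 11.11 |
| 850 | 220 | 223.33 | −3.33 | 11.11 |
| 1100 | 280 | 273.33 | 6.67 | 44.44 |
| 1400 | 340 | 333.33 | 6.67 | 44.44 |
| 1600 | 370 | 373.33 | −3.33 | 11.11 |
| 1900 | 430 | 433.33 | −3.33 | 11.11 |
| SSE | | | | 133.33 |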

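SSE dropped from 596600 (naive model) to 133.33, a reduction of roughly 4500×. No other line can achieve a lower SSE on this data, because OLS minimizes SSE by construction. Separately, the Gauss-Markov theorem states that under the classical assumptions OLS has the lowest variance among linear unbiased estimators; that is a statement about the estimator, not about SSE on one dataset.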
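The residuals, both SSE values, and the optimality claim can be checked numerically. A minimal sketch: the helpers `fit_line` and `sse` are ours, and the perturbation sizes are arbitrary.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

def fit_line(x, y):
    """Closed-form OLS weights (w0, w1) for a single predictor."""
    dx = x - x.mean()
    w1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    return y.mean() - w1 * x.mean(), w1

def sse(w0, w1):
    return float(np.sum((y - (w0 + w1 * x)) ** 2))

w0, w1 = fit_line(x, y)
residuals = y - (w0 + w1 * x)
sse_ols = sse(w0, w1)
sse_naive = sse(0.0, 0.0)

print(np.round(residuals, 2))
print(f"SSE (OLS) = {sse_ols:.2f}, SSE (naive) = {sse_naive:.0f}")
print(f"reduction factor = {sse_naive / sse_ols:.1f}")

# Nudging either weight away from the OLS solution raises SSE
for dw0, dw1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.01), (0.0, -0.01)]:
    assert sse(w0 + dw0, w1 + dw1) > sse_ols
```

The perturbation loop is not a proof of optimality, only a quick spot check; the proof is the derivation above, since SSE is a convex quadratic with a single stationary point.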

[Figure: scatter plot of price ($k) against sq_ft with the fitted line ŷ = 53.33 + 0.20·sqft (intercept w₀ = 53.33, rise/run = 0.20). Dashed segments mark the six residuals: −3.3, −3.3, +6.7, +6.7, −3.3, −3.3.]

Code Implementation

```python
# X and y are the anchor arrays defined earlier
model = LinearRegression()
model.fit(X, y)

print(f"w₀ (intercept): {model.intercept_:.2f}")
print(f"w₁ (slope):     {model.coef_[0]:.4f}")
```

```
w₀ (intercept): 53.33
w₁ (slope):     0.2000
```

```python
# predict expects a 2-D array of shape (n_samples, n_features)
new_house = np.array([[1000]])
print(f"Predicted price for 1000 sq_ft: ${model.predict(new_house)[0]:.1f}k")
```

```
Predicted price for 1000 sq_ft: $253.3k
```

Manual check: $53.33 + 0.20 \times 1000 = 253.33$, which is \$253.3k ✓

Assumptions

Four conditions are required for OLS to behave well:

  1. Linearity: the true relationship is $y = w_0 + w_1 x + \varepsilon$.
  2. Independence: residuals are independent across samples.
  3. Homoscedasticity: the variance of residuals is constant (doesn't grow with $x$).
  4. Normality: residuals follow a normal distribution (needed for inference, not prediction).

Violation check: plot residuals against fitted values. Random scatter around zero is consistent with the assumptions, although it cannot prove them. A fan shape suggests heteroscedasticity. A curved pattern suggests the linearity assumption is wrong.
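The same check can be roughed out without a plotting library. The sketch below prints the residual-versus-fitted pairs as text and computes two numeric hints; both hints are heuristics we chose for illustration, not formal statistical tests, and with only six samples they carry little weight.

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Closed-form OLS fit
dx = x - x.mean()
w1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
w0 = y.mean() - w1 * x.mean()

fitted = w0 + w1 * x
residuals = y - fitted

# Text-only stand-in for the residual-vs-fitted plot
for f, r in zip(fitted, residuals):
    print(f"fitted = {f:7.2f}   residual = {r:+6.2f}")

# Heuristic hints (illustrative only):
# - a fan shape shows up as |residual| growing with the fitted value
# - curvature shows up as a nonzero quadratic coefficient when a
#   parabola is fitted to the residuals
fan_hint = np.corrcoef(fitted, np.abs(residuals))[0, 1]
curve_hint = np.polyfit(fitted, residuals, deg=2)[0]
print(f"corr(fitted, |residual|) = {fan_hint:+.2f}")
print(f"quadratic coefficient    = {curve_hint:+.2e}")
```

On a real dataset, a scatter plot of `fitted` against `residuals` remains the better diagnostic, since the eye picks up patterns that two summary numbers miss.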

Key Formulas

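| Formula | Expression |
|---|---|
| Slope | $w_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ |
| Intercept | $w_0 = \bar{y} - w_1 \bar{x}$ |
| Prediction | $\hat{y} = w_0 + w_1 x$ |
| SSE | $\sum (y_i - \hat{y}_i)^2$ |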

The OLS formula is a closed-form solution that exists because the squared-error loss of linear regression is quadratic in the weights. For logistic regression or neural networks, the loss is no longer quadratic in the weights, so no closed form exists and an iterative optimizer such as gradient descent is required. Multiple linear regression extends this to $p$ features using the matrix form $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$, which reduces to the formulas here when $p = 1$.
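The matrix form can be tried on the same six points to see it collapse to the single-predictor result. A sketch under the usual convention (an assumption on our part) that the design matrix carries a leading column of ones for the intercept:

```python
import numpy as np

x = np.array([650, 850, 1100, 1400, 1600, 1900], dtype=float)
y = np.array([180, 220, 280, 340, 370, 430], dtype=float)

# Design matrix: a column of ones for the intercept, then the feature column
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X^T X) w = X^T y, solved without forming an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.2f}, w1 = {w[1]:.4f}")

# np.linalg.lstsq solves the same least-squares problem and is generally the
# safer choice when columns are nearly collinear
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```

Adding more feature columns to `X` is all it takes to move from simple to multiple linear regression; the solve call stays the same.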

The limitation of simple linear regression is its linearity. If the true relationship between sq_ft and price is non-linear — plateauing at large houses, for example — the best-fit line will have systematic residual patterns (curvature in the residual-vs-fitted plot). That's the signal to add polynomial features or switch model class.

Test Your Understanding

  1. For our 6-sample anchor, verify that the sum of residuals $\sum e_i = 0$. Is this always true for OLS? Why?

  2. If you doubled all $y$ values (rescaling the price units), how would $w_0$ and $w_1$ change? Use the OLS formulas to derive this, not trial and error.

  3. The SSE dropped from 596600 to 133.33 with OLS weights. Could you find weights that give SSE = 0? If not, why not? If yes, what would that mean about the data?

  4. What is the residual for a house with sq_ft = 1250? Is the model over- or under-predicting?

  5. A colleague argues that maximizing is the same as minimizing SSE. Are they correct? Under what conditions would these objectives give different results?
