
Logistic Regression: Math Intuition

Jun 26, 2026 • 8 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Logistic regression produces probabilities, not class labels. Understanding where those probabilities come from — and why binary cross-entropy is the right loss — requires tracing the full path from raw linear score to gradient update. Every formula here is computed on concrete numbers.

Anchor dataset: loan default prediction (6 samples for hand-trace clarity). Income is in thousands, and $y = 1$ marks a default.

```python
import numpy as np

# Income in thousands; y = 1 means the borrower defaulted.
X = np.array([25, 32, 45, 75, 95, 110]).reshape(-1, 1)
y = np.array([1,   1,  1,  0,  0,   0])

# Weights used for the hand trace (illustrative, not the starting weights):
# w₀ = 8.12, w₁ = -0.094
```

Step 1: Linear Score → Sigmoid → Probability

The raw linear score is computed the same way as in linear regression:

$z = w_{0} + w_{1} \times income$

The sigmoid function converts this to a probability:

$σ (z) = \frac{1}{1 + e ^{- z}}, σ (z) \in (0, 1)$

The model predicts $P (y = 1∣ x) = σ (z)$ , which is the probability of default given income.

Trace for all 6 samples with $w_{0} = 8.12$ , $w_{1} = - 0.094$ :

Income	$z = 8.12 - 0.094 \times x$	$σ (z) = 1/ (1 + e^{- z})$	Predicted	$y_{true}$
25	8.12 − 2.35 = 5.77	$1/ (1 + e^{- 5.77})$ = 0.997	1	1 ✓
32	8.12 − 3.01 = 5.11	$1/ (1 + e^{- 5.11})$ = 0.994	1	1 ✓
45	8.12 − 4.23 = 3.89	$1/ (1 + e^{- 3.89})$ = 0.980	1	1 ✓
75	8.12 − 7.05 = 1.07	$1/ (1 + e^{- 1.07})$ = 0.745	1	0 ✗
95	8.12 − 8.93 = −0.81	$1/ (1 + e^{0.81})$ = 0.308	0	0 ✓
110	8.12 − 10.34 = −2.22	$1/ (1 + e^{2.22})$ = 0.098	0	0 ✓

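At decision threshold 0.5, income = 75 gives $σ = 0.745$ , so the model predicts default, which is wrong. These weights are illustrative rather than a converged fit. The dataset is linearly separable (every income up to 45 defaults, every income from 75 up does not), so a fitted model can classify all six samples correctly. Without regularization, the maximum-likelihood weights on separable data keep growing and never settle at finite values.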

[Figure: the sigmoid curve $σ (z)$ plotted against $z$ , with horizontal guides at 0, 0.5, and 1, the decision boundary at $z = 0$ ( $σ = 0.5$ ), and the six samples marked from 25k to 110k.]

Green dots are correctly classified; the red dot at income=75k sits above the 0.5 line but is a non-defaulter (y=0). The S-shape ensures all outputs stay within (0, 1).
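The trace table can be reproduced with a few lines of NumPy. This is a minimal sketch using the illustrative weights from the trace; the names `sigmoid`, `income`, and `y_pred` are chosen here for illustration.

```python
import numpy as np

# Anchor data: income (in thousands) and default labels.
income = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y_true = np.array([1, 1, 1, 0, 0, 0])

# Illustrative weights used in the hand trace.
w0, w1 = 8.12, -0.094


def sigmoid(z):
    """Map a raw linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


z = w0 + w1 * income             # linear score
p = sigmoid(z)                   # P(y = 1 | x)
y_pred = (p >= 0.5).astype(int)  # threshold at 0.5

for x_i, z_i, p_i, yp, yt in zip(income, z, p, y_pred, y_true):
    flag = "ok" if yp == yt else "MISS"
    print(f"income={x_i:5.0f}  z={z_i:6.2f}  p={p_i:.3f}  pred={yp}  true={yt}  {flag}")
```

Only the income = 75 row is flagged as a miss, matching the table.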

Step 2: Log-Odds (Logit) Interpretation

The ratio of default probability to non-default probability is the odds:

$odds = \frac{P ( y = 1 )}{P ( y = 0 )} = \frac{σ ( z )}{1 - σ ( z )}$

Taking the log of odds recovers the linear score exactly:

$\log \left( \frac{P ( y = 1 )}{1 - P ( y = 1 )} \right) = z = w_{0} + w_{1} \times income$

The log-odds (logit) is linear in the features. This means logistic regression is a linear model — it draws a straight boundary in feature space — just applied to log-odds rather than probability directly.

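Log-odds trace for 3 anchor samples (the first row keeps extra decimals, because rounding $σ$ to 0.997 would shift the odds to 332.3 and the log to 5.81):

Income	$σ (z)$	$1 - σ (z)$	Odds	$\log (odds) = z$
25	0.99689	0.00311	320.5	$\log (320.5) = 5.77$ ✓
75	0.745	0.255	2.92	$\log (2.92) = 1.07$ ✓
110	0.098	0.902	0.109	$\log (0.109) = - 2.22$ ✓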

Interpreting $w_{1} = - 0.094$ : each additional \$1k of income changes the log-odds of default by $- 0.094$ . In odds terms, it multiplies the odds of default by:

$e^{- 0.094} = 0.910$

That is a 9% reduction in the odds of default for each additional \$1k of income.
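A quick numerical check of the logit identity and the odds multiplier, sketched with the same illustrative weights. The variable names are illustrative, and `np.log` is the natural logarithm.

```python
import numpy as np

# Three anchor incomes (in thousands) and the illustrative weights.
income = np.array([25, 75, 110], dtype=float)
w0, w1 = 8.12, -0.094

z = w0 + w1 * income
p = 1.0 / (1.0 + np.exp(-z))

odds = p / (1.0 - p)      # P(default) / P(no default)
log_odds = np.log(odds)   # should recover z

print("log-odds equals z:", np.allclose(log_odds, z))

# Odds multiplier for one extra unit (1k) of income.
odds_multiplier = np.exp(w1)
print(f"odds multiplier per 1k: {odds_multiplier:.3f}")

# Same multiplier measured directly: odds at 76k divided by odds at 75k.
p_75 = 1.0 / (1.0 + np.exp(-(w0 + w1 * 75)))
p_76 = 1.0 / (1.0 + np.exp(-(w0 + w1 * 76)))
measured = (p_76 / (1 - p_76)) / (p_75 / (1 - p_75))
print(f"measured ratio 76k vs 75k: {measured:.3f}")
```

Because the log-odds are linear in income, the measured ratio is the same at any starting income, not just at 75k.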

Step 3: Binary Cross-Entropy Loss

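Why not MSE? Squared error is bounded, so it cannot punish confident mistakes hard enough. Take a true label $y = 1$ . A barely wrong prediction $\hat{p} = 0.499$ costs $(1 - 0.499)^{2} = 0.251$ , while a confidently wrong $\hat{p} = 0.001$ costs $(1 - 0.001)^{2} = 0.998$ : only about 4× more, and never above 1. The gradient is worse still. Passed through the sigmoid, it carries a factor $\hat{p} (1 - \hat{p})$ , which shrinks toward zero exactly when the model is confidently wrong, so learning stalls where correction is most needed. MSE composed with a sigmoid is also non-convex in the weights.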

Binary cross-entropy (BCE) penalizes misplaced confidence, not just wrong labels ( $\log$ is the natural logarithm throughout):

$L = - [y \log (p) + (1 - y) \log (1 - p)]$

Three cases:

Correct and confident: $y = 1$ , $p = 0.999$ → $L = - \log (0.999) = 0.001$ (tiny)
Correct and unconfident: $y = 1$ , $p = 0.51$ → $L = - \log (0.51) = 0.673$ (meaningful signal)
Wrong and confident: $y = 1$ , $p = 0.001$ → $L = - \log (0.001) = 6.908$ (large penalty; $L \to \infty$ as $p \to 0$ )
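The three cases can be evaluated side by side with squared error to see how differently the two losses treat a confident mistake. This is a small sketch; the helper names `bce` and `squared_error` are chosen here for illustration.

```python
import math


def bce(y, p):
    """Binary cross-entropy for a single sample (natural log)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def squared_error(y, p):
    """Squared error for a single sample."""
    return (y - p) ** 2


# True label is y = 1 in every case; only the predicted probability changes.
cases = [
    ("correct, confident", 0.999),
    ("correct, unconfident", 0.51),
    ("wrong, confident", 0.001),
]

for label, p in cases:
    print(f"{label:22s} p={p:<6} BCE={bce(1, p):.3f}  SE={squared_error(1, p):.3f}")
```

Squared error can never exceed 1 for probabilities in (0, 1), while BCE keeps growing as the predicted probability of the true class approaches 0.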

Per-sample loss for the 6-sample anchor:

Income	$y$	$p = σ (z)$	$L = - [y \log p + (1 - y) \log (1 - p)]$
25	1	0.997	$- \log (0.997) = 0.003$
32	1	0.994	$- \log (0.994) = 0.006$
45	1	0.980	$- \log (0.980) = 0.020$
75	0	0.745	$- \log (1 - 0.745) = - \log (0.255) = 1.367$
95	0	0.308	$- \log (1 - 0.308) = - \log (0.692) = 0.368$
110	0	0.098	$- \log (1 - 0.098) = - \log (0.902) = 0.103$
Avg loss			$(0.003 + 0.006 + 0.020 + 1.367 + 0.368 + 0.103) /6 = 0.311$

The sample at income=75 dominates the loss (1.367, nearly three quarters of the total) because it is the only misclassified one: the model assigns $p = 0.745$ (likely default) to a non-defaulter. Its large error term is what drives gradient descent to shift the decision boundary toward lower incomes.
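The per-sample losses can be recomputed from unrounded probabilities. This is a minimal sketch with the illustrative weights, so the last digit can differ slightly from a hand trace that rounds $p$ to three decimals before taking the log.

```python
import numpy as np

income = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
w0, w1 = 8.12, -0.094  # illustrative weights

p = 1.0 / (1.0 + np.exp(-(w0 + w1 * income)))

# Binary cross-entropy per sample (natural log).
losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))

for x_i, y_i, p_i, l_i in zip(income, y, p, losses):
    print(f"income={x_i:5.0f}  y={y_i}  p={p_i:.3f}  loss={l_i:.3f}")

print(f"average loss = {losses.mean():.3f}")
print(f"share from income=75: {losses[3] / losses.sum():.0%}")
```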

[Figure: two panels of loss against predicted probability $p$ from 0 to 1. Left: $- \log (p)$ , with low loss near $p = 1$ and loss growing without bound near $p = 0$ . Right: $- \log (1 - p)$ , with low loss near $p = 0$ and loss growing without bound near $p = 1$ .]

Left panel ( $y = 1$ ): loss approaches infinity as $p \to 0$ (confidently wrong). Right panel ( $y = 0$ ): loss approaches infinity as $p \to 1$ . In both cases, confident correct predictions have loss near zero.

Step 4: Gradient Descent Update

The total cost over $n$ samples:

$J (w) = - \frac{1}{n} \sum_{i = 1}^{n} [y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})]$

The gradients work out elegantly — the same form as linear regression but with sigmoid probabilities:

$\frac{\partial J}{\partial w _{0}} = \frac{1}{n} \sum_{i = 1}^{n} (p_{i} - y_{i})$

$\frac{\partial J}{\partial w _{1}} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p_{i} - y_{i})$
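One way to gain confidence in these two expressions is to compare them against finite differences of $J$ . The sketch below does that at the illustrative weights. The central-difference scheme and the step `eps = 1e-6` are choices made for this example, not part of the method itself.

```python
import numpy as np

x = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)


def cost(w0, w1):
    """Average binary cross-entropy J(w) over the six samples."""
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def analytic_grad(w0, w1):
    """Closed-form gradients: mean(p - y) and mean(x * (p - y))."""
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))
    return np.mean(p - y), np.mean(x * (p - y))


def numeric_grad(w0, w1, eps=1e-6):
    """Central finite differences on J, one weight at a time."""
    g0 = (cost(w0 + eps, w1) - cost(w0 - eps, w1)) / (2 * eps)
    g1 = (cost(w0, w1 + eps) - cost(w0, w1 - eps)) / (2 * eps)
    return g0, g1


w0, w1 = 8.12, -0.094  # illustrative weights
ga = analytic_grad(w0, w1)
gn = numeric_grad(w0, w1)
print("analytic:", ga)
print("numeric: ", gn)
```

If the two tuples agree to several digits, the closed-form gradient is consistent with the cost function it was derived from.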

One gradient step from $w_{0} = 0$ , $w_{1} = 0$ , $α = 0.01$ :

With all weights zero: $z = 0$ for every sample, so $p_{i} = σ (0) = 0.5$ .

Prediction errors $p_{i} - y_{i}$ :

Income	$p_{i} = 0.5$	$y_{i}$	$p_{i} - y_{i}$
25	0.5	1	−0.5
32	0.5	1	−0.5
45	0.5	1	−0.5
75	0.5	0	+0.5
95	0.5	0	+0.5
110	0.5	0	+0.5

Computing gradients:

$\frac{\partial J}{\partial w _{0}} = \frac{- 0.5 - 0.5 - 0.5 + 0.5 + 0.5 + 0.5}{6} = \frac{0}{6} = 0.0$

$\frac{\partial J}{\partial w _{1}} = \frac{25 ( - 0.5 ) + 32 ( - 0.5 ) + 45 ( - 0.5 ) + 75 ( 0.5 ) + 95 ( 0.5 ) + 110 ( 0.5 )}{6}$

$= \frac{- 12.5 - 16.0 - 22.5 + 37.5 + 47.5 + 55.0}{6} = \frac{89.0}{6} = 14.83$

Weight updates:

$w_{0} \leftarrow 0.0 - 0.01 \times 0.0 = 0.0$

$w_{1} \leftarrow 0.0 - 0.01 \times 14.83 = - 0.1483$

The $w_{0}$ gradient is zero because the dataset is balanced (3 defaulters, 3 non-defaulters) — symmetry cancels. The $w_{1}$ gradient is positive (14.83) because higher incomes are associated with non-default ( $y = 0$ ), so the gradient pushes $w_{1}$ negative — exactly right. After this one step, the model already knows to decrease $P (default)$ as income increases.
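The single update can be replayed in code. This is a minimal sketch of the step traced above, with the before and after cost added as an extra check that is not part of the hand trace.

```python
import numpy as np

x = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)
alpha = 0.01


def predict(w0, w1):
    """Sigmoid of the linear score for every sample."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))


def cost(w0, w1):
    """Average binary cross-entropy."""
    p = predict(w0, w1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


w0, w1 = 0.0, 0.0
p = predict(w0, w1)                 # all 0.5 at zero weights

grad_w0 = np.mean(p - y)            # balanced labels cancel
grad_w1 = np.mean(x * (p - y))      # 89 / 6

w0_new = w0 - alpha * grad_w0
w1_new = w1 - alpha * grad_w1

cost_before = cost(w0, w1)
cost_after = cost(w0_new, w1_new)

print(f"grad_w0={grad_w0:.4f}  grad_w1={grad_w1:.4f}")
print(f"w0={w0_new:.4f}  w1={w1_new:.4f}")
print(f"cost before={cost_before:.3f}  after={cost_after:.3f}")
```

The direction matches the text, but the step size is a separate matter. On raw, unscaled incomes this particular step overshoots: every $z$ becomes negative, and the average loss rises from about 0.693 to roughly 2.5. That is one reason features are commonly rescaled before training.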

Key Formulas Reference

Step	Formula	Purpose
Linear score	$z = w_{0} + w_{1} x$	Raw activation
Sigmoid	$σ (z) = 1/ (1 + e^{- z})$	Maps $z \to$ probability
Log-odds	$\log (p / (1 - p)) = z$	Linear interpretation
Per-sample loss	$- [y \log p + (1 - y) \log (1 - p)]$	Penalizes wrong confidence
Gradient	$(1/ n) \sum x_{i} (p_{i} - y_{i})$	Same form as linear regression

Test Your Understanding

The gradient $\partial J / \partial w_{0} = 0$ at initialization because the dataset is balanced. If you added one more defaulter (making 4 defaulters, 3 non-defaulters), what sign would $\partial J / \partial w_{0}$ have, and what does that mean for $w_{0}$ 's update?
At income=75, the loss under the illustrative weights is 1.367, the largest single-sample loss. After the one gradient step from zero initialization ( $w_{0} = 0$ , $w_{1} = - 0.1483$ ), compute the new $z$ for income=75 and the new $σ (z)$ . Did the loss for this sample decrease?
The gradient of BCE with respect to weights has the form $(1/ n) \sum x_{i} (p_{i} - y_{i})$ — identical in structure to the linear regression gradient. Why does the same form emerge from a completely different loss function?
At $w_{0} = 8.12$ , $w_{1} = - 0.094$ (the illustrative weights from the trace), the decision boundary (income where $P = 0.5$ ) is at $z = 0 \Rightarrow w_{0} + w_{1} x = 0 \Rightarrow x = - w_{0} / w_{1}$ . Compute this boundary income value. Does it match your intuition from the data?
The log-odds interpretation says each additional \$1k of income multiplies the odds of default by $e^{- 0.094} = 0.91$ . If income doubles from 50 to 100, what is the ratio of the odds at 100 to the odds at 50?