Logistic Regression: Math Intuition
Logistic regression produces probabilities, not class labels. Understanding where those probabilities come from — and why binary cross-entropy is the right loss — requires tracing the full path from raw linear score to gradient update. Every formula here is computed on concrete numbers.
Anchor dataset: Loan default prediction (6 samples for hand-trace clarity).
import numpy as np
# Income in $1k (single feature), one row per applicant
X = np.array([25, 32, 45, 75, 95, 110]).reshape(-1, 1)
# Labels: 1 = default, 0 = no default
y = np.array([1, 1, 1, 0, 0, 0])
# Weights after fitting (used for the trace, not the starting weights):
# w₀ = 8.12, w₁ = -0.094
Step 1: Linear Score → Sigmoid → Probability
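The raw linear score is computed the same way as in linear regression:
$$z = w_0 + w_1 x$$
The sigmoid function converts this to a probability:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The model predicts $\hat{p} = \sigma(z) = P(y = 1 \mid x)$, which is the probability of default given income.
Trace for all 6 samples with $w_0 = 8.12$, $w_1 = -0.094$: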
| Income | $z = w_0 + w_1 x$ | $\hat{p} = \sigma(z)$ | Predicted | Actual $y$ |
|---|---|---|---|---|
| 25 | 8.12 − 2.35 = 5.77 | $\sigma(5.77)$ = 0.997 | 1 | 1 ✓ |
| 32 | 8.12 − 3.01 = 5.11 | $\sigma(5.11)$ = 0.994 | 1 | 1 ✓ |
| 45 | 8.12 − 4.23 = 3.89 | $\sigma(3.89)$ = 0.980 | 1 | 1 ✓ |
| 75 | 8.12 − 7.05 = 1.07 | $\sigma(1.07)$ = 0.745 | 1 | 0 ✗ |
| 95 | 8.12 − 8.93 = −0.81 | $\sigma(-0.81)$ = 0.308 | 0 | 0 ✓ |
| 110 | 8.12 − 10.34 = −2.22 | $\sigma(-2.22)$ = 0.098 | 0 | 0 ✓ |
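At decision threshold 0.5: income = 75 gives $\hat{p} = 0.745 > 0.5$, so the sample is predicted as default (wrong). These weights are illustrative. The classes are perfectly separable by income (every defaulter is at 45 or below, every non-defaulter at 75 or above), so a maximum-likelihood fit would classify all six samples correctly; without regularization its weights would keep growing rather than settle at finite values.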
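The trace above can be reproduced in a few lines. This is a minimal sketch, assuming the illustrative weights from the post and a 0.5 threshold; the helper name `sigmoid` and the variable names are choices made here, not part of the original.

```python
import numpy as np

# Anchor dataset from the post: income in $1k, label 1 = default
X = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

# Illustrative weights used in the trace
w0, w1 = 8.12, -0.094

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = w0 + w1 * X                       # linear score per sample
p_hat = sigmoid(z)                    # predicted probability of default
y_pred = (p_hat >= 0.5).astype(int)   # class label at threshold 0.5

for income, zi, pi, yp, yi in zip(X, z, p_hat, y_pred, y):
    mark = "ok" if yp == yi else "MISS"
    print(f"income={income:5.0f}  z={zi:6.2f}  p_hat={pi:.3f}  pred={yp}  y={yi}  {mark}")
```

Only the income = 75 row should be flagged as a miss, matching the ✗ in the table.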
[Figure: sigmoid curve of predicted probability against the linear score, with a dashed horizontal line at 0.5, asymptotes at 0 and 1, and a vertical "decision boundary, σ = 0.5" line at score 0. The six samples (25k, 32k, 45k, 75k, 95k, 110k) are marked on the curve; 75k is drawn in red.]
Green dots are correctly classified; the red dot at income=75k sits above the 0.5 line but is a non-defaulter (y=0). The S-shape ensures all outputs stay within (0, 1).
Step 2: Log-Odds (Logit) Interpretation
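The ratio of default probability to non-default probability is the odds:
$$\text{odds} = \frac{\hat{p}}{1 - \hat{p}}$$
Taking the log of odds recovers the linear score exactly:
$$\log\frac{\hat{p}}{1 - \hat{p}} = \log e^{z} = z = w_0 + w_1 x$$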
The log-odds (logit) is linear in the features. This means logistic regression is a linear model — it draws a straight boundary in feature space — just applied to log-odds rather than probability directly.
Log-odds trace for 3 anchor samples:
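| Income | $\hat{p}$ | $1 - \hat{p}$ | Odds | $\log(\text{odds})$ vs. $z$ |
|---|---|---|---|---|
| 25 | 0.997 | 0.003 | 332.3 | $5.81 \approx 5.77$ ✓ |
| 75 | 0.745 | 0.255 | 2.92 | $1.07 = 1.07$ ✓ |
| 110 | 0.098 | 0.902 | 0.109 | $-2.22 = -2.22$ ✓ |
The small gap in the first row comes from rounding $\hat{p}$ to 0.997 before taking the ratio.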
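As a quick check of the identity, the sketch below computes the odds and log-odds from unrounded probabilities for the same three samples. Variable names are illustrative. Because nothing is rounded here, the odds equal $e^{z}$ exactly, so the first row comes out near 320 rather than the table's 332.3, which is computed from the rounded 0.997.

```python
import numpy as np

w0, w1 = 8.12, -0.094
X = np.array([25.0, 75.0, 110.0])   # the three samples in the table

z = w0 + w1 * X
p_hat = 1.0 / (1.0 + np.exp(-z))

odds = p_hat / (1.0 - p_hat)        # odds of default
log_odds = np.log(odds)             # logit; should reproduce z

for xi, zi, oi, li in zip(X, z, odds, log_odds):
    print(f"income={xi:5.0f}  z={zi:6.2f}  odds={oi:9.3f}  log-odds={li:6.2f}")
```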
Interpreting $w_1$: Each additional \$1k in income changes the log-odds of default by $w_1 = -0.094$. In odds terms, it multiplies the odds of default by:
$$e^{w_1} = e^{-0.094} \approx 0.910$$
That is a 9% reduction in the odds of default for each \$1k of additional income.
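The multiplier compounds rather than adds: an increase of $k$ thousand dollars multiplies the odds by $e^{w_1 k}$. A small sketch, using a 10k increase as an arbitrary example that does not appear in the post:

```python
import math

w1 = -0.094   # slope from the post, per $1k of income

odds_ratio_per_1k = math.exp(w1)               # multiplier for +$1k
pct_change = (odds_ratio_per_1k - 1) * 100     # percentage change in the odds

# Multipliers compound: +k thousand dollars multiplies the odds by exp(w1 * k)
odds_ratio_per_10k = math.exp(w1 * 10)

print(f"per $1k : x{odds_ratio_per_1k:.3f} ({pct_change:+.1f}%)")
print(f"per $10k: x{odds_ratio_per_10k:.3f}")
```

Note that ten steps of 9% do not add up to a 90% reduction; the 10k multiplier is the 1k multiplier raised to the tenth power.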
Step 3: Binary Cross-Entropy Loss
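Why not MSE? Consider predicting $\hat{p} = 0.999$ with $y = 1$ (correct, very confident). MSE loss = $(1 - 0.999)^2 = 10^{-6}$, nearly zero. Now predict $\hat{p} = 0.5$ with $y = 1$ (correct, barely). MSE = $(1 - 0.5)^2 = 0.25$, which is 250,000× larger. The problem shows up at the other extreme: a confidently wrong prediction, $\hat{p} = 0.001$ with $y = 1$, costs only $(1 - 0.001)^2 \approx 0.998$, about 4× the coin flip. MSE is capped at 1, so it barely separates an unsure model from a confidently wrong one, and its gradient through a saturated sigmoid is close to zero exactly where the largest correction is needed.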
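Binary cross-entropy (BCE) penalizes wrong confidence, not wrong predictions:
$$\mathcal{L}(y, \hat{p}) = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right]$$
Three cases (illustrative values, natural log):
- Correct and confident: $y = 1$, $\hat{p} = 0.999$ → $\mathcal{L} = -\log(0.999) \approx 0.001$ (tiny)
- Correct and unconfident: $y = 1$, $\hat{p} = 0.6$ → $\mathcal{L} = -\log(0.6) \approx 0.511$ (meaningful signal)
- Wrong and confident: $y = 1$, $\hat{p} = 0.001$ → $\mathcal{L} = -\log(0.001) \approx 6.908$ (large penalty, → $\infty$ as $\hat{p} \to 0$)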
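The three cases can be compared side by side with MSE. This is a sketch with hypothetical probabilities picked for illustration (0.999, 0.6 and 0.001, all with true label $y = 1$); the function names `bce` and `mse` are defined here and are not from any library.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one sample (natural log)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mse(y, p):
    """Squared error for one sample."""
    return (y - p) ** 2

# Hypothetical probabilities chosen for illustration; the true label is y = 1
cases = {
    "correct, confident": 0.999,
    "correct, unconfident": 0.6,
    "wrong, confident": 0.001,
}

for name, p in cases.items():
    print(f"{name:22s} p_hat={p:.3f}  BCE={bce(1, p):.3f}  MSE={mse(1, p):.3f}")
```

BCE keeps growing as the wrong prediction gets more confident, while the squared error can never exceed 1.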
Per-sample loss for the 6-sample anchor:
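| Income | $y$ | $\hat{p}$ | Loss |
|---|---|---|---|
| 25 | 1 | 0.997 | $-\log(0.997) = 0.003$ |
| 32 | 1 | 0.994 | $-\log(0.994) = 0.006$ |
| 45 | 1 | 0.980 | $-\log(0.980) = 0.020$ |
| 75 | 0 | 0.745 | $-\log(1 - 0.745) = 1.367$ |
| 95 | 0 | 0.308 | $-\log(1 - 0.308) = 0.368$ |
| 110 | 0 | 0.098 | $-\log(1 - 0.098) = 0.103$ |
| Avg loss | | | 0.311 |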
The sample at income=75 dominates the loss (1.367) because the model is confidently wrong: it predicts $\hat{p} = 0.745$ (likely default) for a non-defaulter. This is exactly where gradient descent will push the decision boundary.
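The per-sample losses and their average can be computed directly from the weights. A minimal sketch, with variable names chosen here; the table works from probabilities rounded to three decimals, so the unrounded values can differ in the last digit (about 1.365 rather than 1.367 for income 75).

```python
import numpy as np

X = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
w0, w1 = 8.12, -0.094

p_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * X)))

# Per-sample BCE: -log(p_hat) when y = 1, -log(1 - p_hat) when y = 0
losses = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
avg_loss = losses.mean()

worst = int(np.argmax(losses))   # index of the sample with the largest loss
print(np.round(losses, 3), round(float(avg_loss), 3))
print(f"largest loss at income={X[worst]:.0f}")
```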
[Figure: two panels plotting loss against p (predicted probability) from 0 to 1. Left panel: -log(p), high loss → ∞ near p = 0, low loss near p = 1. Right panel: -log(1-p), low loss near p = 0, high loss → ∞ near p = 1.]
Left panel ($y = 1$): loss approaches infinity as $\hat{p} \to 0$ (confidently wrong). Right panel ($y = 0$): loss approaches infinity as $\hat{p} \to 1$. In both cases, confident correct predictions have loss near zero.
Step 4: Gradient Descent Update
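The total cost over $n$ samples:
$$J(w_0, w_1) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\right]$$
The gradients work out elegantly, with the same form as linear regression but with sigmoid probabilities in place of linear predictions:
$$\frac{\partial J}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i), \qquad \frac{\partial J}{\partial w_1} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)\,x_i$$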
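One way to gain confidence in the gradient formulas is to compare them against finite differences of the cost. The sketch below does this on the anchor dataset; the test point $(w_0, w_1) = (0.5, -0.01)$ and the step size `h` are arbitrary choices made for this check, and all function names are defined here.

```python
import numpy as np

X = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

def predict(w0, w1):
    """Sigmoid of the linear score for every sample."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * X)))

def cost(w0, w1):
    """Mean binary cross-entropy over the dataset."""
    p = predict(w0, w1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w0, w1):
    """Gradient from the closed form: mean of (p_hat - y) and (p_hat - y) * x."""
    err = predict(w0, w1) - y
    return np.mean(err), np.mean(err * X)

def numeric_grad(w0, w1, h=1e-6):
    """Central differences, used only as a sanity check."""
    g0 = (cost(w0 + h, w1) - cost(w0 - h, w1)) / (2 * h)
    g1 = (cost(w0, w1 + h) - cost(w0, w1 - h)) / (2 * h)
    return g0, g1

# Arbitrary test point chosen for this sketch
w0, w1 = 0.5, -0.01
print("analytic:", analytic_grad(w0, w1))
print("numeric :", numeric_grad(w0, w1))
```

The two pairs should agree to several decimal places; a mismatch would point to a sign or indexing error in the closed form.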
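One gradient step from $w_0 = 0$, $w_1 = 0$, with learning rate $\alpha$:
With all weights zero: $z = 0$ for every sample, so $\hat{p} = \sigma(0) = 0.5$.
Prediction errors $\hat{p}_i - y_i$:
| Income | $\hat{p}$ | $y$ | $\hat{p} - y$ |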
|---|---|---|---|
| 25 | 0.5 | 1 | −0.5 |
| 32 | 0.5 | 1 | −0.5 |
| 45 | 0.5 | 1 | −0.5 |
| 75 | 0.5 | 0 | +0.5 |
| 95 | 0.5 | 0 | +0.5 |
| 110 | 0.5 | 0 | +0.5 |
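Computing gradients:
$$\frac{\partial J}{\partial w_0} = \frac{1}{6}\left[\,3(-0.5) + 3(0.5)\,\right] = 0$$
$$\frac{\partial J}{\partial w_1} = \frac{1}{6}\left[\,-0.5\,(25 + 32 + 45) + 0.5\,(75 + 95 + 110)\,\right] = \frac{-51 + 140}{6} = \frac{89}{6} \approx 14.83$$
Weight updates (learning rate $\alpha$):
$$w_0 \leftarrow 0 - \alpha \cdot 0 = 0, \qquad w_1 \leftarrow 0 - \alpha \cdot 14.83 = -14.83\,\alpha$$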
The $w_0$ gradient is zero because the dataset is balanced (3 defaulters, 3 non-defaulters), so the errors cancel by symmetry. The $w_1$ gradient is positive (14.83) because higher incomes are associated with non-default ($y = 0$), so the update pushes $w_1$ negative, which is exactly right. After this one step, the model already knows to decrease $\hat{p}$ as income increases.
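The single step can be sketched as follows. The learning rate `alpha = 0.001` is an assumed value picked for illustration, not one taken from the post; any positive value gives the same signs.

```python
import numpy as np

X = np.array([25, 32, 45, 75, 95, 110], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

w0, w1 = 0.0, 0.0
alpha = 0.001   # ASSUMED learning rate for illustration, not taken from the post

p_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * X)))   # all 0.5 at zero weights
err = p_hat - y                                 # -0.5 for defaulters, +0.5 otherwise

grad_w0 = err.mean()          # (1/n) * sum(p_hat - y)
grad_w1 = (err * X).mean()    # (1/n) * sum((p_hat - y) * x)

w0_new = w0 - alpha * grad_w0
w1_new = w1 - alpha * grad_w1

# Predictions after the step: probability should now fall as income rises
p_new = 1.0 / (1.0 + np.exp(-(w0_new + w1_new * X)))

print(f"grad_w0={grad_w0:.2f}  grad_w1={grad_w1:.2f}")
print(f"w0_new={w0_new:.5f}  w1_new={w1_new:.5f}")
print(np.round(p_new, 3))
```

With a zero intercept and a negative slope, every updated probability sits below 0.5, so the intercept still has to move in later steps before the defaulters are classified correctly.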
Key Formulas Reference
| Step | Formula | Purpose |
|---|---|---|
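| Linear score | $z = w_0 + w_1 x$ | Raw activation |
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | Maps $z$ to a probability in $(0, 1)$ |
| Log-odds | $\log\frac{\hat{p}}{1 - \hat{p}} = z$ | Linear interpretation |
| Per-sample loss | $-\left[\, y \log \hat{p} + (1 - y)\log(1 - \hat{p}) \,\right]$ | Penalizes wrong confidence |
| Gradient | $\frac{1}{n}\sum_i (\hat{p}_i - y_i)\,x_i$ | Same form as linear regression |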
Test Your Understanding
- The gradient $\partial J / \partial w_0 = 0$ at initialization because the dataset is balanced. If you added one more defaulter (making 4 defaulters, 3 non-defaulters), what sign would $\partial J / \partial w_0$ have, and what does that mean for $w_0$'s update?
- At income=75, the loss is 1.367, the largest single-sample loss. After one gradient step from zero weights with learning rate $\alpha$ (so $w_0 = 0$, $w_1 = -14.83\,\alpha$), compute the new $z$ for income=75 and the new $\hat{p}$. Did the loss for this sample decrease?
- The gradient of BCE with respect to weights has the form $\frac{1}{n}\sum_i (\hat{p}_i - y_i)\,x_i$, identical in structure to the linear regression gradient. Why does the same form emerge from a completely different loss function?
- At $w_0 = 8.12$, $w_1 = -0.094$ (the approximate fitted weights), the decision boundary (income where $\hat{p} = 0.5$) is at $x = -w_0 / w_1$. Compute this boundary income value. Does it match your intuition from the data?
- The log-odds interpretation says each \$1k income multiplies the odds of default by $e^{w_1} \approx 0.91$. If income doubles from 50 to 100, what is the ratio of the odds at 100 to the odds at 50?