Logistic Regression: Math Intuition

Jun 26, 2026 · 8 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Logistic regression produces probabilities, not class labels. Understanding where those probabilities come from — and why binary cross-entropy is the right loss — requires tracing the full path from raw linear score to gradient update. Every formula here is computed on concrete numbers.

Anchor dataset: Loan default prediction (6 samples for hand-trace clarity).

```python
import numpy as np

# Income in $1k, one feature per sample
X = np.array([25, 32, 45, 75, 95, 110]).reshape(-1, 1)
# Label: 1 = default, 0 = no default
y = np.array([1,   1,  1,  0,  0,   0])

# Approximate fitted weights used for the trace (not the starting weights):
# w₀ = 8.12, w₁ = -0.094
```

Step 1: Linear Score → Sigmoid → Probability

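The raw linear score is computed the same way as in linear regression:

\[ z = w_0 + w_1 x \]

The sigmoid function converts this to a probability:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

The model predicts \(\hat{p} = \sigma(z) = P(y = 1 \mid x)\), which is the probability of default given income.

Trace for all 6 samples with \(w_0 = 8.12\), \(w_1 = -0.094\):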

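| Income | \(z = w_0 + w_1 x\) | \(\hat{p} = \sigma(z)\) | Predicted | \(y\) | Correct? |
|---|---|---|---|---|---|
| 25 | 8.12 − 2.35 = 5.77 | 0.997 | 1 | 1 | ✓ |
| 32 | 8.12 − 3.01 = 5.11 | 0.994 | 1 | 1 | ✓ |
| 45 | 8.12 − 4.23 = 3.89 | 0.980 | 1 | 1 | ✓ |
| 75 | 8.12 − 7.05 = 1.07 | 0.745 | 1 | 0 | ✗ |
| 95 | 8.12 − 8.93 = −0.81 | 0.307 | 0 | 0 | ✓ |
| 110 | 8.12 − 10.34 = −2.22 | 0.098 | 0 | 0 | ✓ |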

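At decision threshold 0.5: income = 75 gives \(\hat{p} = 0.745 > 0.5\), so it is predicted as a default (wrong, since the true label is \(y = 0\)). These weights are illustrative: the two classes here are perfectly separable by income (every defaulter earns 45 or less, every non-defaulter 75 or more), so an unregularized maximum-likelihood fit would classify all six samples correctly.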
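The trace can be reproduced in a few lines of NumPy. This is a minimal sketch using the illustrative weights from the text; the helper name `sigmoid` and the choice to send a probability of exactly 0.5 to class 1 are assumptions made here, not something the walkthrough specifies.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

income = np.array([25, 32, 45, 75, 95, 110], dtype=float)  # in $1k
y = np.array([1, 1, 1, 0, 0, 0])                            # 1 = default

w0, w1 = 8.12, -0.094                 # illustrative weights from the text

z = w0 + w1 * income                  # linear score
p_hat = sigmoid(z)                    # P(default | income)
y_pred = (p_hat >= 0.5).astype(int)   # assumed tie rule: 0.5 goes to class 1

for row in zip(income, z, p_hat, y_pred, y):
    print("income=%5.0f  z=%6.2f  p_hat=%.3f  pred=%d  y=%d" % row)
```

Printed values are unrounded, so the last digit can differ slightly from the hand trace.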

[Figure: sigmoid curve. x-axis: z (linear score); y-axis: σ(z) = P(y=1). The six samples (25k to 110k) are marked on the curve, with the decision boundary at σ = 0.5.]

Green dots are correctly classified; the red dot at income=75k sits above the 0.5 line but is a non-defaulter (y=0). The S-shape ensures all outputs stay within (0, 1).

Step 2: Log-Odds (Logit) Interpretation

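The ratio of default probability to non-default probability is the odds:

\[ \text{odds} = \frac{\hat{p}}{1 - \hat{p}} \]

Taking the log of odds recovers the linear score exactly:

\[ \log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = z = w_0 + w_1 x \]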

The log-odds (logit) is linear in the features. This means logistic regression is a linear model — it draws a straight boundary in feature space — just applied to log-odds rather than probability directly.

Log-odds trace for 3 anchor samples:

| Income | \(\hat{p}\) | \(1 - \hat{p}\) | Odds |
|---|---|---|---|
| 25 | 0.997 | 0.003 | 332.3 |
| 75 | 0.745 | 0.255 | 2.92 |
| 110 | 0.098 | 0.902 | 0.109 |

Interpreting \(w_1\): Each additional $1k in income changes the log-odds of default by \(w_1 = -0.094\). In odds terms, it multiplies the odds of default by:

\[ e^{w_1} = e^{-0.094} \approx 0.910 \]

That is a 9% reduction in the odds of default for each $1k of additional income.
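A short numerical check of the logit identity and the per-$1k odds multiplier, again assuming the illustrative weights \(w_0 = 8.12\), \(w_1 = -0.094\). The variable names are chosen here for illustration.

```python
import numpy as np

w0, w1 = 8.12, -0.094                    # illustrative weights from the text
income = np.array([25.0, 75.0, 110.0])   # the three anchor samples, in $1k

z = w0 + w1 * income
p_hat = 1.0 / (1.0 + np.exp(-z))

odds = p_hat / (1.0 - p_hat)
log_odds = np.log(odds)                  # should reproduce z

odds_ratio_per_1k = np.exp(w1)           # multiplier on the odds per +$1k

print("z        :", np.round(z, 3))
print("log-odds :", np.round(log_odds, 3))
print("odds multiplier per $1k: %.3f" % odds_ratio_per_1k)
```

Because the code uses unrounded probabilities, its odds differ somewhat from a table built on probabilities rounded to three decimals; the log-odds still match \(z\) to floating-point precision.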

Step 3: Binary Cross-Entropy Loss

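Why not MSE? Consider predicting \(\hat{p} = 0.999\) with \(y = 1\) (correct, very confident). MSE loss = \((1 - 0.999)^2 = 10^{-6}\), nearly zero. Now predict \(\hat{p} = 0.501\) with \(y = 1\) (correct, barely). MSE = \((1 - 0.501)^2 \approx 0.249\), about 250,000× larger. That ordering is reasonable on its own; the trouble is at the other extreme. A confidently wrong prediction, \(\hat{p} = 0.001\) with \(y = 1\), costs only \((1 - 0.001)^2 \approx 0.998\), just 4× the barely-correct case, because squared error on a probability can never exceed 1. MSE therefore barely distinguishes "unsure" from "confidently wrong", and its gradient through the sigmoid shrinks toward zero exactly where the model most needs correcting.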
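To make the MSE comparison concrete, the sketch below evaluates both losses, and their gradients with respect to the linear score \(z\), for three predictions with \(y = 1\). The three probabilities are illustrative choices, and MSE is taken here as plain squared error with no factor of 1/2 (an assumption).

```python
import numpy as np

y = 1.0
# Illustrative probabilities: confident-correct, barely-correct, confident-wrong
p_hat = np.array([0.999, 0.501, 0.001])

mse = (y - p_hat) ** 2
bce = -np.log(p_hat)                 # BCE reduces to -log(p_hat) when y = 1

# Gradients with respect to z (chain rule through the sigmoid)
mse_grad_z = 2 * (p_hat - y) * p_hat * (1 - p_hat)
bce_grad_z = p_hat - y

for row in zip(p_hat, mse, bce, mse_grad_z, bce_grad_z):
    print("p_hat=%.3f  MSE=%.6f  BCE=%.3f  dMSE/dz=%+.4f  dBCE/dz=%+.4f" % row)
```

For the confidently wrong case the MSE gradient is close to zero while the BCE gradient is close to −1, which is the practical argument for BCE.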

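Binary cross-entropy (BCE) penalizes wrong confidence, not wrong predictions:

\[ \mathcal{L}(y, \hat{p}) = -\left[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,\right] \]

Three cases (illustrative values, natural log):

  • Correct and confident: \(y = 1\), \(\hat{p} = 0.99\), \(\mathcal{L} = -\log(0.99) \approx 0.010\) (tiny)
  • Correct and unconfident: \(y = 1\), \(\hat{p} = 0.6\), \(\mathcal{L} = -\log(0.6) \approx 0.511\) (meaningful signal)
  • Wrong and confident: \(y = 1\), \(\hat{p} = 0.01\), \(\mathcal{L} = -\log(0.01) \approx 4.605\) (large penalty, \(\mathcal{L} \to \infty\) as \(\hat{p} \to 0\))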
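In code, the per-sample loss is one line, but \(\log(0)\) has to be guarded against. The sketch below clips probabilities to \([\epsilon, 1 - \epsilon]\); the function name `bce`, the value `eps=1e-12`, and the test probabilities are assumptions chosen for illustration.

```python
import numpy as np

def bce(y, p_hat, eps=1e-12):
    """Per-sample binary cross-entropy (natural log).

    p_hat is clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    p = np.clip(p_hat, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1.0 - p))

# y = 1 in every case; the probabilities are illustrative
for p in [0.99, 0.6, 0.01, 1e-6, 0.0]:
    print(f"p_hat={p:<8g} loss={bce(1, p):.3f}")
```

With clipping, a prediction of exactly 0 for a positive sample yields a large finite loss instead of infinity, which keeps training loops from producing NaNs.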

Per-sample loss for the 6-sample anchor:

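| Income | \(y\) | \(\hat{p}\) | Loss |
|---|---|---|---|
| 25 | 1 | 0.997 | \(-\log(0.997) = 0.003\) |
| 32 | 1 | 0.994 | \(-\log(0.994) = 0.006\) |
| 45 | 1 | 0.980 | \(-\log(0.980) = 0.020\) |
| 75 | 0 | 0.745 | \(-\log(1 - 0.745) = 1.367\) |
| 95 | 0 | 0.307 | \(-\log(1 - 0.307) = 0.367\) |
| 110 | 0 | 0.098 | \(-\log(1 - 0.098) = 0.103\) |
| Avg loss | | | 0.311 |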

The sample at income=75 dominates the loss (1.367) because the model is confidently wrong: it predicts \(\hat{p} = 0.745\) (likely default) for a non-defaulter. This is exactly where gradient descent will push the decision boundary.
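The per-sample losses and their average can be checked directly. This sketch assumes the illustrative weights \(w_0 = 8.12\), \(w_1 = -0.094\) and uses unrounded probabilities, so individual values may differ from a hand trace in the third decimal.

```python
import numpy as np

income = np.array([25, 32, 45, 75, 95, 110], dtype=float)  # in $1k
y = np.array([1, 1, 1, 0, 0, 0])                            # 1 = default
w0, w1 = 8.12, -0.094                                       # illustrative weights

p_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * income)))
losses = -(y * np.log(p_hat) + (1 - y) * np.log(1.0 - p_hat))

avg_loss = losses.mean()
worst = income[np.argmax(losses)]        # sample with the largest loss
share = losses.max() / losses.sum()      # its fraction of the total loss

for x, l in zip(income, losses):
    print("income=%5.0f  loss=%.3f" % (x, l))
print("avg loss = %.3f, worst sample = %.0f, share of total = %.0f%%"
      % (avg_loss, worst, 100 * share))
```

The single misclassified sample accounts for most of the total loss, which is why it drives the next weight update.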

[Figure: two panels. Left, "Loss when y=1": −log(p) against p (predicted probability). Right, "Loss when y=0": −log(1−p) against p (predicted probability).]

Left panel (\(y = 1\)): loss approaches infinity as \(\hat{p} \to 0\) (confidently wrong). Right panel (\(y = 0\)): loss approaches infinity as \(\hat{p} \to 1\). In both cases, confident correct predictions have loss near zero.

Step 4: Gradient Descent Update

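The total cost over \(n\) samples:

\[ J(w_0, w_1) = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \,\right] \]

The gradients work out elegantly, taking the same form as linear regression but with sigmoid probabilities:

\[ \frac{\partial J}{\partial w_0} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i), \qquad \frac{\partial J}{\partial w_1} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)\, x_i \]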
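The simple form of the gradient follows from the chain rule. A brief sketch for a single sample, writing \(\mathcal{L}\) for the per-sample BCE loss, \(\hat{p} = \sigma(z)\) for the predicted probability, and \(z = w_0 + w_1 x\) for the linear score (notation assumed here):

```latex
\begin{align*}
\frac{\partial \mathcal{L}}{\partial \hat{p}}
  &= -\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}}
   = \frac{\hat{p} - y}{\hat{p}\,(1 - \hat{p})} \\
\frac{\partial \hat{p}}{\partial z}
  &= \sigma(z)\bigl(1 - \sigma(z)\bigr) = \hat{p}\,(1 - \hat{p}) \\
\frac{\partial z}{\partial w_0} &= 1, \qquad \frac{\partial z}{\partial w_1} = x \\
\frac{\partial \mathcal{L}}{\partial w_1}
  &= \frac{\partial \mathcal{L}}{\partial \hat{p}}
     \cdot \frac{\partial \hat{p}}{\partial z}
     \cdot \frac{\partial z}{\partial w_1}
   = (\hat{p} - y)\, x
\end{align*}
```

The factor \(\hat{p}(1 - \hat{p})\) from the sigmoid derivative cancels the denominator of the loss derivative, leaving only the prediction error times the input. Averaging over samples gives the batch gradient.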

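One gradient step from \(w_0 = 0\), \(w_1 = 0\), with learning rate \(\alpha\):

With all weights zero: \(z = 0\) for every sample, so \(\hat{p} = \sigma(0) = 0.5\).

Prediction errors \(\hat{p}_i - y_i\):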

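| Income | \(\hat{p}\) | \(y\) | \(\hat{p} - y\) |
|---|---|---|---|
| 25 | 0.5 | 1 | −0.5 |
| 32 | 0.5 | 1 | −0.5 |
| 45 | 0.5 | 1 | −0.5 |
| 75 | 0.5 | 0 | +0.5 |
| 95 | 0.5 | 0 | +0.5 |
| 110 | 0.5 | 0 | +0.5 |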

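Computing gradients:

\[ \frac{\partial J}{\partial w_0} = \frac{1}{6}\left(-0.5 - 0.5 - 0.5 + 0.5 + 0.5 + 0.5\right) = 0 \]

\[ \frac{\partial J}{\partial w_1} = \frac{1}{6}\left[-0.5\,(25 + 32 + 45) + 0.5\,(75 + 95 + 110)\right] = \frac{-51 + 140}{6} = \frac{89}{6} \approx 14.83 \]

Weight updates, written with learning rate \(\alpha\):

\[ w_0 \leftarrow 0 - \alpha \cdot 0 = 0 \]

\[ w_1 \leftarrow 0 - \alpha \cdot 14.83 = -14.83\,\alpha \]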

The \(w_0\) gradient is zero because the dataset is balanced (3 defaulters, 3 non-defaulters), so the errors cancel by symmetry. The \(w_1\) gradient is positive (14.83) because higher incomes are associated with non-default (\(y = 0\)), so the update pushes \(w_1\) negative, which is exactly right. After this one step, the model already knows to decrease \(\hat{p}\) as income increases.
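The single step can be verified numerically. In this sketch the learning rate `lr = 0.01` is an assumed value picked only to produce a concrete update; the gradients themselves do not depend on it.

```python
import numpy as np

income = np.array([25, 32, 45, 75, 95, 110], dtype=float)  # in $1k
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)               # 1 = default

w0, w1 = 0.0, 0.0
lr = 0.01            # ASSUMED learning rate, for illustration only

p_hat = 1.0 / (1.0 + np.exp(-(w0 + w1 * income)))   # all 0.5 at initialization
error = p_hat - y

grad_w0 = error.mean()                 # (1/n) * sum(p_hat - y)
grad_w1 = (error * income).mean()      # (1/n) * sum((p_hat - y) * x)

w0_new = w0 - lr * grad_w0
w1_new = w1 - lr * grad_w1

print("grad_w0 = %.4f, grad_w1 = %.4f" % (grad_w0, grad_w1))
print("w0_new  = %.4f, w1_new  = %.4f" % (w0_new, w1_new))
```

Whatever positive learning rate is used, `w1_new` comes out negative and `w0_new` stays at zero after this first step.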

Key Formulas Reference

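| Step | Formula | Purpose |
|---|---|---|
| Linear score | \(z = w_0 + w_1 x\) | Raw activation |
| Sigmoid | \(\sigma(z) = \frac{1}{1 + e^{-z}}\) | Maps \(z\) to a probability in \((0, 1)\) |
| Log-odds | \(\log\frac{\hat{p}}{1 - \hat{p}} = z\) | Linear interpretation |
| Per-sample loss | \(-[\, y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \,]\) | Penalizes wrong confidence |
| Gradient | \(\frac{\partial J}{\partial w_1} = \frac{1}{n} \sum_i (\hat{p}_i - y_i)\, x_i\) | Same form as linear regression |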
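The formulas above chain into a complete training loop. The sketch below is one possible arrangement, not the procedure used to obtain the weights in this post: it standardizes income first (an assumption added here so a single learning rate suits both weights), and `lr`, `n_steps`, and the helper names are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(y, p_hat, eps=1e-12):
    """Average binary cross-entropy, with clipping to avoid log(0)."""
    p = np.clip(p_hat, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1.0 - p))))

income = np.array([25, 32, 45, 75, 95, 110], dtype=float)  # in $1k
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)               # 1 = default

# ASSUMPTION: standardize the feature before gradient descent
x = (income - income.mean()) / income.std()

w0, w1 = 0.0, 0.0
lr, n_steps = 0.5, 2000        # ASSUMED hyperparameters

history = []
for _ in range(n_steps):
    p_hat = sigmoid(w0 + w1 * x)
    history.append(bce_cost(y, p_hat))   # cost before this update
    error = p_hat - y
    w0 -= lr * error.mean()
    w1 -= lr * (error * x).mean()

final_pred = (sigmoid(w0 + w1 * x) >= 0.5).astype(int)
print("first cost = %.3f, last cost = %.4f" % (history[0], history[-1]))
print("predictions:", final_pred)
```

The learned weights apply to standardized income, so they are not directly comparable to the \(w_0\), \(w_1\) used in the hand trace. Because the classes are separable, the cost keeps shrinking and the weights keep growing the longer the loop runs.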

Test Your Understanding

  1. The gradient \(\partial J / \partial w_0 = 0\) at initialization because the dataset is balanced. If you added one more defaulter (making 4 defaulters, 3 non-defaulters), what sign would \(\partial J / \partial w_0\) have, and what does that mean for \(w_0\)'s update?

  2. At income=75, the loss is 1.367, the largest single-sample loss. After one gradient step with learning rate \(\alpha\) (\(w_0 = 0\), \(w_1 = -14.83\,\alpha\)), compute the new \(z\) for income=75 and the new \(\hat{p}\). Did the loss for this sample decrease?

  3. The gradient of BCE with respect to weights has the form \(\frac{1}{n} \sum_i (\hat{p}_i - y_i)\, x_i\), identical in structure to the linear regression gradient. Why does the same form emerge from a completely different loss function?

  4. At \(w_0 = 8.12\), \(w_1 = -0.094\) (the approximate fitted weights), the decision boundary (income where \(\hat{p} = 0.5\)) is at \(x = -w_0 / w_1\). Compute this boundary income value. Does it match your intuition from the data?

  5. The log-odds interpretation says each $1k income multiplies the odds of default by \(e^{w_1} \approx 0.910\). If income doubles from 50 to 100, what is the ratio of the odds at 100 to the odds at 50?
