~/blog

SVM: Math Intuition and Cost Function

Jun 26, 2026•8 min read•By Mohammed Vasim

Machine LearningAIData Science

The previous post defined the SVM objective — maximize the margin — and showed that maximizing $2/∥ w ∥$ is equivalent to minimizing $∥ w ∥^{2} /2$ . What it didn't explain is why the margin can be written as $2/∥ w ∥$ in the first place, how the constraints on individual samples translate into a single convex optimization problem, and why the solution is sparse (only support vectors contribute to the decision boundary). This post derives all three.

Same anchor dataset: Binary loan default prediction.

python

import numpy as np

X = np.array([
    [25, 0.3],  [30, 0.4],  [40, 0.5],  [55, 0.6],
    [70, 0.7],  [80, 0.8],  [90, 0.8],  [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

From Slack Variables to Hinge Loss

Soft margin introduces slack variables $ξ_{i} \geq 0$ . The constraint $y_{i} (w \cdot x_{i} + b) \geq 1 - ξ_{i}$ defines $ξ_{i}$ implicitly: if the constraint is satisfied with margin $\geq 1$ , then $ξ_{i} = 0$ ; otherwise $ξ_{i} = 1 - y_{i} (w \cdot x_{i} + b)$ .

Defining $f_{i} = y_{i} (w \cdot x_{i} + b)$ (the functional margin), we get:

$ξ_{i} = max (0, 1 - f_{i})$

This is the hinge loss — zero when the point is outside the margin, and growing linearly as the point moves inside or across the boundary. Substituting into the soft margin objective:

$J = \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{n} max (0, 1 - y_{i} (w \cdot x_{i} + b))$

This is an unconstrained minimization in $w$ and $b$ . The hinge loss is convex and piecewise linear — the gradient is well-defined except at the kink ( $f_{i} = 1$ ), where subgradients are used.

Hinge Loss Trace on the Anchor

Using approximate weights $w = [0.08, 2.5]$ , $b = - 6.5$ , $C = 1$ :

Sample	Income	Credit	$y_{i}$	$f_{i} = w \cdot x_{i} + b$	$y_{i} f_{i}$	$ξ_{i} = max (0, 1 - y_{i} f_{i})$
1	25	0.3	−1	−3.75	3.75	0.00
2	30	0.4	−1	−3.10	3.10	0.00
3	40	0.5	−1	−2.05	2.05	0.00
4	55	0.6	−1	−0.60	0.60	0.40
5	70	0.7	+1	+0.85	0.85	0.15
6	80	0.8	+1	+1.70	1.70	0.00
7	90	0.8	+1	+2.70	2.70	0.00
8	100	0.9	+1	+3.70	3.70	0.00

Total hinge loss $= 0.40 + 0.15 = 0.55$ . Regularization term $= \frac{1}{2} ∥ w ∥^{2} = \frac{1}{2} (0.0 8^{2} + 2. 5^{2}) = \frac{1}{2} (0.0064 + 6.25) = 3.128$ .

Full objective:

$J = 3.128 + 1 \times 0.55 = 3.678$

Samples 4 and 5 are the only violators. Samples 1–3 and 6–8 have $y_{i} f_{i} \geq 1$ — they contribute nothing to the hinge loss term.

<line x1="60" y1="200" x2="475" y2="200" stroke="#e2e8f0" stroke-width="0.5"/>

<text x="60" y="213" text-anchor="middle" font-size="9" fill="#64748b">−2</text>
<text x="157" y="213" text-anchor="middle" font-size="9" fill="#64748b">−1</text>
<text x="254" y="213" text-anchor="middle" font-size="9" fill="#64748b">0</text>
<text x="350" y="213" text-anchor="middle" font-size="9" fill="#64748b">1</text>
<text x="447" y="213" text-anchor="middle" font-size="9" fill="#64748b">2</text>

<text x="48" y="203" text-anchor="end" font-size="9" fill="#64748b">0</text>
<text x="48" y="155" text-anchor="end" font-size="9" fill="#64748b">0.5</text>
<text x="48" y="108" text-anchor="end" font-size="9" fill="#64748b">1.0</text>
<text x="48" y="60" text-anchor="end" font-size="9" fill="#64748b">1.5</text>

<polyline points="60,15 157,108 254,200 350,200 447,200" fill="none" stroke="#ef4444" stroke-width="2.5"/>

<line x1="350" y1="15" x2="350" y2="200" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>
<text x="356" y="30" font-size="9" fill="#94a3b8">margin boundary</text>
<text x="356" y="40" font-size="9" fill="#94a3b8">yᵢfᵢ = 1</text>

<circle cx="311" cy="238" r="5" fill="#f59e0b"/>
<text x="320" y="242" font-size="9" fill="#f59e0b">sample 4 (0.60 → ξ=0.40)</text>
<circle cx="311" cy="181" r="5" fill="#3b82f6"/>
<text x="320" y="186" font-size="9" fill="#3b82f6">sample 5 (0.85 → ξ=0.15)</text>

<circle cx="311" cy="162" r="5" fill="#f59e0b"/>
<circle cx="326" cy="185" r="5" fill="#3b82f6"/>

<text x="70" y="120" font-size="9" fill="#ef4444">loss growing</text>
<text x="70" y="130" font-size="9" fill="#ef4444">(misclassified or</text>
<text x="70" y="140" font-size="9" fill="#ef4444">inside margin)</text>

<text x="370" y="180" font-size="9" fill="#22c55e">loss = 0</text>
<text x="370" y="190" font-size="9" fill="#22c55e">(correct &amp; outside</text>
<text x="370" y="200" font-size="9" fill="#22c55e">margin)</text>

The red polyline is the hinge loss. Everything to the right of $y_{i} f_{i} = 1$ contributes zero to the objective. The loss grows linearly as points move left (toward and past the boundary).

The Dual Problem

The primal soft margin SVM is a convex quadratic program in $w$ and $b$ . Its Lagrangian (introducing multipliers $α_{i} \geq 0$ for each constraint and $μ_{i} \geq 0$ for each $ξ_{i} \geq 0$ ) yields the dual problem by minimizing over $w$ , $b$ , $ξ$ and retaining only $α$ :

$Maximize D (α) = \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} y_{i} y_{j} (x_{i} \cdot x_{j})$

$subject to 0 \leq α_{i} \leq C, \sum_{i = 1}^{n} α_{i} y_{i} = 0$

Three observations:

The dual depends on inputs only through dot products $x_{i} \cdot x_{j}$ — this is where the kernel trick enters (next post).
The constraint $α_{i} \leq C$ (not just $α_{i} \geq 0$ ) comes from the soft margin; hard margin has $α_{i} \geq 0$ only.
The linear constraint $\sum α_{i} y_{i} = 0$ ensures the bias $b$ is free (no bound on $b$ ).

Once $α^{*}$ is found, the primal solution is recovered:

$w^{*} = \sum_{i = 1}^{n} α_{i}^{*} y_{i} x_{i}$

Most $α_{i}^{*} = 0$ . Only support vectors have $α_{i}^{*} > 0$ , making $w^{*}$ a sparse combination of training points.

KKT Conditions — The Three Cases

The Karush-Kuhn-Tucker (KKT) conditions characterize the optimal $α^{*}$ exactly. For each sample:

Condition	$α_{i}$	$ξ_{i}$	Meaning
$y_{i} f_{i} > 1$ (outside margin, correct)	$α_{i} = 0$	$ξ_{i} = 0$	Not a support vector; doesn't affect $w^{*}$
$y_{i} f_{i} = 1$ (on margin boundary)	$0 < α_{i} < C$	$ξ_{i} = 0$	True support vector
$y_{i} f_{i} < 1$ (inside margin or misclassified)	$α_{i} = C$	$ξ_{i} > 0$	Bounded support vector (violates margin)

This gives a direct algorithm to identify each sample's role from the trained model's $α$ values. Samples with $α_{i} = 0$ have no influence on the decision boundary.

Computing $b$ from Support Vectors

After solving for $w^{*}$ , compute $b$ by averaging over free support vectors (those with $0 < α_{i} < C$ ):

$b^{*} = \frac{1}{∣ S ∣} \sum_{i \in S} (y_{i} - w^{*} \cdot x_{i})$

where $S$ is the set of free support vectors. Using a single support vector would be numerically fragile; averaging improves stability.

Connecting Hinge Loss to Other Losses

SVM is one member of a broader family of loss functions, each emphasizing different tradeoffs:

<text x="60" y="213" text-anchor="middle" font-size="9" fill="#64748b">−2</text>
<text x="157" y="213" text-anchor="middle" font-size="9" fill="#64748b">−1</text>
<text x="254" y="213" text-anchor="middle" font-size="9" fill="#64748b">0</text>
<text x="350" y="213" text-anchor="middle" font-size="9" fill="#64748b">1</text>
<text x="447" y="213" text-anchor="middle" font-size="9" fill="#64748b">2</text>

<polyline points="60,15 157,108 254,200 350,200 447,200" fill="none" stroke="#ef4444" stroke-width="2.5"/>
<text x="65" y="50" font-size="9" fill="#ef4444" font-weight="bold">Hinge (SVM)</text>

<polyline points="60,20 100,35 157,60 200,90 254,128 300,162 350,190 400,200 447,200" fill="none" stroke="#3b82f6" stroke-width="2" stroke-dasharray="5,3"/>
<text x="65" y="65" font-size="9" fill="#3b82f6" font-weight="bold">Logistic (LR)</text>

<line x1="60" y1="108" x2="254" y2="108" stroke="#22c55e" stroke-width="2" stroke-dasharray="3,2"/>
<line x1="254" y1="200" x2="475" y2="200" stroke="#22c55e" stroke-width="2" stroke-dasharray="3,2"/>
<circle cx="254" cy="108" r="4" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<circle cx="254" cy="200" r="4" fill="#22c55e"/>
<text x="65" y="80" font-size="9" fill="#22c55e" font-weight="bold">0-1 loss (non-convex)</text>

<line x1="350" y1="15" x2="350" y2="200" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>

Hinge loss is the tightest convex upper bound on the 0-1 loss (the theoretically ideal but non-convex loss). Logistic loss is smooth everywhere; hinge is piecewise linear with a flat region. The flat region is what creates sparsity — all points outside the margin get $α_{i} = 0$ exactly.

Computing the Dual Objective on the Anchor

To verify the dual objective numerically, compute the dot product matrix $Q_{ij} = y_{i} y_{j} (x_{i} \cdot x_{j})$ :

python

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

X = np.array([[25,0.3],[30,0.4],[40,0.5],[55,0.6],
              [70,0.7],[80,0.8],[90,0.8],[100,0.9]])
y = np.array([-1,-1,-1,-1,1,1,1,1])

# Compute the kernel (dot product) matrix
K = X @ X.T          # shape (8,8)
yy = np.outer(y, y)  # shape (8,8): y_i * y_j
Q = yy * K           # Q_ij = y_i * y_j * (x_i . x_j)

print("Dot products for first 4 pairs:")
for i in range(4):
    for j in range(i, 4):
        print(f"  x{i+1}·x{j+1} = {X[i]@X[j]:.2f}")

Dot products for first 4 pairs:
  x1·x1 = 625.09
  x1·x2 = 750.12
  x1·x3 = 1000.15
  x1·x4 = 1375.18

The dual objective $D (α) = \sum α_{i} - \frac{1}{2} α^{T} Q α$ is maximized over $α$ subject to the box and linear constraints. sklearn's SVC solves this internally using libsvm's Sequential Minimal Optimization (SMO).

python

scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_sc, y)

print(f"Dual coefficients (α_i * y_i): {svm.dual_coef_}")
print(f"Support vectors:\n{svm.support_vectors_}")
print(f"Support vector indices: {svm.support_}")
print(f"n_support (per class): {svm.n_support_}")

Dual coefficients (α_i * y_i): [[-0.742 -1.0    1.0    0.742]]
Support vectors:
[[-0.811  -1.155]
 [ 0.271   0.000]
 [ 0.811   0.577]
 [ 1.623   1.155]]
Support vector indices: [2 3 4 5]
n_support (per class): [2 2]

4 support vectors (indices 2,3,4,5 → income=40,55,70,80). The dual coefficients are $α_{i} y_{i}$ : positive for $y = + 1$ support vectors and negative for $y = - 1$ support vectors. Notice |α_i| = 1.0 for indices 3,4 — these are bounded support vectors ( $α_{i} = C$ ), inside the margin.

Why the Dual Form Is Preferred

The primal has $d + 1$ variables ( $d$ weights + bias) and $n$ constraints. The dual has $n$ variables (one $α$ per sample) and $n + 1$ constraints. When $n ≪ d$ (many features, few samples), the dual is much smaller — but more importantly, the dual's dependence on $x_{i} \cdot x_{j}$ (not on $x_{i}$ and $x_{j}$ individually) is what makes the kernel trick possible.

Test Your Understanding

Sample 4 (income=55) has $y_{i} f_{i} = 0.60$ and sample 5 (income=70) has $y_{i} f_{i} = 0.85$ . What are their $α_{i}$ values relative to $C = 1$ ? Are they free support vectors ( $0 < α_{i} < C$ ) or bounded ( $α_{i} = C$ )?
The dual constraint is $\sum_{i = 1}^{n} α_{i} y_{i} = 0$ . The dual coefficients shown are $[- 0.742, - 1.0, 1.0, 0.742]$ (for $α_{i} y_{i}$ ). Verify that this constraint is satisfied.
Hinge loss is zero for $y_{i} f_{i} \geq 1$ . Logistic loss is never exactly zero. How does this difference in the "zero region" explain why SVM produces sparse $α$ values but logistic regression does not?
If $C \to \infty$ , the penalty for margin violations grows unboundedly. What happens to the constraint $α_{i} \leq C$ in the dual problem? Does the dual reduce to the hard margin problem?
The weight vector $w^{*} = \sum_{i} α_{i}^{*} y_{i} x_{i}$ uses only support vectors. If you add 100 new samples far from the decision boundary (all with $y_{i} f_{i} ≫ 1$ ), does $w^{*}$ change? Why or why not?

SVM: Math Intuition and Cost Function

From Slack Variables to Hinge Loss

Hinge Loss Trace on the Anchor

The Dual Problem

KKT Conditions — The Three Cases

Computing $b$ from Support Vectors

Connecting Hinge Loss to Other Losses

Computing the Dual Objective on the Anchor

Why the Dual Form Is Preferred

Test Your Understanding

Comments (0)

Leave a comment

SVM: Math Intuition and Cost Function

From Slack Variables to Hinge Loss

Hinge Loss Trace on the Anchor

The Dual Problem

KKT Conditions — The Three Cases

Computing b from Support Vectors

Connecting Hinge Loss to Other Losses

Computing the Dual Objective on the Anchor

Why the Dual Form Is Preferred

Test Your Understanding

Comments (0)

Leave a comment

Computing $b$ from Support Vectors