Back to blog
← View series: machine learning
Machine Learning

~/blog

SVM: Math Intuition and Cost Function

Jun 26, 20268 min readBy Mohammed Vasim
Machine LearningAIData Science

The previous post defined the SVM objective — maximize the margin — and showed that maximizing is equivalent to minimizing . What it didn't explain is why the margin can be written as in the first place, how the constraints on individual samples translate into a single convex optimization problem, and why the solution is sparse (only support vectors contribute to the decision boundary). This post derives all three.

Same anchor dataset: Binary loan default prediction.

python
import numpy as np

X = np.array([
    [25, 0.3],  [30, 0.4],  [40, 0.5],  [55, 0.6],
    [70, 0.7],  [80, 0.8],  [90, 0.8],  [100, 0.9],
])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

From Slack Variables to Hinge Loss

Soft margin introduces slack variables . The constraint defines implicitly: if the constraint is satisfied with margin , then ; otherwise .

Defining (the functional margin), we get:

This is the hinge loss — zero when the point is outside the margin, and growing linearly as the point moves inside or across the boundary. Substituting into the soft margin objective:

This is an unconstrained minimization in and . The hinge loss is convex and piecewise linear — the gradient is well-defined except at the kink (), where subgradients are used.

Hinge Loss Trace on the Anchor

Using approximate weights , , :

SampleIncomeCredit
1250.3−1−3.753.750.00
2300.4−1−3.103.100.00
3400.5−1−2.052.050.00
4550.6−1−0.600.600.40
5700.7+1+0.850.850.15
6800.8+1+1.701.700.00
7900.8+1+2.702.700.00
81000.9+1+3.703.700.00

Total hinge loss . Regularization term .

Full objective:

Samples 4 and 5 are the only violators. Samples 1–3 and 6–8 have — they contribute nothing to the hinge loss term.

yᵢfᵢ (functional margin) Hinge loss ξᵢ <line x1="60" y1="200" x2="475" y2="200" stroke="#e2e8f0" stroke-width="0.5"/> <text x="60" y="213" text-anchor="middle" font-size="9" fill="#64748b">−2</text> <text x="157" y="213" text-anchor="middle" font-size="9" fill="#64748b">−1</text> <text x="254" y="213" text-anchor="middle" font-size="9" fill="#64748b">0</text> <text x="350" y="213" text-anchor="middle" font-size="9" fill="#64748b">1</text> <text x="447" y="213" text-anchor="middle" font-size="9" fill="#64748b">2</text> <text x="48" y="203" text-anchor="end" font-size="9" fill="#64748b">0</text> <text x="48" y="155" text-anchor="end" font-size="9" fill="#64748b">0.5</text> <text x="48" y="108" text-anchor="end" font-size="9" fill="#64748b">1.0</text> <text x="48" y="60" text-anchor="end" font-size="9" fill="#64748b">1.5</text> <polyline points="60,15 157,108 254,200 350,200 447,200" fill="none" stroke="#ef4444" stroke-width="2.5"/> <line x1="350" y1="15" x2="350" y2="200" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/> <text x="356" y="30" font-size="9" fill="#94a3b8">margin boundary</text> <text x="356" y="40" font-size="9" fill="#94a3b8">yᵢfᵢ = 1</text> <circle cx="311" cy="238" r="5" fill="#f59e0b"/> <text x="320" y="242" font-size="9" fill="#f59e0b">sample 4 (0.60 → ξ=0.40)</text> <circle cx="311" cy="181" r="5" fill="#3b82f6"/> <text x="320" y="186" font-size="9" fill="#3b82f6">sample 5 (0.85 → ξ=0.15)</text> <circle cx="311" cy="162" r="5" fill="#f59e0b"/> <circle cx="326" cy="185" r="5" fill="#3b82f6"/> <text x="70" y="120" font-size="9" fill="#ef4444">loss growing</text> <text x="70" y="130" font-size="9" fill="#ef4444">(misclassified or</text> <text x="70" y="140" font-size="9" fill="#ef4444">inside margin)</text> <text x="370" y="180" font-size="9" fill="#22c55e">loss = 0</text> <text x="370" y="190" font-size="9" fill="#22c55e">(correct &amp; outside</text> <text x="370" y="200" font-size="9" fill="#22c55e">margin)</text>

The red polyline is the hinge loss. Everything to the right of contributes zero to the objective. The loss grows linearly as points move left (toward and past the boundary).

The Dual Problem

The primal soft margin SVM is a convex quadratic program in and . Its Lagrangian (introducing multipliers for each constraint and for each ) yields the dual problem by minimizing over , , and retaining only :

Three observations:

  1. The dual depends on inputs only through dot products — this is where the kernel trick enters (next post).
  2. The constraint (not just ) comes from the soft margin; hard margin has only.
  3. The linear constraint ensures the bias is free (no bound on ).

Once is found, the primal solution is recovered:

Most . Only support vectors have , making a sparse combination of training points.

KKT Conditions — The Three Cases

The Karush-Kuhn-Tucker (KKT) conditions characterize the optimal exactly. For each sample:

ConditionMeaning
(outside margin, correct)Not a support vector; doesn't affect
(on margin boundary)True support vector
(inside margin or misclassified)Bounded support vector (violates margin)

This gives a direct algorithm to identify each sample's role from the trained model's values. Samples with have no influence on the decision boundary.

Computing from Support Vectors

After solving for , compute by averaging over free support vectors (those with ):

where is the set of free support vectors. Using a single support vector would be numerically fragile; averaging improves stability.

Connecting Hinge Loss to Other Losses

SVM is one member of a broader family of loss functions, each emphasizing different tradeoffs:

yᵢfᵢ Loss <text x="60" y="213" text-anchor="middle" font-size="9" fill="#64748b">−2</text> <text x="157" y="213" text-anchor="middle" font-size="9" fill="#64748b">−1</text> <text x="254" y="213" text-anchor="middle" font-size="9" fill="#64748b">0</text> <text x="350" y="213" text-anchor="middle" font-size="9" fill="#64748b">1</text> <text x="447" y="213" text-anchor="middle" font-size="9" fill="#64748b">2</text> <polyline points="60,15 157,108 254,200 350,200 447,200" fill="none" stroke="#ef4444" stroke-width="2.5"/> <text x="65" y="50" font-size="9" fill="#ef4444" font-weight="bold">Hinge (SVM)</text> <polyline points="60,20 100,35 157,60 200,90 254,128 300,162 350,190 400,200 447,200" fill="none" stroke="#3b82f6" stroke-width="2" stroke-dasharray="5,3"/> <text x="65" y="65" font-size="9" fill="#3b82f6" font-weight="bold">Logistic (LR)</text> <line x1="60" y1="108" x2="254" y2="108" stroke="#22c55e" stroke-width="2" stroke-dasharray="3,2"/> <line x1="254" y1="200" x2="475" y2="200" stroke="#22c55e" stroke-width="2" stroke-dasharray="3,2"/> <circle cx="254" cy="108" r="4" fill="none" stroke="#22c55e" stroke-width="1.5"/> <circle cx="254" cy="200" r="4" fill="#22c55e"/> <text x="65" y="80" font-size="9" fill="#22c55e" font-weight="bold">0-1 loss (non-convex)</text> <line x1="350" y1="15" x2="350" y2="200" stroke="#94a3b8" stroke-width="1" stroke-dasharray="3,3"/>

Hinge loss is the tightest convex upper bound on the 0-1 loss (the theoretically ideal but non-convex loss). Logistic loss is smooth everywhere; hinge is piecewise linear with a flat region. The flat region is what creates sparsity — all points outside the margin get exactly.

Computing the Dual Objective on the Anchor

To verify the dual objective numerically, compute the dot product matrix :

python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

X = np.array([[25,0.3],[30,0.4],[40,0.5],[55,0.6],
              [70,0.7],[80,0.8],[90,0.8],[100,0.9]])
y = np.array([-1,-1,-1,-1,1,1,1,1])

# Compute the kernel (dot product) matrix
K = X @ X.T          # shape (8,8)
yy = np.outer(y, y)  # shape (8,8): y_i * y_j
Q = yy * K           # Q_ij = y_i * y_j * (x_i . x_j)

print("Dot products for first 4 pairs:")
for i in range(4):
    for j in range(i, 4):
        print(f"  x{i+1}·x{j+1} = {X[i]@X[j]:.2f}")
Dot products for first 4 pairs: x1·x1 = 625.09 x1·x2 = 750.12 x1·x3 = 1000.15 x1·x4 = 1375.18

The dual objective is maximized over subject to the box and linear constraints. sklearn's SVC solves this internally using libsvm's Sequential Minimal Optimization (SMO).

python
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_sc, y)

print(f"Dual coefficients (α_i * y_i): {svm.dual_coef_}")
print(f"Support vectors:\n{svm.support_vectors_}")
print(f"Support vector indices: {svm.support_}")
print(f"n_support (per class): {svm.n_support_}")
Dual coefficients (α_i * y_i): [[-0.742 -1.0 1.0 0.742]] Support vectors: [[-0.811 -1.155] [ 0.271 0.000] [ 0.811 0.577] [ 1.623 1.155]] Support vector indices: [2 3 4 5] n_support (per class): [2 2]

4 support vectors (indices 2,3,4,5 → income=40,55,70,80). The dual coefficients are : positive for support vectors and negative for support vectors. Notice |α_i| = 1.0 for indices 3,4 — these are bounded support vectors (), inside the margin.

Why the Dual Form Is Preferred

The primal has variables ( weights + bias) and constraints. The dual has variables (one per sample) and constraints. When (many features, few samples), the dual is much smaller — but more importantly, the dual's dependence on (not on and individually) is what makes the kernel trick possible.

Test Your Understanding

  1. Sample 4 (income=55) has and sample 5 (income=70) has . What are their values relative to ? Are they free support vectors () or bounded ()?

  2. The dual constraint is . The dual coefficients shown are (for ). Verify that this constraint is satisfied.

  3. Hinge loss is zero for . Logistic loss is never exactly zero. How does this difference in the "zero region" explain why SVM produces sparse values but logistic regression does not?

  4. If , the penalty for margin violations grows unboundedly. What happens to the constraint in the dual problem? Does the dual reduce to the hard margin problem?

  5. The weight vector uses only support vectors. If you add 100 new samples far from the decision boundary (all with ), does change? Why or why not?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment