← View series: machine learning
~/blog
SVM Kernels
The linear SVM finds the maximum-margin hyperplane in the original feature space. When classes aren't linearly separable — like the XOR pattern — no hyperplane works, no matter how wide a margin you allow. The kernel trick solves this without ever explicitly computing a high-dimensional transformation.
Anchor dataset: XOR classification. Four points in the positive class, four in the negative class, arranged so no line can separate them.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
X = np.array([
[ 1, 1], [ 2, 2], [-1, -1], [-2, -2], # positive class: same sign
[ 1, -1], [ 2, -2], [-1, 1], [-2, 2], # negative class: opposite sign
])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X, y)
train_acc = svm_linear.score(X, y)
print(f"Linear SVM train accuracy on XOR: {train_acc:.2f}")Linear SVM train accuracy on XOR: 0.50
Linear SVM gets only 50% accuracy — equivalent to random guessing. The XOR pattern requires a nonlinear boundary.
Why Linear Fails on XOR
<text x="265" y="270" text-anchor="middle" font-size="11" fill="#334155">x₁</text>
<text x="25" y="140" text-anchor="middle" font-size="11" fill="#334155" transform="rotate(-90,25,140)">x₂</text>
<circle cx="308" cy="77" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="366" cy="39" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="192" cy="193" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<circle cx="134" cy="231" r="8" fill="#3b82f6" stroke="#2563eb" stroke-width="1.5"/>
<text x="380" y="30" font-size="9" fill="#3b82f6">y=+1</text>
<rect x="300" y="185" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="358" y="222" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="184" y="65" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<rect x="126" y="27" width="14" height="14" fill="#ef4444" stroke="#dc2626" stroke-width="1.5"/>
<text x="375" y="220" font-size="9" fill="#ef4444">y=−1</text>
<text x="320" y="220" font-size="9" fill="#64748b">(1,−1)</text>
<text x="320" y="70" font-size="9" fill="#64748b">(1,1)</text>
<text x="140" y="220" font-size="9" fill="#64748b">(−1,1)</text>
<text x="140" y="70" font-size="9" fill="#64748b">(−1,−1)</text>
<text x="68" y="160" font-size="8" fill="#94a3b8">Positive class: same sign quadrants</text>
<text x="68" y="172" font-size="8" fill="#94a3b8">Negative class: opposite sign quadrants</text>
Positive class sits in Q1 and Q3 (same-sign quadrants); negative class in Q2 and Q4. Any line that separates Q1 from Q4 fails to separate Q3 from Q2 — the pattern wraps around.
The Feature Map Solution
Instead of drawing a line in 2D, map to a 3D space where the classes become linearly separable. The feature map :
The third component captures the sign interaction — positive when have the same sign, negative otherwise.
| Point | |||||
|---|---|---|---|---|---|
| (1,1) | +1 | +1 | 1 | 1 | +1 |
| (2,2) | +1 | +1 | 4 | 4 | +4 |
| (−1,−1) | +1 | +1 | 1 | 1 | +1 |
| (−2,−2) | +1 | +1 | 4 | 4 | +4 |
| (1,−1) | −1 | −1 | 1 | 1 | −1 |
| (2,−2) | −1 | −1 | 4 | 4 | −4 |
| (−1,1) | −1 | −1 | 1 | 1 | −1 |
| (−2,2) | −1 | −1 | 4 | 4 | −4 |
In the lifted space, all positive class points have and all negative class points have . The hyperplane (i.e., ) separates them perfectly.
The Kernel Trick
Computing explicitly then taking dot products is expensive — the mapped space can be millions of dimensions. The kernel trick shows that what appears in the dual objective is only the dot product , and this inner product can often be computed without ever computing explicitly.
A kernel function is:
computed directly in the original space. For the polynomial kernel of degree 2:
Expanding:
This equals for the feature map — without ever computing the 6-dimensional vector.
Polynomial Kernel Values on the Anchor
Computing for representative pairs:
| Pair | Same class? | ||
|---|---|---|---|
| (1,1) vs (2,2) | Yes (+1) | ||
| (1,1) vs (1,−1) | No | ||
| (1,1) vs (−1,−1) | Yes (+1) | ||
| (1,−1) vs (2,−2) | Yes (−1) | ||
| (1,1) vs (−2,2) | No |
Same-class pairs share high kernel values (25) when features have the same magnitude. Cross-class pairs get — the offset from the constant term only, no sign interaction.
The RBF Kernel
The Radial Basis Function (Gaussian) kernel is the most commonly used:
- controls how quickly similarity decays with distance
- As : all points are equally similar (underfitting)
- As : only identical points are similar (overfitting)
- RBF corresponds to an infinite-dimensional feature map — it can represent any smooth decision boundary
RBF kernel values for on the anchor:
| Pair | Same class? | ||
|---|---|---|---|
| (1,1) vs (2,2) | Yes (+1) | ||
| (1,1) vs (1,−1) | No | ||
| (1,1) vs (−1,−1) | Yes (+1) |
The (1,1) vs (−1,−1) pair gets — even though they're in the same class, they're far apart in distance. The kernel doesn't know class labels; it only measures similarity by proximity.
Effect on the Decision Boundary
import matplotlib
matplotlib.use('Agg')
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
X = np.array([[1,1],[2,2],[-1,-1],[-2,-2],[1,-1],[2,-2],[-1,1],[-2,2]])
y = np.array([1,1,1,1,-1,-1,-1,-1])
gammas = [0.1, 1, 10, 100]
for gamma in gammas:
svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
svm.fit(X, y)
train_acc = svm.score(X, y)
print(f"gamma={gamma:>5}: train_acc={train_acc:.2f}, n_SV={svm.support_vectors_.shape[0]}")gamma= 0.1: train_acc=0.50, n_SV=8
gamma= 1: train_acc=1.00, n_SV=4
gamma= 10: train_acc=1.00, n_SV=8
gamma= 100: train_acc=1.00, n_SV=8
At : still fails — the RBF kernel is too smooth (each point's influence extends very far, averaging out the XOR structure). At : perfectly separates — four support vectors, tight boundary. At : still 100% train accuracy but all 8 points become support vectors — the kernel is so local that each point only "sees" itself.
<rect x="10" y="16" width="130" height="145" fill="#fff0f0" rx="3"/>
<rect x="165" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>
<rect x="325" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>
<rect x="480" y="16" width="130" height="145" fill="#f0fff4" rx="3"/>
<line x1="10" y1="88" x2="140" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="75" y1="16" x2="75" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="165" y1="88" x2="295" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="230" y1="16" x2="230" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="325" y1="88" x2="455" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="390" y1="16" x2="390" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="480" y1="88" x2="610" y2="88" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="545" y1="16" x2="545" y2="161" stroke="#cbd5e1" stroke-width="0.8"/>
<line x1="10" y1="88" x2="140" y2="88" stroke="#ef4444" stroke-width="2"/>
<text x="75" y="130" text-anchor="middle" font-size="8" fill="#ef4444">straight line</text>
<text x="75" y="140" text-anchor="middle" font-size="8" fill="#ef4444">50% accuracy</text>
<path d="M 165,88 Q 210,30 230,88 Q 250,146 295,88" fill="none" stroke="#22c55e" stroke-width="2"/>
<path d="M 165,88 Q 210,146 230,88 Q 250,30 295,88" fill="none" stroke="#22c55e" stroke-width="2"/>
<text x="230" y="152" text-anchor="middle" font-size="8" fill="#22c55e">X-shaped boundary</text>
<text x="230" y="162" text-anchor="middle" font-size="8" fill="#22c55e">100% accuracy, 4 SV</text>
<ellipse cx="355" cy="65" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="425" cy="111" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="355" cy="111" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<ellipse cx="425" cy="65" rx="22" ry="18" fill="none" stroke="#22c55e" stroke-width="1.5"/>
<text x="390" y="152" text-anchor="middle" font-size="8" fill="#22c55e">100% accuracy, 8 SV</text>
<text x="390" y="162" text-anchor="middle" font-size="8" fill="#22c55e">(overfit in prod)</text>
<ellipse cx="510" cy="58" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="580" cy="118" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="510" cy="118" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<ellipse cx="580" cy="58" rx="14" ry="12" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
<text x="545" y="152" text-anchor="middle" font-size="8" fill="#f59e0b">100% train, but</text>
<text x="545" y="162" text-anchor="middle" font-size="8" fill="#f59e0b">circles each point</text>
<circle cx="100" cy="63" r="5" fill="#3b82f6"/>
<circle cx="120" cy="43" r="5" fill="#3b82f6"/>
<circle cx="50" cy="113" r="5" fill="#3b82f6"/>
<circle cx="30" cy="133" r="5" fill="#3b82f6"/>
<rect x="95" y="108" width="9" height="9" fill="#ef4444"/>
<rect x="115" y="128" width="9" height="9" fill="#ef4444"/>
<rect x="45" y="58" width="9" height="9" fill="#ef4444"/>
<rect x="25" y="38" width="9" height="9" fill="#ef4444"/>
Choosing a Kernel
| Kernel | Formula | When to use |
|---|---|---|
| Linear | High-dimensional data, text, when | |
| Polynomial (deg ) | Image patches, polynomial relationships | |
| RBF (Gaussian) | Default choice; works when decision boundary is complex | |
| Sigmoid | Rarely used; not always a valid kernel |
RBF is the default for unknown structure. Linear wins when the data is already high-dimensional (text BOW, embeddings) because: (a) it's often linearly separable in high dimensions already, (b) the dual has variables but the primal has — using the primal (with LinearSVC) is faster.
Mercer's Theorem
A function is a valid kernel (corresponds to some feature map ) if and only if the kernel matrix is positive semi-definite for any set of points. This is Mercer's condition. The sigmoid kernel violates this for some parameter choices — SVM convergence is not guaranteed when using non-PSD kernels.
Test Your Understanding
-
The feature map separates XOR. What hyperplane in the 3D lifted space corresponds to the decision boundary ? Express it in terms of .
-
The polynomial kernel computes a 6-dimensional dot product without explicitly computing 6-dimensional vectors. What is the computational cost of computing in the original 2D space? What would it cost to explicitly compute in 6D?
-
At , the RBF kernel fails on XOR (50% accuracy). Explain geometrically: if the RBF kernel is very smooth (large influence radius), what does the learned decision function look like?
-
At , all 8 training samples become support vectors. This achieves 100% train accuracy but is likely to generalize poorly. Given the KKT conditions from post 02, what does it mean when every point is a support vector?
-
The linear kernel SVC fails on XOR but would succeed on the lifted XOR data (with as a third feature). If you add as a feature and use a linear kernel, would the result be identical to using an RBF kernel with on the original data? Why or why not?