← View series: machine learning
~/blog
The Kernel Trick
The kernel trick lets you train an SVM in a space with millions — or infinitely many — dimensions without ever computing a single coordinate in that space. The mechanism is a one-line substitution in the dual objective. Every formula in the dual where appears gets replaced by . The reason this works at all is that the dual depends on dot products between samples, never on the weight vector directly.
Anchor dataset: XOR classification — four points, no linear boundary possible.
import numpy as np
from sklearn.svm import SVC
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
# Positive class (y=+1): same-sign quadrants (Q1, Q3)
# Negative class (y=-1): opposite-sign quadrants (Q2, Q4)Phase 1 — Why the Dual Is the Gateway
The primal SVM solves:
The weight vector lives in the same space as the input features. To use a nonlinear feature map , you would need to store and optimize a -dimensional — impossible when (as with the RBF kernel).
The dual objective is:
subject to , .
The weight vector drops out entirely. The dual depends only on pairwise inner products between training points. Substitute the feature map :
A kernel function computes this inner product in the lifted space directly from the original inputs — no explicit computation needed. The prediction for a new point also collapses to kernel evaluations:
The primal needs . The dual needs only the kernel matrix. When — or when — the dual is the only tractable option.
Phase 2 — The Gram Matrix
The full set of pairwise kernel values between all training points is the Gram matrix , where . This is the only structure the SVM optimizer ever sees — not the raw features, not .
The degree-2 homogeneous polynomial kernel corresponds to the feature map . The component captures the sign interaction that separates XOR.
Verifying the feature map works: for and (both positive class):
They map to the same point. For (negative class): . The component flips sign — exactly what separates the classes.
Now computing the full 4×4 Gram matrix. Every entry :
| Pair | ||||
|---|---|---|---|---|
The full matrix (using ):
Same-class pairs (rows 1–2 and rows 3–4) have . Cross-class pairs have . The block structure in the kernel matrix is exactly the class structure.
The SVM optimizer solves the dual using only this matrix. The 3D feature space is never instantiated.
Phase 3 — Prediction via Kernel Evaluations
The dual solution for this symmetric XOR problem: by symmetry all four are equal. Setting and the constraint : ✓.
The dual objective becomes:
Setting the derivative to zero: for all four points.
The bias : substituting (a support vector) into :
Since : .
Now predict (falls in Q1, should be class +1). Compute for each support vector:
| Support vector | (contribution) | ||
|---|---|---|---|
Positive — correctly classified as +1. The votes from and are exactly zero because their sign interaction () is orthogonal to 's orientation.
Phase 4 — Why RBF Is Infinite-Dimensional
For the polynomial kernel, the lifted space is finite: degree-2 maps . The RBF kernel corresponds to an infinite-dimensional space.
For scalar inputs, expand via :
This is a dot product where:
One component for every polynomial degree from 0 to . The feature map is infinite-dimensional. Yet computing costs — the same as an ordinary dot product.
RBF kernel values () on representative anchor pairs:
| Pair | Relationship | ||
|---|---|---|---|
| vs | Same point | ||
| vs | Cross-class, adjacent | ||
| vs | Cross-class, adjacent | ||
| vs | Same class, diagonal |
The same-class pair is further apart than the cross-class pairs . XOR puts same-class points in opposite quadrants. RBF solves it not by finding same-class proximity but by fitting a complex boundary around the Q1/Q3 region.
γ Sensitivity
γ controls how quickly the RBF kernel decays with distance. Small γ: smooth kernel — each point's influence extends far. Large γ: sharp kernel — each point only sees itself.
from sklearn.svm import SVC
import numpy as np
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
for gamma in [0.05, 0.5, 2, 20]:
svm = SVC(kernel='rbf', C=10, gamma=gamma)
svm.fit(X, y)
print(f"gamma={gamma:>5}: acc={svm.score(X,y):.2f}, n_SV={len(svm.support_vectors_)}")gamma= 0.05: acc=0.50, n_SV=4
gamma= 0.5: acc=1.00, n_SV=4
gamma= 2: acc=1.00, n_SV=4
gamma= 20: acc=1.00, n_SV=4At : the kernel is so smooth that for all pairs — the optimizer sees every point as equally similar to every other point and cannot find a separating boundary. At : the decay is fast enough to distinguish the XOR structure. At : each point's kernel value drops to near zero within a tiny radius — the SVM draws tight circles around each training point (all 4 become support vectors) and will fail on any test point not close to a training point.
Computational Cost
| Approach | Feature storage | Prediction cost | Feasibility |
|---|---|---|---|
| Explicit (degree-2, ) | floats | Trivial | |
| Explicit (degree-10, ) | floats | Marginal | |
| Explicit (RBF) | dimensions | Impossible | Never |
| Kernel | Gram matrix | Feasible when small |
The kernel wins when . The trap: kernel SVMs compute the full Gram matrix during training. At , that is float64 values — roughly 80 GB. This is why kernel SVMs are replaced by approximations at scale (Nyström method, random Fourier features) once exceeds roughly 50,000.
from sklearn.svm import SVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
import numpy as np
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
exact = SVC(kernel='rbf', C=10, gamma=0.5)
approx = make_pipeline(
RBFSampler(gamma=0.5, n_components=100, random_state=42),
SGDClassifier(max_iter=1000, random_state=42)
)
exact.fit(X, y)
approx.fit(X, y)
print(f"Exact kernel SVM: {exact.score(X, y):.2f}")
print(f"Approx (RFF, n=100): {approx.score(X, y):.2f}")Exact kernel SVM: 1.00
Approx (RFF, n=100): 1.00RBFSampler approximates the RBF kernel via random Fourier features — maps inputs to a fixed -dimensional space (here ) and then trains a linear model. Same accuracy on this tiny dataset, but the approach scales to millions of samples.
Related Concepts and Honest Limitations
This post depends entirely on the dual SVM formulation from post 02 — specifically that the dual objective collapses out and retains only pairwise inner products . Without the dual, there is no substitution point. From here, the Gram matrix structure generalizes directly: kernel PCA replaces the covariance matrix with , kernel ridge regression solves instead of , and Gaussian processes use as the prior covariance. Mercer's condition (positive semi-definiteness of ) — covered in post 03 — is the formal requirement that makes the substitution valid.
The kernel trick is structurally tied to models whose training objective collapses to dot products — SVMs, kernel ridge regression, kernel PCA. Decision trees, random forests, and gradient boosting have no dual formulation of this form and cannot be kernelized. Logistic regression can be kernelized (via the dual of the regularized problem) but rarely is in practice. At the Gram matrix becomes the bottleneck: 80 GB at in float64, before even running an optimizer. The correct response is the Nyström approximation (subsample points, approximate ) or random Fourier features, both of which trade approximation error for tractable memory. Finally, choosing the wrong kernel is equivalent to choosing the wrong model family: a polynomial kernel of degree 3 on data with a degree-2 structure will overfit; a linear kernel on XOR will fail entirely. Kernel selection is a hyperparameter search problem, not a free choice.
Test Your Understanding
-
The dual objective contains . In the primal, the analogous term is . Show algebraically that these are equal when — this is why the dual value equals the primal value at optimum.
-
The Gram matrix for the XOR anchor with kernel has a perfect block structure. If you added a fifth point (positive class), compute its row in the Gram matrix. Does the block structure hold?
-
At the RBF kernel fails on XOR (50% accuracy). Using the Taylor expansion, explain why small causes the kernel to assign nearly equal similarity to all pairs — what does this do to the Gram matrix?
-
The Nyström approximation builds a rank- approximation of the Gram matrix by selecting "landmark" points. If and , what are the storage requirements for the Nyström approximation vs the full Gram matrix?
-
The polynomial kernel separates XOR perfectly, but the kernel does not (all off-diagonal entries equal 1, giving no class separation). Expand and identify which terms in the feature map are the same for all XOR points — this reveals why the intercept destroys the structure.