Back to blog
← View series: machine learning
Machine Learning

~/blog

The Kernel Trick

Jun 27, 202613 min readBy Mohammed Vasim
Machine LearningAIData Science

The kernel trick lets you train an SVM in a space with millions — or infinitely many — dimensions without ever computing a single coordinate in that space. The mechanism is a one-line substitution in the dual objective. Every formula in the dual where appears gets replaced by . The reason this works at all is that the dual depends on dot products between samples, never on the weight vector directly.

Anchor dataset: XOR classification — four points, no linear boundary possible.

python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])
# Positive class (y=+1): same-sign quadrants (Q1, Q3)
# Negative class (y=-1): opposite-sign quadrants (Q2, Q4)

Phase 1 — Why the Dual Is the Gateway

The primal SVM solves:

The weight vector lives in the same space as the input features. To use a nonlinear feature map , you would need to store and optimize a -dimensional — impossible when (as with the RBF kernel).

The dual objective is:

subject to , .

The weight vector drops out entirely. The dual depends only on pairwise inner products between training points. Substitute the feature map :

A kernel function computes this inner product in the lifted space directly from the original inputs — no explicit computation needed. The prediction for a new point also collapses to kernel evaluations:

Dual (primal input space) max Σ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢ · xⱼ ← only dot products substitute K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ) Dual (kernelized — works in any feature space) max Σ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ) ← φ never computed

The primal needs . The dual needs only the kernel matrix. When — or when — the dual is the only tractable option.

Phase 2 — The Gram Matrix

The full set of pairwise kernel values between all training points is the Gram matrix , where . This is the only structure the SVM optimizer ever sees — not the raw features, not .

The degree-2 homogeneous polynomial kernel corresponds to the feature map . The component captures the sign interaction that separates XOR.

Verifying the feature map works: for and (both positive class):

They map to the same point. For (negative class): . The component flips sign — exactly what separates the classes.

Now computing the full 4×4 Gram matrix. Every entry :

Pair

The full matrix (using ):

Same-class pairs (rows 1–2 and rows 3–4) have . Cross-class pairs have . The block structure in the kernel matrix is exactly the class structure.

Gram Matrix K — Polynomial Kernel (x·z)² x₁(+1) x₂(+1) x₃(−1) x₄(−1) x₁(+1) x₂(+1) x₃(−1) x₄(−1) 4 4 0 0 4 4 0 0 0 0 4 4 0 0 4 4 same class (+1) K = 4 same class (−1) K = 4 cross-class K = 0

The SVM optimizer solves the dual using only this matrix. The 3D feature space is never instantiated.

Phase 3 — Prediction via Kernel Evaluations

The dual solution for this symmetric XOR problem: by symmetry all four are equal. Setting and the constraint : ✓.

The dual objective becomes:

Setting the derivative to zero: for all four points.

The bias : substituting (a support vector) into :

Since : .

Now predict (falls in Q1, should be class +1). Compute for each support vector:

Support vector (contribution)

Positive — correctly classified as +1. The votes from and are exactly zero because their sign interaction () is orthogonal to 's orientation.

Prediction for x_test = (0.5, 0.5) 0 +0.125 Support Vectors +0.125 x₁=(1,1) y=+1, K=1.0 +0.125 x₂=(−1,−1) y=+1, K=1.0 0 x₃=(1,−1) y=−1, K=0 0 x₄=(−1,1) y=−1, K=0 Σ = +0.25 → class +1

Phase 4 — Why RBF Is Infinite-Dimensional

For the polynomial kernel, the lifted space is finite: degree-2 maps . The RBF kernel corresponds to an infinite-dimensional space.

For scalar inputs, expand via :

This is a dot product where:

One component for every polynomial degree from 0 to . The feature map is infinite-dimensional. Yet computing costs — the same as an ordinary dot product.

RBF kernel values () on representative anchor pairs:

PairRelationship
vs Same point
vs Cross-class, adjacent
vs Cross-class, adjacent
vs Same class, diagonal

The same-class pair is further apart than the cross-class pairs . XOR puts same-class points in opposite quadrants. RBF solves it not by finding same-class proximity but by fitting a complex boundary around the Q1/Q3 region.

RBF Kernel Similarity Decay (γ = 0.5) Squared distance ‖xᵢ − xⱼ‖² 0 4 8 K=1.0 x₁ vs x₁ K=0.135 x₁ vs x₃ or x₄ (cross-class) K=0.018 x₁ vs x₂ (same class!) decay curve Same-class points (Q1, Q3) are farther apart than cross-class points — RBF solves XOR by boundary shape, not proximity

γ Sensitivity

γ controls how quickly the RBF kernel decays with distance. Small γ: smooth kernel — each point's influence extends far. Large γ: sharp kernel — each point only sees itself.

python
from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])

for gamma in [0.05, 0.5, 2, 20]:
    svm = SVC(kernel='rbf', C=10, gamma=gamma)
    svm.fit(X, y)
    print(f"gamma={gamma:>5}: acc={svm.score(X,y):.2f}, n_SV={len(svm.support_vectors_)}")
text
gamma= 0.05: acc=0.50, n_SV=4
gamma=  0.5: acc=1.00, n_SV=4
gamma=    2: acc=1.00, n_SV=4
gamma=   20: acc=1.00, n_SV=4

At : the kernel is so smooth that for all pairs — the optimizer sees every point as equally similar to every other point and cannot find a separating boundary. At : the decay is fast enough to distinguish the XOR structure. At : each point's kernel value drops to near zero within a tiny radius — the SVM draws tight circles around each training point (all 4 become support vectors) and will fail on any test point not close to a training point.

Computational Cost

ApproachFeature storagePrediction costFeasibility
Explicit (degree-2, ) floatsTrivial
Explicit (degree-10, ) floatsMarginal
Explicit (RBF) dimensionsImpossibleNever
Kernel Gram matrixFeasible when small

The kernel wins when . The trap: kernel SVMs compute the full Gram matrix during training. At , that is float64 values — roughly 80 GB. This is why kernel SVMs are replaced by approximations at scale (Nyström method, random Fourier features) once exceeds roughly 50,000.

python
from sklearn.svm import SVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
y = np.array([1, 1, -1, -1])

exact = SVC(kernel='rbf', C=10, gamma=0.5)
approx = make_pipeline(
    RBFSampler(gamma=0.5, n_components=100, random_state=42),
    SGDClassifier(max_iter=1000, random_state=42)
)
exact.fit(X, y)
approx.fit(X, y)
print(f"Exact kernel SVM:   {exact.score(X, y):.2f}")
print(f"Approx (RFF, n=100): {approx.score(X, y):.2f}")
text
Exact kernel SVM:    1.00
Approx (RFF, n=100): 1.00

RBFSampler approximates the RBF kernel via random Fourier features — maps inputs to a fixed -dimensional space (here ) and then trains a linear model. Same accuracy on this tiny dataset, but the approach scales to millions of samples.

This post depends entirely on the dual SVM formulation from post 02 — specifically that the dual objective collapses out and retains only pairwise inner products . Without the dual, there is no substitution point. From here, the Gram matrix structure generalizes directly: kernel PCA replaces the covariance matrix with , kernel ridge regression solves instead of , and Gaussian processes use as the prior covariance. Mercer's condition (positive semi-definiteness of ) — covered in post 03 — is the formal requirement that makes the substitution valid.

The kernel trick is structurally tied to models whose training objective collapses to dot products — SVMs, kernel ridge regression, kernel PCA. Decision trees, random forests, and gradient boosting have no dual formulation of this form and cannot be kernelized. Logistic regression can be kernelized (via the dual of the regularized problem) but rarely is in practice. At the Gram matrix becomes the bottleneck: 80 GB at in float64, before even running an optimizer. The correct response is the Nyström approximation (subsample points, approximate ) or random Fourier features, both of which trade approximation error for tractable memory. Finally, choosing the wrong kernel is equivalent to choosing the wrong model family: a polynomial kernel of degree 3 on data with a degree-2 structure will overfit; a linear kernel on XOR will fail entirely. Kernel selection is a hyperparameter search problem, not a free choice.

Test Your Understanding

  1. The dual objective contains . In the primal, the analogous term is . Show algebraically that these are equal when — this is why the dual value equals the primal value at optimum.

  2. The Gram matrix for the XOR anchor with kernel has a perfect block structure. If you added a fifth point (positive class), compute its row in the Gram matrix. Does the block structure hold?

  3. At the RBF kernel fails on XOR (50% accuracy). Using the Taylor expansion, explain why small causes the kernel to assign nearly equal similarity to all pairs — what does this do to the Gram matrix?

  4. The Nyström approximation builds a rank- approximation of the Gram matrix by selecting "landmark" points. If and , what are the storage requirements for the Nyström approximation vs the full Gram matrix?

  5. The polynomial kernel separates XOR perfectly, but the kernel does not (all off-diagonal entries equal 1, giving no class separation). Expand and identify which terms in the feature map are the same for all XOR points — this reveals why the intercept destroys the structure.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment