KNN Classifier and Regressor: Full Implementation
With KNN theory and spatial data structures covered, this post runs a KNN classifier on the Wine dataset and a KNN regressor on California Housing: finding the optimal $k$, comparing weighted vs uniform voting, and benchmarking against linear regression.
KNN Classifier on the Wine Dataset
```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

wine = load_wine()
X, y = wine.data, wine.target
# Shape: (178, 13), 3 classes (cultivar types)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
print(f"Without scaling - Test Accuracy: {knn_raw.score(X_test, y_test):.4f}")

# Proper: with scaling (fit the scaler on the training split only)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
knn_sc = KNeighborsClassifier(n_neighbors=5)
knn_sc.fit(X_train_sc, y_train)
print(f"With scaling - Test Accuracy: {knn_sc.score(X_test_sc, y_test):.4f}")
```

```text
Without scaling - Test Accuracy: 0.7222
With scaling - Test Accuracy: 0.9444
```
A 22-point improvement from scaling alone. The Wine dataset includes features like alcohol (range 11–15) and proline (range 290–1680). Without scaling, two wines can differ by hundreds of units in proline but by only a unit or two in alcohol, so proline dominates the Euclidean distance and alcohol barely registers.
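To make the scale effect concrete, the sketch below splits a squared Euclidean distance into per-feature contributions for two hypothetical wines. The sample values and the helpers `squared_contributions` and `proline_share` are illustrative assumptions, and dividing by the approximate ranges quoted above is a crude stand-in for `StandardScaler`, not the same transform.

```python
# Two hypothetical wines; the values are illustrative, not rows from the dataset.
wine_a = {"alcohol": 12.0, "proline": 500.0}
wine_b = {"alcohol": 14.0, "proline": 900.0}

# Approximate feature ranges quoted in the text, used as a crude per-feature scale.
ranges = {"alcohol": 15.0 - 11.0, "proline": 1680.0 - 290.0}


def squared_contributions(p, q, scale=None):
    """Return each feature's squared difference, optionally divided by a scale."""
    if scale is None:
        scale = {name: 1.0 for name in p}
    return {name: ((p[name] - q[name]) / scale[name]) ** 2 for name in p}


def proline_share(contributions):
    """Fraction of the squared distance that comes from proline."""
    return contributions["proline"] / sum(contributions.values())


raw_share = proline_share(squared_contributions(wine_a, wine_b))
scaled_share = proline_share(squared_contributions(wine_a, wine_b, ranges))

print(f"Proline share of squared distance, raw:    {raw_share:.4f}")
print(f"Proline share of squared distance, scaled: {scaled_share:.4f}")
```

With these made-up values, proline accounts for almost all of the raw squared distance and for roughly a quarter of it after rescaling.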
Finding Optimal k
```python
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Pick the k with the highest mean 10-fold CV accuracy
best_k = k_range[np.argmax(cv_scores)]
print(f"Best k: {best_k}, CV Accuracy: {max(cv_scores):.4f}")
```

```text
Best k: 7, CV Accuracy: 0.9648
```
[Figure: 10-fold CV accuracy (y-axis, 0.80 to 1.00) versus k (x-axis, 1 to 30). The curve peaks at k=7 with CV=0.965; k=1 is marked as overfit.]
At $k=1$: high training accuracy (100%) but lower CV accuracy (~0.88), because the model memorizes training points. The peak is at $k=7$. As $k$ increases beyond 15, accuracy steadily declines because too many neighbors from other classes are included.
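One caveat with the loop above is that the scaler was fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. A possible variant, sketched below under the assumption that refitting the scaler per fold is wanted, wraps scaling and KNN in a pipeline and lets `GridSearchCV` search over $k$. The selected $k$ and score may differ from the numbers reported in this post.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k: {best_k}, CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

Because the raw `X_train` goes into `fit`, the test split is also scaled with statistics learned from training data only when `search.score` is called.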
Final Model Evaluation
```python
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_sc, y_train)
y_pred = best_knn.predict(X_test_sc)

print(f"Test Accuracy: {best_knn.score(X_test_sc, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
```

```text
Test Accuracy: 0.9722
              precision    recall  f1-score   support

     class_0       1.00      0.93      0.96        14
     class_1       0.93      1.00      0.97        14
     class_2       1.00      1.00      1.00         8

    accuracy                           0.97        36
```
97.2% accuracy. One misclassification: a class_0 wine predicted as class_1. Class 2 is perfectly separated — it likely has a distinct chemical signature.
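The precision and recall figures in the report can be recomputed by hand. The sketch below assumes the confusion matrix implied by the text (supports of 14, 14, and 8, with a single class_0 wine predicted as class_1); the matrix is reconstructed from that description, not printed by the original run.

```python
# Rows are true classes, columns are predicted classes.
# Assumed from the text: one class_0 sample predicted as class_1, all else correct.
labels = ["class_0", "class_1", "class_2"]
confusion = [
    [13, 1, 0],
    [0, 14, 0],
    [0, 0, 8],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

report = {}
for i, label in enumerate(labels):
    predicted_as_i = sum(row[i] for row in confusion)  # column sum
    truly_i = sum(confusion[i])                         # row sum (support)
    report[label] = {
        "precision": confusion[i][i] / predicted_as_i,
        "recall": confusion[i][i] / truly_i,
        "support": truly_i,
    }

print(f"accuracy = {accuracy:.4f}")
for label, m in report.items():
    print(
        f"{label}: precision={m['precision']:.2f}, "
        f"recall={m['recall']:.2f}, support={m['support']}"
    )
```

Precision is computed down a column (of everything predicted as this class, how much was right) and recall along a row (of everything truly in this class, how much was found), which is why the single error lowers recall for class_0 and precision for class_1.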
Weighted KNN — Distance-Based Voting
Uniform voting treats all neighbors equally. Distance-weighted voting gives closer neighbors more influence: each neighbor's vote is weighted by the inverse of its distance to the query, $w_i = 1/d_i$.
```python
for weights in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=best_k, weights=weights)
    knn.fit(X_train_sc, y_train)
    acc = knn.score(X_test_sc, y_test)
    cv = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy').mean()
    print(f"weights={weights:8s}: Test={acc:.4f}, CV={cv:.4f}")
```

```text
weights=uniform : Test=0.9722, CV=0.9648
weights=distance: Test=0.9722, CV=0.9648
```
Equal performance here: the wine classes are well separated, so neighbor distances don't matter much for the vote. Distance weighting matters most when (1) some neighbors are much closer than others, or (2) the decision boundary is noisy and the nearest neighbor is far more informative than the $k$-th.
Manual trace on the house anchor ($k = 3$, inverse-distance weights $w_i = 1/d_i$):
- Neighbor C (Suburban): $w_C = 2.0$
- Neighbor D (Urban): $w_D = 2.0$
- Neighbor E (Urban): $w_E = 0.71$

Uniform: Suburban = 1, Urban = 2 → Urban (2/3 ≈ 0.667)
Distance-weighted: Suburban weight = 2.0, Urban weight = 2.0 + 0.71 = 2.71 → Urban (2.71/4.71 ≈ 0.575)

Both predict Urban, but distance weighting makes the decision less confident (0.575 vs 0.667): E is farther away, so its Urban vote counts for only 0.71, while C's Suburban vote keeps the full weight of a close neighbor.
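The same vote can be written out in a few lines. The distances below are assumptions chosen so that $1/d$ reproduces the weights 2.0, 2.0, and 0.71 used in the trace, and `vote_shares` is a hypothetical helper, not part of scikit-learn.

```python
from collections import defaultdict


def vote_shares(neighbors, weighted):
    """Return each label's share of the vote.

    neighbors: list of (distance, label) pairs, all distances strictly positive.
    weighted:  if True, each vote counts 1/distance; otherwise each counts 1.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 / distance if weighted else 1.0
    grand_total = sum(totals.values())
    return {label: value / grand_total for label, value in totals.items()}


# Assumed distances: 1/0.5 = 2.0 and 1/1.41 is roughly 0.71.
neighbors = [(0.5, "Suburban"), (0.5, "Urban"), (1.41, "Urban")]

uniform = vote_shares(neighbors, weighted=False)
distance_weighted = vote_shares(neighbors, weighted=True)

print(f"Uniform:           Urban share = {uniform['Urban']:.3f}")
print(f"Distance-weighted: Urban share = {distance_weighted['Urban']:.3f}")
```

The helper deliberately does not handle a distance of exactly zero; that case needs its own rule.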
KNN Regressor on California Housing
```python
from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
# Use the first 5000 rows to keep the experiment fast
X, y = data.data[:5000], data.target[:5000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

print(f"{'k':>4} | {'RMSE':>8} | {'R²':>8}")
for k in [1, 3, 5, 10, 20, 50]:
    knn_reg = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_reg.fit(X_train_sc, y_train)
    y_pred = knn_reg.predict(X_test_sc)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{k:>4} | {rmse:>8.4f} | {r2:>8.4f}")
```

```text
   k |     RMSE |       R²
   1 |   0.6321 |   0.6789
   3 |   0.5812 |   0.7234
   5 |   0.5641 |   0.7389  ← best
  10 |   0.5823 |   0.7231
  20 |   0.6234 |   0.6891
  50 |   0.7012 |   0.6231
```
The peak is at $k=5$. At $k=1$ the model overfits to individual training samples (RMSE=0.632). At $k=50$ it averages over too many neighbors from different price zones (RMSE=0.701). $k=5$ is the bias-variance sweet spot.
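To show what `weights='distance'` does in regression, here is a minimal pure-Python sketch on made-up one-dimensional data. `knn_regress` is a hypothetical helper that mirrors the idea of an inverse-distance weighted mean; it is not scikit-learn's implementation, and it assumes the query never coincides with a training point.

```python
def knn_regress(train, query, k):
    """Predict a target as the inverse-distance weighted mean of the k nearest points.

    train: list of (x, y) pairs with scalar x.
    Assumes query differs from every training x, so no distance is zero.
    """
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    weights = [1.0 / abs(x - query) for x, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Made-up data: the last point sits in a much more expensive "zone".
train = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (8.0, 10.0)]
query = 3.0

for k in (1, 2, 4):
    print(f"k={k}: prediction = {knn_regress(train, query, k):.3f}")
```

With small $k$ the prediction follows the closest points; as $k$ grows, points from farther away, such as the expensive one at x=8, start to influence the estimate even though their weights are small.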
KNN vs Linear Regression
```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_sc, y_train)
y_pred_lr = lr.predict(X_test_sc)

knn_best = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_best.fit(X_train_sc, y_train)
y_pred_knn = knn_best.predict(X_test_sc)

print(f"Linear Regression: RMSE={np.sqrt(mean_squared_error(y_test, y_pred_lr)):.4f}, "
      f"R²={r2_score(y_test, y_pred_lr):.4f}")
print(f"KNN (k=5): RMSE={np.sqrt(mean_squared_error(y_test, y_pred_knn)):.4f}, "
      f"R²={r2_score(y_test, y_pred_knn):.4f}")
```

```text
Linear Regression: RMSE=0.7321, R²=0.6134
KNN (k=5): RMSE=0.5641, R²=0.7389
```
KNN beats linear regression by 12.6 R² points (0.7389 vs 0.6134). Housing prices have strong local structure: a district's price is better predicted by the districts closest to it in feature space, which includes latitude and longitude, than by a global linear formula. KNN exploits this locality directly, while linear regression fits a single global hyperplane over the entire state of California.
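The locality argument can be illustrated on synthetic data where the target is V-shaped, so no single line fits it. Everything in this sketch (the data, `fit_line`, `knn_mean`, `rmse`) is a made-up illustration rather than a rerun of the housing experiment.

```python
import math

# Synthetic V-shaped target: y = |x|. A global line cannot follow the bend.
train_x = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [abs(x) for x in train_x]
test_x = [-3.5, -1.5, 1.5, 3.5]
test_y = [abs(x) for x in test_x]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def knn_mean(xs, ys, query, k):
    """Uniform-weight KNN regression: mean target of the k nearest points."""
    nearest = sorted(zip(xs, ys), key=lambda point: abs(point[0] - query))[:k]
    return sum(y for _, y in nearest) / k


def rmse(truth, preds):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth))


slope, intercept = fit_line(train_x, train_y)
linear_preds = [slope * x + intercept for x in test_x]
knn_preds = [knn_mean(train_x, train_y, x, k=2) for x in test_x]

linear_rmse = rmse(test_y, linear_preds)
knn_rmse = rmse(test_y, knn_preds)
print(f"Linear RMSE:    {linear_rmse:.4f}")
print(f"KNN (k=2) RMSE: {knn_rmse:.4f}")
```

On this toy target the fitted line is flat, because the two arms of the V cancel out, while averaging the two nearest points recovers the local slope on either side.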
Common KNN Pitfalls
| Pitfall | Effect | Fix |
|---|---|---|
| No feature scaling | Distance dominated by large-scale features | Always use StandardScaler |
| $k$ too small ($k=1$) | Overfit: memorizes noise | Use CV to find optimal $k$ |
| $k$ too large | Underfit: ignores local structure | CV + elbow plot |
| Imbalanced classes | Majority class dominates votes | weights='distance' or stratified sampling |
| High dimensionality (large $d$) | Curse of dimensionality: all distances similar | PCA before KNN |
| Large $n$ (>100k) | Slow inference per query | algorithm='ball_tree' or approximate NN |
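The "all distances similar" effect in the table can be checked empirically. The sketch below, using uniformly random points as an assumed stand-in for real data, compares the relative gap between the farthest and nearest neighbor of a query in 2 and in 500 dimensions. `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)  # fixed seed so the run is repeatable
low_dim = relative_contrast(n_points=200, n_dims=2, rng=rng)
high_dim = relative_contrast(n_points=200, n_dims=500, rng=rng)

print(f"Relative contrast in 2 dimensions:   {low_dim:.2f}")
print(f"Relative contrast in 500 dimensions: {high_dim:.2f}")
```

A small contrast means the nearest and farthest points are almost equally far from the query, so "nearest" carries little information; that is the motivation for reducing dimensionality before running KNN.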
KNN Hyperparameter Guide
| Hyperparameter | Values to try | Effect |
|---|---|---|
| n_neighbors | [1, 3, 5, 7, 11, 15, 21] | Bias-variance tradeoff |
| weights | uniform, distance | Distance weighting helps with outlier neighbors |
| metric | euclidean, manhattan, minkowski | Euclidean standard; manhattan for sparse data |
| algorithm | auto, kd_tree, ball_tree, brute | Auto selects based on $n$ and $d$ |
| p (Minkowski) | 1=Manhattan, 2=Euclidean | Feature interaction assumptions |
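The `p` row can be made concrete with a direct implementation of the Minkowski distance. The helper below is a minimal sketch for illustration only; it is not how scikit-learn computes distances internally.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print(f"p=1 (Manhattan): {manhattan}")
print(f"p=2 (Euclidean): {euclidean}")
```

Raising `p` puts more emphasis on the largest per-feature difference, which is one way to read the "feature interaction assumptions" entry in the table.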
Test Your Understanding
- Without scaling, KNN test accuracy is 72.2%; with scaling, 94.4%. The Wine dataset has 13 features including proline (range ~290–1680) and color_intensity (range ~1–13). Estimate how much more proline contributes to Euclidean distance vs color_intensity before scaling. What ratio does this give?
- The CV accuracy curve peaks at $k=7$ and declines for larger $k$. If you plotted training accuracy (not CV) on the same chart, what would it look like? Where does training accuracy equal 1.0?
- Distance-weighted KNN uses $w_i = 1/d_i$. What happens if $d_i = 0$ (the query is identical to a training sample)? How should this edge case be handled, and what does sklearn do?
- KNN Regressor with $k=1$ gives RMSE=0.632. On the training set, $k=1$ gives perfect RMSE=0.0. The gap between training RMSE (0) and test RMSE (0.632) is the generalization gap. Is this gap larger or smaller than for linear regression? What does the size of the gap tell you?
- KNN has no explicit model, so it can't tell you "which features matter most." If a domain expert asks you why a particular house is predicted to cost $400k, what can you show them from the KNN model that serves as an explanation?