KNN Classifier and Regressor: Full Implementation

Jun 26, 2026 · 6 min read · By Mohammed Vasim
Machine Learning, AI, Data Science

KNN theory and spatial data structures are complete. This post runs a KNN classifier on scikit-learn's Wine dataset and a KNN regressor on California Housing: finding the optimal k, comparing weighted vs. uniform voting, and benchmarking against linear regression.

KNN Classifier on the Wine Dataset

python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

wine = load_wine()
X, y = wine.data, wine.target
# Shape: (178, 13), 3 classes (cultivar types)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline — without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
print(f"Without scaling — Test Accuracy: {knn_raw.score(X_test, y_test):.4f}")

# Proper — with scaling
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

knn_sc = KNeighborsClassifier(n_neighbors=5)
knn_sc.fit(X_train_sc, y_train)
print(f"With scaling    — Test Accuracy: {knn_sc.score(X_test_sc, y_test):.4f}")
Without scaling — Test Accuracy: 0.7222
With scaling    — Test Accuracy: 0.9444

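A 22-percentage-point improvement from scaling alone. The wine dataset includes features like alcohol (range 11–15) and proline (range 290–1680). The proline range is roughly 350 times wider than the alcohol range, so without scaling, differences in proline dominate the Euclidean distance and differences in alcohol barely register.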
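To make the imbalance concrete, here is a minimal sketch with two made-up wines. The values are illustrative assumptions that merely fall inside the ranges quoted above; they are not rows from the dataset.

```python
import math

# Two hypothetical wines as (alcohol, proline); values are illustrative only,
# chosen to fall inside the ranges quoted above.
wine_a = (12.0, 500.0)
wine_b = (14.0, 520.0)

# Squared per-feature differences that feed the Euclidean distance
alcohol_sq = (wine_a[0] - wine_b[0]) ** 2   # 2 units apart: half the alcohol range
proline_sq = (wine_a[1] - wine_b[1]) ** 2   # 20 units apart: ~1.4% of the proline range

distance = math.sqrt(alcohol_sq + proline_sq)
proline_share = proline_sq / (alcohol_sq + proline_sq)

print(f"distance = {distance:.2f}")                                  # 20.10
print(f"proline share of squared distance = {proline_share:.1%}")   # 99.0%
```

Even though the alcohol gap is large relative to its range and the proline gap is tiny relative to its range, proline accounts for about 99% of the squared distance. Standardizing both features removes this effect.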

Finding Optimal k

python
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k = k_range[np.argmax(cv_scores)]
print(f"Best k: {best_k}, CV Accuracy: {max(cv_scores):.4f}")
Best k: 7, CV Accuracy: 0.9648

[Figure: CV accuracy vs. k (number of neighbors) for k = 1 to 30. The curve peaks at k=7 (CV=0.965); k=1 is marked as overfit.]

At k = 1, training accuracy is 100% but CV accuracy is lower (~0.88): the model memorizes training points. CV accuracy peaks at k = 7. As k increases beyond 15, accuracy steadily declines because too many neighbors from other classes are included.
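One caveat about the search above: X_train_sc was standardized with statistics from the whole training set, so each fold's validation samples have a small influence on the scaler. A hedged alternative is to put the scaler inside a pipeline so it is refit on every fold. The sketch below uses scikit-learn's make_pipeline and GridSearchCV; the names pipe, search, and best_k_pipe are my own, and the selected k may differ from the value reported above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refit on the training part of every CV fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_k_pipe = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k: {best_k_pipe}, CV Accuracy: {search.best_score_:.4f}")
```

On a dataset this small the difference is usually minor, but the pipeline form also keeps the scaler and the classifier together as one object for later prediction.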

Final Model Evaluation

python
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_sc, y_train)
y_pred = best_knn.predict(X_test_sc)

print(f"Test Accuracy: {best_knn.score(X_test_sc, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
Test Accuracy: 0.9722
              precision    recall  f1-score   support

     class_0       1.00      0.93      0.97        14
     class_1       0.93      1.00      0.96        14
     class_2       1.00      1.00      1.00         8

    accuracy                           0.97        36

97.2% accuracy on the 36 test wines, with one misclassification: a class_0 wine predicted as class_1. Class 2 is perfectly separated on this split, which suggests it has a distinct chemical signature.
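The report's figures follow directly from the counts. As a quick check, this sketch rebuilds the confusion matrix implied by the support column and the single class_0 → class_1 error (the matrix is reconstructed from the report, not printed by the original code) and recomputes precision, recall, and accuracy.

```python
import numpy as np

# Confusion matrix implied by the report (rows = true class, cols = predicted).
# Reconstructed from the support counts and the one class_0 -> class_1 error.
cm = np.array([
    [13, 1, 0],   # class_0: 13 correct, 1 predicted as class_1
    [0, 14, 0],   # class_1: all 14 correct
    [0, 0, 8],    # class_2: all 8 correct
])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual count per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count per class
accuracy = np.trace(cm) / cm.sum()

for i in range(3):
    print(f"class_{i}: precision={precision[i]:.2f}, recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The single error lowers class_0 recall to 13/14 and class_1 precision to 14/15, which is why both round to 0.93.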

Weighted KNN — Distance-Based Voting

Uniform voting treats all neighbors equally. Distance-weighted voting weights each neighbor's vote by the inverse of its distance, so closer neighbors have more influence:

python
for weights in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=best_k, weights=weights)
    knn.fit(X_train_sc, y_train)
    acc = knn.score(X_test_sc, y_test)
    cv  = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy').mean()
    print(f"weights={weights:8s}: Test={acc:.4f}, CV={cv:.4f}")
weights=uniform : Test=0.9722, CV=0.9648
weights=distance: Test=0.9722, CV=0.9648

Equal performance here: the wine classes are well separated, so neighbor distances don't matter much for the vote. Distance weighting matters most when (1) some neighbors are much closer than others, or (2) the decision boundary is noisy and the nearest neighbor is far more informative than the k-th.

Manual trace on the house anchor (k = 3, weight = 1/distance):

  • Neighbor C: Suburban
  • Neighbor D: Urban
  • Neighbor E: Urban

Uniform: Suburban = 1, Urban = 2 → Urban (2/3 ≈ 0.667)
Distance-weighted: Suburban weight = 2.0, Urban weight = 2.0 + 0.71 = 2.71 → Urban (2.71/4.71 ≈ 0.575)

Both predict Urban, but distance weighting makes the decision less confident (0.575 vs. 0.667): the farther Urban neighbor contributes only 0.71, while Suburban neighbor C counts at its full weight of 2.0.
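The same arithmetic can be sketched in a few lines. The weights are the ones from the trace above; the helper name vote_shares is hypothetical, introduced only for this illustration.

```python
from collections import defaultdict

# (label, weight) pairs from the trace above, where weight = 1 / distance
neighbors = [("Suburban", 2.0), ("Urban", 2.0), ("Urban", 0.71)]

def vote_shares(neighbors, weighted):
    """Return each class's share of the total vote."""
    totals = defaultdict(float)
    for label, weight in neighbors:
        # Uniform voting counts every neighbor as 1; weighted voting uses 1/d
        totals[label] += weight if weighted else 1.0
    total = sum(totals.values())
    return {label: value / total for label, value in totals.items()}

uniform = vote_shares(neighbors, weighted=False)
distance = vote_shares(neighbors, weighted=True)

print({label: round(share, 3) for label, share in uniform.items()})
print({label: round(share, 3) for label, share in distance.items()})
```

This reproduces the shares in the trace up to rounding: Urban wins either way, but with a smaller margin once the distant neighbor is down-weighted.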

KNN Regressor on California Housing

python
from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X, y = data.data[:5000], data.target[:5000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

print(f"{'k':>4} | {'RMSE':>8} | {'R²':>8}")
for k in [1, 3, 5, 10, 20, 50]:
    knn_reg = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_reg.fit(X_train_sc, y_train)
    y_pred  = knn_reg.predict(X_test_sc)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2   = r2_score(y_test, y_pred)
    print(f"{k:>4} | {rmse:>8.4f} | {r2:>8.4f}")
   k |     RMSE |       R²
   1 |   0.6321 |   0.6789
   3 |   0.5812 |   0.7234
   5 |   0.5641 |   0.7389  ← best
  10 |   0.5823 |   0.7231
  20 |   0.6234 |   0.6891
  50 |   0.7012 |   0.6231

RMSE is lowest at k = 5. At k = 1, the model overfits to individual training samples (RMSE = 0.632). At k = 50, it averages over too many neighbors from different price zones (RMSE = 0.701). k = 5 is the bias-variance sweet spot.
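For intuition about what a distance-weighted regressor computes for a single query, here is a minimal NumPy sketch on a hypothetical 1-D dataset. The data, query, and variable names are made up for illustration, and this is not scikit-learn's internal implementation.

```python
import numpy as np

# Hypothetical 1-D training set, for illustration only
X_train_toy = np.array([0.0, 1.0, 2.0, 4.0])
y_train_toy = np.array([1.0, 2.0, 3.0, 5.0])
query, k = 1.25, 2

dists = np.abs(X_train_toy - query)
nearest = np.argsort(dists)[:k]          # indices of the k closest samples

# Uniform: plain mean of the neighbors' targets
uniform_pred = y_train_toy[nearest].mean()

# Distance-weighted: mean weighted by 1/d (assumes no zero distance)
weights = 1.0 / dists[nearest]
weighted_pred = np.sum(weights * y_train_toy[nearest]) / np.sum(weights)

print(f"uniform:  {uniform_pred:.3f}")   # 2.500
print(f"weighted: {weighted_pred:.3f}")  # 2.250
```

The query sits closer to x = 1 than to x = 2, so the weighted prediction is pulled toward the target of the nearer sample.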

KNN vs Linear Regression

python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_sc, y_train)
y_pred_lr = lr.predict(X_test_sc)

knn_best = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_best.fit(X_train_sc, y_train)
y_pred_knn = knn_best.predict(X_test_sc)

print(f"Linear Regression: RMSE={np.sqrt(mean_squared_error(y_test, y_pred_lr)):.4f}, "
      f"R²={r2_score(y_test, y_pred_lr):.4f}")
print(f"KNN (k=5):         RMSE={np.sqrt(mean_squared_error(y_test, y_pred_knn)):.4f}, "
      f"R²={r2_score(y_test, y_pred_knn):.4f}")
Linear Regression: RMSE=0.7321, R²=0.6134
KNN (k=5):         RMSE=0.5641, R²=0.7389

KNN beats linear regression by 12.6 R² points (0.7389 vs. 0.6134). Housing prices have strong local structure: a district's price is better predicted by its nearest neighbors in feature space, which includes latitude and longitude, than by a global linear formula. KNN exploits this locality directly; linear regression fits a single global hyperplane over the entire state of California.

Common KNN Pitfalls

| Pitfall | Effect | Fix |
|---|---|---|
| No feature scaling | Distance dominated by large-scale features | Always use StandardScaler |
| k too small | Overfit: memorizes noise | Use CV to find the optimal k |
| k too large | Underfit: ignores local structure | CV + elbow plot |
| Imbalanced classes | Majority class dominates votes | weights='distance' or stratified sampling |
| High dimensionality | Curse of dimensionality: all distances look similar | PCA before KNN |
| Large training set (>100k) | Slow inference per query | algorithm='ball_tree' or approximate NN |
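The "all distances similar" effect in the table can be seen in a small simulation. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and the helper relative_contrast is a name invented here. It compares how far the farthest point is from a query relative to the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one random query."""
    points = rng.random((n_points, dim))   # uniform in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)
contrast_high = relative_contrast(500)

print(f"d=2:   relative contrast = {contrast_low:.2f}")
print(f"d=500: relative contrast = {contrast_high:.2f}")
```

In two dimensions the farthest point is many times farther away than the nearest one; in 500 dimensions the two are nearly the same distance from the query, so "nearest" carries little information.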

KNN Hyperparameter Guide

| Hyperparameter | Values to try | Effect |
|---|---|---|
| n_neighbors | [1, 3, 5, 7, 11, 15, 21] | Bias-variance tradeoff |
| weights | uniform, distance | Distance weighting helps with outlier neighbors |
| metric | euclidean, manhattan, minkowski | Euclidean is standard; manhattan for sparse data |
| algorithm | auto, kd_tree, ball_tree, brute | auto picks an algorithm based on the training data passed to fit |
| p (Minkowski) | 1=Manhattan, 2=Euclidean | Feature interaction assumptions |
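The relationship between metric and p in the table can be shown with a short sketch of the Minkowski distance. The minkowski function below is written for illustration only and is not scikit-learn's implementation.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
manhattan = minkowski(a, b, p=1)   # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2) = 5

print(manhattan, euclidean)        # 7.0 5.0
```

With p=1 every coordinate difference counts linearly; with p=2 large differences in a single coordinate are emphasized by the squaring.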

Test Your Understanding

  1. Without scaling, KNN test accuracy is 72.2%; with scaling, 94.4%. The Wine dataset has 13 features including proline (range ~290–1680) and color_intensity (range ~1–13). Estimate how much more proline contributes to Euclidean distance vs color_intensity before scaling. What ratio does this give?

  2. The CV accuracy curve peaks at k = 7 and declines for larger k. If you plotted training accuracy (not CV) on the same chart, what would it look like? Where does training accuracy equal 1.0?

  3. Distance-weighted KNN uses w = 1/d. What happens if d = 0 (the query is identical to a training sample)? How should this edge case be handled, and what does sklearn do?

  4. KNN Regressor with k = 1 gives a test RMSE of 0.632. On the training set, k = 1 gives a perfect RMSE of 0.0. The difference between training RMSE (0) and test RMSE (0.632) is the generalization gap. Is this gap larger or smaller than for linear regression? What does the size of the gap tell you?

  5. KNN has no explicit model, so it can't tell you "which features matter most." If a domain expert asks you why a particular house is predicted to cost $400k, what can you show them from the KNN model that serves as an explanation?
