
KNN Classifier and Regressor: Full Implementation

Jun 26, 2026•6 min read•By Mohammed Vasim

Machine Learning • AI • Data Science

With KNN theory and spatial data structures covered, this post runs a KNN classifier on the Wine dataset (scikit-learn's `load_wine`) and a KNN regressor on California Housing: finding the optimal $k$, comparing distance-weighted vs uniform voting, and benchmarking against linear regression.

KNN Classifier on the Wine Dataset

python

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

wine = load_wine()
X, y = wine.data, wine.target
# Shape: (178, 13), 3 classes (cultivar types)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline — without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
print(f"Without scaling — Test Accuracy: {knn_raw.score(X_test, y_test):.4f}")

# Proper — with scaling
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

knn_sc = KNeighborsClassifier(n_neighbors=5)
knn_sc.fit(X_train_sc, y_train)
print(f"With scaling    — Test Accuracy: {knn_sc.score(X_test_sc, y_test):.4f}")

Without scaling — Test Accuracy: 0.7222
With scaling    — Test Accuracy: 0.9444

Scaling alone improves test accuracy by 22 points. The wine dataset includes features like alcohol (range 11–15) and proline (range 290–1680). Proline's range is roughly 350 times wider than alcohol's, so without scaling a typical proline difference swamps any alcohol difference in the Euclidean distance.
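To make the scale effect concrete, here is a minimal sketch with three invented wines described only by alcohol and proline. It is an illustration, not the post's pipeline: it uses min-max rescaling with the approximate ranges quoted above instead of `StandardScaler`, purely to keep the arithmetic visible, and the sample values are hypothetical.

```python
import math

# Approximate feature ranges quoted in the text (not computed from the data).
RANGES = {"alcohol": (11.0, 15.0), "proline": (290.0, 1680.0)}

# Hypothetical wines, invented for illustration: (alcohol, proline).
wines = {
    "a": (12.0, 500.0),
    "b": (14.0, 520.0),  # very different alcohol, similar proline
    "c": (12.1, 900.0),  # similar alcohol, different proline
}

def rescale(sample):
    """Min-max rescale each feature to [0, 1] using RANGES."""
    (a_lo, a_hi), (p_lo, p_hi) = RANGES["alcohol"], RANGES["proline"]
    alcohol, proline = sample
    return ((alcohol - a_lo) / (a_hi - a_lo), (proline - p_lo) / (p_hi - p_lo))

def nearest_to_a(points):
    """Return the name of the wine closest to 'a' by Euclidean distance."""
    others = [name for name in points if name != "a"]
    return min(others, key=lambda name: math.dist(points["a"], points[name]))

raw_nearest = nearest_to_a(wines)
scaled_nearest = nearest_to_a({name: rescale(s) for name, s in wines.items()})

print(f"Nearest to a (raw features):    {raw_nearest}")
print(f"Nearest to a (scaled features): {scaled_nearest}")
```

On raw features the neighbor is decided almost entirely by proline, so the 2-point alcohol gap to wine b is ignored. After rescaling, both features count relative to their own range and the nearest neighbor changes.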

Finding Optimal k

python

k_range = range(1, 31)
cv_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k = k_range[np.argmax(cv_scores)]
print(f"Best k: {best_k}, CV Accuracy: {max(cv_scores):.4f}")

Best k: 7, CV Accuracy: 0.9648

[Figure: 10-fold CV accuracy vs. $k$ for $k = 1$ to $30$. The curve peaks at $k = 7$ (CV ≈ 0.965); $k = 1$ is annotated as overfit.]

At $k = 1$, training accuracy is 100% but CV accuracy is lower (~0.88): the model memorizes training points. CV accuracy peaks at $k = 7$. Beyond $k = 15$ it declines steadily, because the vote increasingly includes neighbors from other classes.
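One caveat about the loop above: the scaler was fit on all of `X_train` before cross-validation, so each fold's validation rows contribute to the scaling statistics. A possible alternative, sketched here under the assumption that you want scaling refit inside every fold, wraps the scaler and classifier in a `Pipeline` and lets `GridSearchCV` search over $k$. The step names `scale` and `knn` and the variable `best_k_pipe` are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refit on the training part of every CV fold,
# so validation rows never influence the scaling statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 31))},
    cv=10,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k_pipe = search.best_params_["knn__n_neighbors"]
print(f"Best k: {best_k_pipe}, CV Accuracy: {search.best_score_:.4f}")
print(f"Test Accuracy: {search.score(X_test, y_test):.4f}")
```

The selected $k$ and the scores may differ slightly from the numbers reported above, since the folds see different scaling statistics.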

Final Model Evaluation

python

best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_sc, y_train)
y_pred = best_knn.predict(X_test_sc)

print(f"Test Accuracy: {best_knn.score(X_test_sc, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

Test Accuracy: 0.9722

              precision  recall  f1-score  support
     class_0       1.00    0.93      0.96       14
     class_1       0.93    1.00      0.97       14
     class_2       1.00    1.00      1.00        8
    accuracy                         0.97       36

97.2% accuracy. One misclassification: a class_0 wine predicted as class_1. Class 2 is perfectly separated — it likely has a distinct chemical signature.
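The precision and recall figures follow directly from that single error. The sketch below rebuilds them from a confusion matrix; the matrix itself is an assumption reconstructed from the supports in the report and the one class_0 → class_1 mistake described above, not the output of an actual run.

```python
# Confusion matrix implied by the report (rows = true class, columns = predicted).
# Reconstructed from the supports and the single class_0 -> class_1 error.
labels = ["class_0", "class_1", "class_2"]
confusion = [
    [13, 1, 0],
    [0, 14, 0],
    [0, 0, 8],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

precision, recall = {}, {}
for i, name in enumerate(labels):
    predicted_as_i = sum(row[i] for row in confusion)  # column sum
    actually_i = sum(confusion[i])                     # row sum
    precision[name] = confusion[i][i] / predicted_as_i
    recall[name] = confusion[i][i] / actually_i
    print(f"{name}: precision={precision[name]:.2f}, recall={recall[name]:.2f}")

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
```

The one error costs class_0 some recall (13 of 14 found) and class_1 some precision (14 of 15 predictions correct), which is why the two rows mirror each other.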

Weighted KNN — Distance-Based Voting

Uniform voting treats all $k$ neighbors equally. Distance-weighted voting gives closer neighbors proportionally more influence:

$\hat{y} = \arg\max_{c} \sum_{i \in \text{kNN}} \mathbb{1}[y_{i} = c] \cdot w_{i}, \quad w_{i} = \frac{1}{d(q, x_{i})}$

python

for weights in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=best_k, weights=weights)
    knn.fit(X_train_sc, y_train)
    acc = knn.score(X_test_sc, y_test)
    cv  = cross_val_score(knn, X_train_sc, y_train, cv=10, scoring='accuracy').mean()
    print(f"weights={weights:8s}: Test={acc:.4f}, CV={cv:.4f}")

weights=uniform : Test=0.9722, CV=0.9648
weights=distance: Test=0.9722, CV=0.9648

Equal performance here — the wine classes are well-separated so neighbor distances don't matter much for the vote. Distance weighting matters most when: (1) some neighbors are much closer than others, or (2) the decision boundary is noisy and the nearest neighbor is far more informative than the $k$ -th.

Manual trace on the house anchor (query $q = [3.0, 3]$ , $k = 3$ ):

Neighbor C ( $d = 0.5$ , Suburban): $w_{C} = 1/0.5 = 2.0$
Neighbor D ( $d = 0.5$ , Urban): $w_{D} = 1/0.5 = 2.0$
Neighbor E ( $d = 1.41$ , Urban): $w_{E} = 1/1.41 = 0.71$

Uniform: Suburban=1, Urban=2 → Urban (2/3 ≈ 0.667)
Distance-weighted: Suburban weight=2.0, Urban weight=2.0+0.71=2.71 → Urban (2.71/4.71 ≈ 0.575)

Both predict Urban, but distance weighting makes the decision less confident: E is farther away, so its vote counts for only 0.71 instead of 1, and Urban's share of the vote falls from 0.667 to 0.575.
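The trace can be reproduced with a few lines of plain Python. This is a minimal sketch of the voting rule only, with a hypothetical helper named `knn_vote`; it assumes the neighbors have already been found and that every distance is strictly positive.

```python
from collections import defaultdict

def knn_vote(neighbors, weighted):
    """Return (winning label, its share of the total vote).

    neighbors: list of (distance, label) pairs for the k nearest points.
    weighted:  if True each vote is w = 1/d, otherwise every vote is 1.
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        if distance <= 0:
            raise ValueError("this sketch assumes strictly positive distances")
        totals[label] += 1.0 / distance if weighted else 1.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

# Neighbors C, D, E from the trace: (distance, label).
neighbors = [(0.5, "Suburban"), (0.5, "Urban"), (1.41, "Urban")]

uniform_label, uniform_share = knn_vote(neighbors, weighted=False)
weighted_label, weighted_share = knn_vote(neighbors, weighted=True)

print(f"uniform : {uniform_label} ({uniform_share:.3f})")
print(f"distance: {weighted_label} ({weighted_share:.3f})")
```

The share returned here is the winning class's fraction of the total vote weight, which is the same quantity shown in parentheses in the trace.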

KNN Regressor on California Housing

python

from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X, y = data.data[:5000], data.target[:5000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

print(f"{'k':>4} | {'RMSE':>8} | {'R²':>8}")
for k in [1, 3, 5, 10, 20, 50]:
    knn_reg = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn_reg.fit(X_train_sc, y_train)
    y_pred  = knn_reg.predict(X_test_sc)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2   = r2_score(y_test, y_pred)
    print(f"{k:>4} | {rmse:>8.4f} | {r2:>8.4f}")

   k |     RMSE |       R²
   1 |   0.6321 |   0.6789
   3 |   0.5812 |   0.7234
   5 |   0.5641 |   0.7389  ← best
  10 |   0.5823 |   0.7231
  20 |   0.6234 |   0.6891
  50 |   0.7012 |   0.6231

Among the values tried, $k = 5$ gives the lowest RMSE. At $k = 1$ the model overfits to individual training samples (RMSE=0.632). At $k = 50$ it averages over too many neighbors from different price zones (RMSE=0.701). On this split, $k = 5$ is the bias-variance sweet spot.
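The mechanics behind that trend are easy to see on a toy problem. The following sketch implements KNN regression for a single feature, with a hypothetical helper `knn_predict` and invented data; it assumes the query never coincides with a training point when distance weighting is on.

```python
def knn_predict(train, query, k, weighted):
    """Predict as the (optionally distance-weighted) mean of the k nearest targets.

    train: list of (x, y) pairs with scalar x; query: scalar.
    """
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    if weighted:
        weights = [1.0 / abs(x - query) for x, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Invented data: y = x, with one distant point at x = 10.
train = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (10, 10.0)]
query = 2.4

pred_k1 = knn_predict(train, query, k=1, weighted=True)
pred_k2 = knn_predict(train, query, k=2, weighted=True)
pred_k5 = knn_predict(train, query, k=5, weighted=False)

print(f"k=1, distance: {pred_k1:.2f}")  # copies the single nearest target
print(f"k=2, distance: {pred_k2:.2f}")  # interpolates between the two nearest
print(f"k=5, uniform : {pred_k5:.2f}")  # global mean, pulled up by x = 10
```

With $k = 1$ the prediction is whatever the nearest sample says, noise included. With $k$ equal to the training set size and uniform weights it collapses to the global mean, which is the underfitting end of the table above.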

KNN vs Linear Regression

python

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_sc, y_train)
y_pred_lr = lr.predict(X_test_sc)

knn_best = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_best.fit(X_train_sc, y_train)
y_pred_knn = knn_best.predict(X_test_sc)

print(f"Linear Regression: RMSE={np.sqrt(mean_squared_error(y_test, y_pred_lr)):.4f}, "
      f"R²={r2_score(y_test, y_pred_lr):.4f}")
print(f"KNN (k=5):         RMSE={np.sqrt(mean_squared_error(y_test, y_pred_knn)):.4f}, "
      f"R²={r2_score(y_test, y_pred_knn):.4f}")

Linear Regression: RMSE=0.7321, R²=0.6134
KNN (k=5):         RMSE=0.5641, R²=0.7389

KNN beats linear regression by about 12.6 points of R² (0.7389 vs 0.6134). Housing prices have strong local structure: a district's price is better predicted by its nearest neighbors in feature space, which includes latitude and longitude, than by a global linear formula. KNN exploits this locality directly; linear regression fits a single global hyperplane to the whole dataset.
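A toy case shows why a local method can win. In the sketch below the target is $y = |x|$, an invented stand-in for "local structure" that has nothing to do with the housing data. A least-squares line cannot bend at zero, while a 2-nearest-neighbor average follows the shape. All helper names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_mean(xs, ys, query, k):
    """Uniform KNN regression: mean target of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

train_x = [float(x) for x in range(-5, 6)]  # -5, -4, ..., 5
train_y = [abs(x) for x in train_x]
test_x = [x + 0.5 for x in train_x[:-1]]    # midpoints -4.5, ..., 4.5
test_y = [abs(x) for x in test_x]

slope, intercept = fit_line(train_x, train_y)
lin_pred = [slope * x + intercept for x in test_x]
knn_pred = [knn_mean(train_x, train_y, x, k=2) for x in test_x]

lin_rmse = rmse(test_y, lin_pred)
knn_rmse = rmse(test_y, knn_pred)
print(f"Linear RMSE: {lin_rmse:.3f}")
print(f"KNN RMSE:    {knn_rmse:.3f}")
```

The comparison reverses when the true relationship really is linear and the data is sparse or noisy, so this is an argument about fit to structure, not a general ranking of the two models.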

Common KNN Pitfalls

| Pitfall | Effect | Fix |
| --- | --- | --- |
| No feature scaling | Distance dominated by large-scale features | Always use `StandardScaler` |
| $k$ too small ($k = 1$) | Overfit: memorizes noise | Use CV to find optimal $k$ |
| $k$ too large | Underfit: ignores local structure | CV + elbow plot |
| Imbalanced classes | Majority class dominates votes | `weights='distance'` or stratified sampling |
| High dimensionality ($d > 20$) | Curse of dimensionality: all distances similar | PCA before KNN |
| Large $n$ (>100k) | Slow inference, $O(n)$ per query | `algorithm='ball_tree'` or approximate NN |
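The "all distances similar" effect in the dimensionality row can be checked numerically. This sketch draws uniform random points in the unit cube (an assumption chosen for simplicity, since real features are rarely uniform) and compares the nearest and farthest distance from one query as the dimension grows. The helper name is hypothetical.

```python
import math
import random

def nearest_to_farthest_ratio(n_points, n_dims, rng):
    """Ratio of the nearest to the farthest distance from one random query."""
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

rng = random.Random(0)  # fixed seed so the run is repeatable
ratios = {d: nearest_to_farthest_ratio(500, d, rng) for d in (2, 20, 200)}

for d, ratio in ratios.items():
    print(f"d={d:>3}: nearest/farthest = {ratio:.2f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, so the vote carries little information. That is the motivation for reducing dimensionality before running KNN.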

KNN Hyperparameter Guide

| Hyperparameter | Values to try | Effect |
| --- | --- | --- |
| `n_neighbors` | [1, 3, 5, 7, 11, 15, 21] | Bias-variance tradeoff |
| `weights` | uniform, distance | Distance weighting helps with outlier neighbors |
| `metric` | euclidean, manhattan, minkowski | Euclidean standard; manhattan for sparse data |
| `algorithm` | auto, kd_tree, ball_tree, brute | Auto selects based on $n$ and $d$ |
| `p` (Minkowski) | 1=Manhattan, 2=Euclidean | Feature interaction assumptions |
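The `p` row refers to the order of the Minkowski distance, $d_{p}(a, b) = \left( \sum_{j} |a_{j} - b_{j}|^{p} \right)^{1/p}$. As a small sketch of what changing `p` does, here is the formula written out by hand; the function `minkowski` is defined here for illustration and is not scikit-learn's implementation.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

origin, point = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(origin, point, p=1)  # |3| + |4|
euclidean = minkowski(origin, point, p=2)  # sqrt(3^2 + 4^2)

print(f"p=1 (Manhattan): {manhattan:.1f}")
print(f"p=2 (Euclidean): {euclidean:.1f}")
```

With `p=1` every coordinate difference adds linearly, while `p=2` squares differences first, so a single large gap in one feature counts for more.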

Test Your Understanding

1. Without scaling, KNN test accuracy is 72.2%; with scaling, 94.4%. The Wine dataset has 13 features including proline (range ~290–1680) and color_intensity (range ~1–13). Estimate the ratio of proline's contribution to color_intensity's contribution to the Euclidean distance before scaling.
2. The CV accuracy curve peaks at $k = 7$ and declines for larger $k$. If you plotted training accuracy (not CV) on the same chart, what would it look like? Where does training accuracy equal 1.0?
3. Distance-weighted KNN uses $w_{i} = 1/d_{i}$. What happens if $d_{i} = 0$ (the query is identical to a training sample)? How should this edge case be handled, and what does sklearn do?
4. KNN Regressor with $k = 1$ gives a test RMSE of 0.632. On the training set, $k = 1$ gives a perfect RMSE of 0.0. The difference between training RMSE (0) and test RMSE (0.632) is the generalization gap. Is this gap larger or smaller than for linear regression? What does the size of the gap tell you?
5. KNN has no explicit model, so it can't tell you "which features matter most." If a domain expert asks why house $A$ is predicted to cost \$400k, what can you show them from the KNN model that serves as an explanation?
