Instance-Based vs Model-Based Learning
After training, does the algorithm throw the data away or keep it? That single question separates the two fundamental strategies in ML. Getting the answer wrong means deploying a model that's either too slow to serve predictions or too rigid to capture the real pattern in the data.
What "Learning" Means Here
Mitchell's definition: a program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E. For house price prediction: E = 6 training examples, T = predict price, P = MSE.
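For concreteness, the performance measure P in this example can be written out. This is a sketch in the usual notation; $\hat{y}_i$ (the prediction for example $i$) and $n = 6$ are labels introduced here for illustration:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad n = 6
```

A lower MSE on the price task means better performance in Mitchell's sense.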
The key question isn't whether it learns — it's what it keeps from that experience. Does it compress the data into a compact model, or does it keep the raw examples?
Anchor dataset:
X = [650, 850, 1100, 1400, 1600, 1900] # sq_ft
y = [180, 220, 280, 340, 370, 430] # price in $k
# Query: predict price for sq_ft = 1000
Model-Based Learning
Model-based (eager) learning fits a parametric function to the training data and extracts a compact summary: the parameters. Once training ends, the training data can be discarded.
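Linear regression on our 6-point anchor yields $w_0 \approx 53.33$, $w_1 = 0.2$. At inference, for sq_ft = 1000 (prices in \$k):
$$\hat{y} = w_0 + w_1 x = 53.33 + 0.2 \times 1000 \approx 253.3$$
The six training rows are gone. Only two numbers remain: $w_0$ and $w_1$. For a second query, sq_ft = 800:
$$\hat{y} = 53.33 + 0.2 \times 800 \approx 213.3$$
Inference time is $O(p)$: a single dot product. Memory is $O(p)$: just the weight vector. For $p = 1000$ features, that's about 1000 numbers regardless of whether the training set had 100 or 100 million samples.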
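The fit on the anchor data can be reproduced with the closed-form least-squares formulas for a single feature. This is a minimal sketch in plain Python; `fit_line` and `predict` are illustrative names, not part of any library.

```python
# Anchor dataset from the article
X = [650, 850, 1100, 1400, 1600, 1900]   # sq_ft
y = [180, 220, 280, 340, 370, 430]       # price in $k

def fit_line(xs, ys):
    """Closed-form least squares for one feature: returns (w0, w1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1

w0, w1 = fit_line(X, y)

def predict(sq_ft):
    # Only w0 and w1 are used here; X and y are no longer needed.
    return w0 + w1 * sq_ft

print(round(w0, 2), round(w1, 4))   # 53.33 0.2
print(round(predict(1000), 1))      # 253.3
print(round(predict(800), 1))       # 213.3
```

After `fit_line` returns, `X` and `y` could be deleted without affecting `predict`, which is the sense in which the training data is discarded.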
Algorithms in this class: Linear Regression, Logistic Regression, SVM, Neural Networks, Naive Bayes, Decision Trees (once built).
Instance-Based Learning (Lazy Learning)
Instance-based (lazy) learning memorizes the entire training set. There is no fitting phase — the "training" step is just storing the data. All computation is deferred to inference.
KNN ($k = 2$) on the same anchor for query sq_ft = 1000:
Distances to each training point:
| Training sq_ft | Distance from 1000 | Price (\$k) |
|---|---|---|
| 650 | $\lvert 1000 - 650 \rvert = 350$ | 180 |
| 850 | $\lvert 1000 - 850 \rvert = 150$ ✓ | 220 |
| 1100 | $\lvert 1000 - 1100 \rvert = 100$ ✓ | 280 |
| 1400 | $\lvert 1000 - 1400 \rvert = 400$ | 340 |
| 1600 | $\lvert 1000 - 1600 \rvert = 600$ | 370 |
| 1900 | $\lvert 1000 - 1900 \rvert = 900$ | 430 |
Two nearest: sq_ft = 850 (price 220) and sq_ft = 1100 (price 280). Averaging their prices gives the prediction:
$$\hat{y} = \frac{220 + 280}{2} = 250$$
For a second query, sq_ft = 800:
Nearest neighbors: 650 (price 180) and 850 (price 220). Averaging their prices gives the prediction:
$$\hat{y} = \frac{180 + 220}{2} = 200$$
Linear regression gave \$213.3k for this same query; KNN gives \$200k. These are different predictions, not by coincidence, but structurally. KNN interpolates locally from the two closest neighbors. Linear regression fits a single global line. On data that isn't perfectly linear, the two will generally disagree, most visibly for queries far from any training point.
Inference time is $O(n)$: a distance must be computed to every stored training point (and each distance itself costs $O(p)$ when there are $p$ features). Memory is $O(n)$: all training data must be kept. At millions of samples, that's expensive.
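The two KNN predictions above can be checked with a brute-force sketch. It assumes an unweighted average of the $k$ nearest prices and absolute difference as the distance; `knn_predict` is an illustrative name.

```python
# Anchor dataset from the article
X = [650, 850, 1100, 1400, 1600, 1900]   # sq_ft
y = [180, 220, 280, 340, 370, 430]       # price in $k

def knn_predict(query, xs, ys, k=2):
    """Average the prices of the k stored points closest to `query`."""
    # Every call looks at all stored points: there is nothing precomputed.
    by_distance = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))
    nearest = by_distance[:k]
    return sum(price for _, price in nearest) / k

print(knn_predict(1000, X, y))  # 250.0
print(knn_predict(800, X, y))   # 200.0
```

Sorting is used here for brevity; it costs $O(n \log n)$ per query, and a single pass that tracks the $k$ smallest distances would bring that down to a linear scan.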
[Figure: side-by-side comparison. Model-based: train by fitting $w_0$, $w_1$ (slow); infer with $\hat{y} = w \cdot x$, fast $O(p)$; memory $O(p)$ params. Instance-based: train by storing the data (instant); infer by searching all $n$ points, slow $O(n)$; memory $O(n)$ training rows.]
When the Difference Matters: Four Scenarios
1. Large dataset, real-time inference (loan approval at a bank, large $n$): KNN must compute $n$ distances per query, which can mean hundreds of milliseconds per decision. Use model-based (logistic regression): inference is one dot product, typically sub-millisecond.
2. Streaming data that changes over time (user preference prediction): Instance-based wins, since new examples are simply appended without retraining. Model-based typically requires periodic retraining, which may take hours for large models.
3. Non-linear local patterns (housing prices by neighborhood): KNN captures the local cluster around each query point. A single global linear model may underfit neighborhoods that don't follow the citywide trend.
4. Interpretability required (medical diagnosis): Model-based (logistic regression, decision tree), because the physician can inspect the coefficients or rules. KNN offers no such summary: "your nearest neighbors were mostly positive cases" does not say which features drove the result.
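Scenario 2 can be sketched in a few lines: for an instance-based learner, absorbing a new example is a list append. The class name `LazyKNN` and the new observation (sq_ft = 1000, price = 260) are hypothetical, chosen only to show the effect on the anchor data.

```python
class LazyKNN:
    """Instance-based regressor: 'training' is storing the rows."""

    def __init__(self, xs, ys, k=2):
        self.xs, self.ys, self.k = list(xs), list(ys), k

    def add(self, x, price):
        # Incorporating new data is a single append; nothing is refit.
        self.xs.append(x)
        self.ys.append(price)

    def predict(self, query):
        pairs = sorted(zip(self.xs, self.ys), key=lambda p: abs(p[0] - query))
        return sum(price for _, price in pairs[: self.k]) / self.k

model = LazyKNN([650, 850, 1100, 1400, 1600, 1900],
                [180, 220, 280, 340, 370, 430])
print(model.predict(1000))   # 250.0
model.add(1000, 260)         # hypothetical new observation
print(model.predict(1000))   # 270.0
```

A model-based learner would instead have to rerun its fitting routine (or apply an incremental update rule) before the new example could influence predictions.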
Generalization: The Core Tradeoff
Model-based generalizes via the parametric assumption. If the true relationship is linear and you have very little data, a linear model generalizes from 3 points to any $x$. The downside: if the assumption is wrong, it's wrong everywhere, a systematic global error.
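The effect of the parametric assumption is easiest to see outside the training range. The sketch below assumes the least-squares line for the anchor data ($w_0 = 160/3 \approx 53.33$, $w_1 = 0.2$) and an unweighted 2-nearest-neighbor average; the query sq_ft = 3000 is a hypothetical value chosen to lie well beyond the largest training point.

```python
X = [650, 850, 1100, 1400, 1600, 1900]   # sq_ft
y = [180, 220, 280, 340, 370, 430]       # price in $k

W0, W1 = 160 / 3, 0.2   # least-squares fit of the anchor data (w0 ≈ 53.33)

def line_predict(sq_ft):
    return W0 + W1 * sq_ft

def knn_predict(sq_ft, k=2):
    nearest = sorted(zip(X, y), key=lambda pair: abs(pair[0] - sq_ft))[:k]
    return sum(price for _, price in nearest) / k

query = 3000  # hypothetical query, far beyond the largest training point (1900)
print(round(line_predict(query), 1))  # 653.3
print(knn_predict(query))             # 400.0
```

The line keeps extending the trend, while KNN can never predict above the average of its two largest stored prices. Which behavior is preferable depends on whether the linear assumption actually holds out there.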
Instance-based generalizes by similarity: a new point inherits the labels of its nearest training points. No assumption about the global shape. The downside: in high dimensions, "nearest" stops being meaningful. When the number of features $p$ is large, even the "closest" training points can be geometrically far from the query, a problem known as the curse of dimensionality.
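One way to see "nearest" losing meaning is to compare the nearest and farthest distances from a query as $p$ grows. The sketch below uses uniformly random points in the unit cube, an assumption made purely for illustration; the function name and the sample size of 200 are likewise arbitrary choices.

```python
import math
import random

def nearest_to_farthest_ratio(p, n=200, seed=0):
    """min distance / max distance from a random query to n random points in [0, 1]^p."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(p)]
    dists = [
        math.dist(query, [rng.random() for _ in range(p)])
        for _ in range(n)
    ]
    return min(dists) / max(dists)

for p in (2, 10, 100, 1000):
    print(p, round(nearest_to_farthest_ratio(p), 3))
```

The ratio should be small for low $p$ and climb toward 1 as $p$ grows, meaning the nearest neighbor is barely nearer than the farthest point.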
Comparison Table
| Aspect | Model-Based | Instance-Based |
|---|---|---|
| Training phase | Fits parameters | Stores data (no fitting) |
| Inference cost | $O(p)$, constant in $n$ | $O(n)$, grows with data |
| Memory cost | $O(p)$, compact | $O(n)$, grows with data |
| Assumptions | Global: data follows a parametric form | Local: nearby points are similar |
| Adapts to new data | Requires retraining | Just add new row to store |
| Interpretable? | Yes — inspect weights | No — result depends on neighbors |
| Handles local patterns? | Poorly (single global fit) | Yes — local shape captured |
Decision Guide
| Condition | Prefer |
|---|---|
| Fast inference needed | Model-based |
| Training data changes frequently | Instance-based |
| Data has global linear/polynomial structure | Model-based |
| Data has local clusters or non-linear patterns | Instance-based |
| High dimensionality (large $p$) | Model-based |
| Small dataset, low dimensionality | Either |
Related Concepts and Honest Limitations
KD-trees and ball-trees reduce KNN inference from $O(n)$ to roughly $O(\log n)$ in low dimensions, but this speedup evaporates as the number of features grows, where the tree search degenerates toward checking every point. That's why approximate nearest-neighbor methods (HNSW, and libraries such as FAISS) are used in high-dimensional retrieval systems.
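The idea behind tree indexes can be seen in one dimension, where a sorted array with binary search plays the same role. This is not a KD-tree, only a simplified analogue for the single-feature anchor data; `knn_sorted` is a hypothetical helper name.

```python
import bisect

X_sorted = [650, 850, 1100, 1400, 1600, 1900]   # sq_ft, already sorted
y_sorted = [180, 220, 280, 340, 370, 430]       # price in $k, same order

def knn_sorted(query, xs, ys, k=2):
    """k-NN average in 1-D using binary search instead of a full scan."""
    i = bisect.bisect_left(xs, query)   # O(log n) to locate the query
    lo, hi = i - 1, i                   # candidates just left and right of it
    chosen = []
    while len(chosen) < k and (lo >= 0 or hi < len(xs)):
        # Take the closer of the two candidates; fall back when one side runs out.
        take_left = hi >= len(xs) or (
            lo >= 0 and query - xs[lo] <= xs[hi] - query
        )
        if take_left:
            chosen.append(ys[lo])
            lo -= 1
        else:
            chosen.append(ys[hi])
            hi += 1
    return sum(chosen) / len(chosen)

print(knn_sorted(1000, X_sorted, y_sorted))  # 250.0
print(knn_sorted(800, X_sorted, y_sorted))   # 200.0
```

Each query costs one binary search plus $k$ steps outward, rather than a distance to every stored point. KD-trees and ball-trees generalize this "skip most of the data" idea to several features.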
Model-based learning isn't automatically safe from high-dimensional failure either: with more features than samples ($p > n$), OLS has no unique solution (the matrix $X^\top X$ is singular), and even regularized models need careful treatment. The curse of dimensionality hits everyone, just differently.
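The $p > n$ problem can be seen on a toy example with no linear algebra library at all. The data and both weight vectors below are hypothetical, constructed so that two different models fit the same two samples exactly.

```python
# Hypothetical toy data: n = 2 samples, p = 3 features (p > n).
X = [[1, 0, 1],
     [0, 1, 1]]
y = [1, 2]

def predict(weights, rows):
    """Dot product of the weight vector with each row."""
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]

w_a = [1, 2, 0]
w_b = [0, 1, 1]

print(predict(w_a, X))  # [1, 2]
print(predict(w_b, X))  # [1, 2]
```

Both weight vectors reach zero training error, so least squares alone has no basis for choosing between them. A regularizer such as ridge breaks the tie by preferring small weights.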
Test Your Understanding
- For the anchor dataset, compute the KNN ($k = 2$) prediction for sq_ft = 1250. Then compute the linear regression prediction for the same query. Which is higher, and why does the difference arise?
- You add a new training sample (sq_ft = 1050, price = 265) to the dataset. How does each approach handle this? Which requires more work?
- An instance-based model "memorizes" training data exactly. Can it overfit? What would overfitting look like for $k = 1$?
- For a 100-feature dataset with $n$ samples, you're choosing between logistic regression (model-based) and KNN (instance-based). What factors push you toward logistic regression?
- KNN inference cost is quoted here as $O(n)$ and a linear model's as $O(p)$. Does the KNN cost really not depend on the number of features $p$? Which grows faster with scale, and when does the crossover matter?