Types of Machine Learning
Every ML algorithm in this series fits into a broader taxonomy. Before touching a single equation, you need a clear map of what kind of problem you're solving — because choosing the wrong paradigm (clustering when you have labels, regression when your target is categorical) wastes weeks before you realize the mistake.
What Machine Learning Actually Is
Tom Mitchell's 1997 definition remains the most precise: a computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at T, as measured by P, improves with experience E.
That triplet (task, experience, performance) is what distinguishes ML from rule-based programming. In a rule-based spam filter, a human writes `if "free money" in subject: mark_as_spam`. The rule is explicit and brittle. In an ML spam filter, an algorithm processes thousands of labelled emails (experience), learns to classify messages as spam or not from the word patterns in them (task), and is judged by its misclassification rate (performance). The rules emerge from data; no human encodes them.
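To make the contrast concrete, here is a minimal sketch of both filters. The six training emails and the word-scoring scheme are invented for illustration; a real filter would use far more data and a proper classifier.

```python
from collections import Counter

def rule_based_is_spam(subject: str) -> bool:
    # Hand-written rule: explicit, and blind to any phrasing its author did not anticipate.
    return "free money" in subject.lower()

def train_word_scores(emails):
    """Learn a score per word from labelled (text, is_spam) pairs.

    score(word) = (spam emails containing it) - (non-spam emails containing it)
    """
    scores = Counter()
    for text, is_spam in emails:
        for word in set(text.lower().split()):
            scores[word] += 1 if is_spam else -1
    return scores

def learned_is_spam(scores, text: str) -> bool:
    # Unknown words score 0 because Counter returns 0 for missing keys.
    return sum(scores[w] for w in set(text.lower().split())) > 0

# Toy "experience": invented examples, not real data.
training_emails = [
    ("win cash prize now", True),
    ("claim your cash reward", True),
    ("cheap prize inside", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
    ("your invoice for march", False),
]
scores = train_word_scores(training_emails)

print(rule_based_is_spam("claim cash prize"))       # the fixed rule misses this phrasing
print(learned_is_spam(scores, "claim cash prize"))  # the learned scores catch it
```

Nobody wrote a rule about "cash" or "prize"; those weights came out of the labelled examples.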
Three ingredients are always present: data (the experience), a learning algorithm (the mechanism), and a performance measure (what "better" means). Remove any one and you don't have machine learning — you have either a lookup table, a random process, or an undefined objective.
Supervised Learning
Supervised learning is the paradigm where training data consists of labelled input-output pairs $(x_i, y_i)$. The goal is to learn a function $f$ that maps inputs to outputs. At test time, you hand the model an unseen $x$ and it predicts $\hat{y}$.
The data flow is: raw data → feature matrix → model → prediction → compare to true label → compute error → update model. Most supervised algorithms repeat this loop many times; a few, such as closed-form least squares, collapse it into a single solve.
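The predict, compare, update cycle can be sketched with plain gradient descent on a one-feature linear model. The data, learning rate, and epoch count below are arbitrary choices for the demonstration.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y ≈ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [w * x + b for x in xs]                 # model -> prediction
        errors = [p - y for p, y in zip(preds, ys)]     # compare to true label
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w                                # update model
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1, no noise
w, b = fit_line(xs, ys)
print(round(w, 4), round(b, 4))
```

Because the toy data has no noise, the loop should recover a slope close to 2 and an intercept close to 1.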
Regression: when $y$ is continuous. Predicting the sale price of a house from its square footage, number of bedrooms, and neighborhood. The model outputs a real number, and the error is measured in original units (dollars off, degrees off).
Classification: when $y$ is a discrete class. Classifying an email as spam or not-spam. Recognizing a handwritten digit (0–9). The model outputs a class label or a probability distribution over classes.
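The difference in output type is easy to see in code. A regressor returns a single real number; a classifier typically turns raw per-class scores into a probability distribution and reports the most likely class. The sketch below uses a softmax for that conversion; the scores are made-up values standing in for a trained model's output.

```python
import math

def softmax(scores):
    """Convert arbitrary real scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores, class_names):
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical raw scores for one email.
label, probs = predict_label([0.2, 1.7], ["not-spam", "spam"])
print(label, [round(p, 3) for p in probs])
```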
Key algorithms in supervised learning: Linear Regression, Logistic Regression, Decision Trees, SVMs, Neural Networks, KNN. Each appears later in this series with full derivations.
Unsupervised Learning
Unsupervised learning removes the labels. You have a feature matrix $X$ but no corresponding $y$. The goal shifts from "learn to predict $y$" to "find structure in $X$."
Clustering groups similar data points together without being told what the groups should be. K-Means applied to customer purchase histories might discover three clusters — frequent buyers, seasonal shoppers, and one-time purchasers — without anyone defining those categories in advance.
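Here is a minimal one-dimensional K-Means sketch in that spirit. The purchase counts and the starting centroids are invented; real K-Means usually starts from random or k-means++ initial centroids and works in many dimensions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical purchases per year for nine customers.
purchases = [1, 1, 2, 5, 6, 7, 24, 26, 30]
centroids, clusters = kmeans_1d(purchases, centroids=[1.0, 6.0, 25.0])
print(clusters)
```

The algorithm is only told to find three groups. The names "one-time", "seasonal", and "frequent" are something a human attaches afterwards.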
Dimensionality reduction compresses a high-dimensional dataset into fewer dimensions while preserving as much structure as possible. PCA can project 100 gene-expression features onto 2 principal components, the directions along which the data varies most. t-SNE and UMAP are nonlinear alternatives used mainly for visualization; they aim to preserve neighborhood structure rather than variance.
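As a small-scale illustration, the sketch below runs PCA from two dimensions down to one, using the closed-form eigen-decomposition of a 2×2 covariance matrix. The four points are invented, and a real project would use a linear algebra library rather than this hand-rolled version.

```python
import math

def first_principal_component(points):
    """Return (direction, explained_variance_ratio, 1-D projections) for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix.
    top = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if sxy != 0:
        vx, vy = sxy, top - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, top / (sxx + syy), projected

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]   # nearly on a line
direction, explained, projected = first_principal_component(points)
print(direction, round(explained, 3))
```

Because these points lie almost on a line, one component retains nearly all of the variance. With real high-dimensional data, the retained fraction is something you check rather than assume.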
Density estimation and anomaly detection — learn the distribution of normal data, then flag points that don't fit. A transaction that looks nothing like the rest of your history is anomalous.
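One simple way to act on this idea is to fit a Gaussian to past values and flag anything far from the mean. The sketch below does that with a z-score; the transaction amounts and the threshold of 3 standard deviations are assumptions chosen for the example, not a recommended setting.

```python
import statistics

def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of 'normal' behaviour."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, mean, std, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean.
    return abs(value - mean) / std > threshold

# Hypothetical past transaction amounts for one account.
history = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 44.0, 43.0]
mean, std = fit_gaussian(history)

print(is_anomaly(41.0, mean, std))    # a typical amount
print(is_anomaly(950.0, mean, std))   # nothing like the rest of the history
```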
The fundamental difference from supervised learning: there is no ground truth to compare against. "Better" is measured by internal criteria (cluster cohesion, reconstruction error, likelihood) rather than a human-provided label.
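Cluster cohesion, for instance, can be scored as the within-cluster sum of squared distances to each cluster's centroid (often called inertia). The sketch below compares two hand-made groupings of the same invented numbers; lower is tighter.

```python
def inertia(clusters):
    """Sum of squared distances from each point to its own cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((p - centroid) ** 2 for p in cluster)
    return total

# Two candidate groupings of the same nine values (made up for illustration).
tight = [[1, 1, 2], [5, 6, 7], [24, 26, 30]]
loose = [[1, 1, 2, 5], [6, 7, 24], [26, 30]]

print(inertia(tight) < inertia(loose))
```

No label was consulted: the score depends only on the geometry of the points. That is also its weakness, since inertia always falls as you add clusters.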
Semi-Supervised and Self-Supervised Learning
Semi-supervised learning uses a small labelled set alongside a large unlabelled set. In medical imaging, labelling an MRI scan requires a radiologist — expensive and slow. You might have 200 labelled scans and 50,000 unlabelled ones. Semi-supervised methods use the unlabelled scans to improve the model trained on the 200.
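One common semi-supervised recipe is self-training: fit on the labelled data, label the unlabelled points the model is confident about, add them to the training set, and repeat. The sketch below uses a nearest-centroid classifier on a single invented feature; the `margin` confidence rule and the data are illustrative assumptions, and the sketch assumes at least two classes.

```python
def class_centroids(labelled):
    """Mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    return min(cents, key=lambda y: abs(x - cents[y]))

def self_train(labelled, unlabelled, margin=1.0, rounds=5):
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        cents = class_centroids(labelled)
        confident, remaining = [], []
        for x in pool:
            dists = sorted(abs(x - c) for c in cents.values())
            # Confident when the nearest centroid is clearly closer than the runner-up.
            if dists[1] - dists[0] >= margin:
                confident.append((x, predict(cents, x)))
            else:
                remaining.append(x)
        if not confident:
            break
        labelled += confident          # pseudo-labels join the training set
        pool = remaining
    return class_centroids(labelled), pool

labelled = [(1.0, "negative"), (9.0, "positive")]
unlabelled = [0.5, 1.5, 2.0, 8.0, 8.5, 10.0, 5.2]
final_centroids, still_unlabelled = self_train(labelled, unlabelled)
print(final_centroids, still_unlabelled)
```

The ambiguous point near the middle is left unlabelled rather than guessed. Self-training can reinforce its own mistakes when the confidence rule is too lenient.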
Self-supervised learning is subtler: labels are generated automatically from the data itself, without human annotation. BERT predicts masked tokens in a sentence, so the "label" for each masked word is the word that was there. SimCLR learns image representations by pulling together the representations of two augmented views of the same image and pushing apart views that come from different images. The key insight is that the data contains its own supervisory signal.
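The masking idea fits in a few lines. The sketch below turns a raw sentence into (input, label) pairs with no human annotation. It is a simplification: it masks every position in turn and splits on whitespace, whereas BERT masks a random subset of subword tokens.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked_sentence, target_word) training pairs from unlabelled text."""
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), target))   # the label is the hidden word
    return examples

for masked, target in make_masked_examples("the cat sat"):
    print(masked, "->", target)
```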
This distinction matters because most real-world data is unlabelled. Self-supervised pretraining is what makes large language models and vision transformers work at scale.
Reinforcement Learning
Reinforcement learning operates on a different structure entirely. An agent acts in an environment, receives a reward (or penalty) signal, and learns a policy — a mapping from states to actions — that maximizes cumulative reward.
Key terms: state (what the environment looks like right now), action (what the agent can do), reward (feedback after each action), policy (the strategy being learned). AlphaGo learned to play Go by playing millions of games against itself. Robot locomotion controllers learn to walk by receiving positive reward for forward progress and negative reward for falling.
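To tie the vocabulary to something runnable, here is a tabular Q-learning sketch on a made-up five-cell corridor where only the rightmost cell gives reward. The environment, the hyperparameters, and the purely random exploration are all assumptions for illustration. Q-learning is off-policy, so it can learn a greedy policy even while the agent behaves randomly.

```python
import random

N_STATES = 5           # states: corridor cells 0..4
ACTIONS = (-1, +1)     # actions: step left or step right
GOAL = N_STATES - 1    # reaching cell 4 ends the episode

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)        # behaviour: uniformly random exploration
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Temporal-difference update toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = q_learning()
# Policy: in each non-terminal state, take the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The learned policy should prefer stepping right in every cell, since that is the shortest route to the only reward.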
RL is not covered in this series — it requires dedicated treatment of Markov decision processes and temporal-difference learning. For a rigorous introduction, Sutton and Barto's Reinforcement Learning: An Introduction (freely available online) is the standard reference.
Parametric vs Non-Parametric Models
This distinction cuts across all paradigms and matters for choosing between algorithms.
Parametric models have a fixed number of parameters regardless of how much training data you have. Linear regression on $d$ features has exactly $d + 1$ parameters ($d$ weights plus one intercept); feeding it 100 samples or 10 million samples doesn't change that count. Parameters are learned during training; at inference, you only need the parameters, not the training data. These models are fast at inference and often interpretable, but constrained to whatever family of functions the parameters can express.
Non-parametric models let complexity grow with data. KNN stores every training point: with 1 million training examples, you carry 1 million data points to inference time. Kernel SVMs behave similarly, keeping a subset of the training points (the support vectors) that tends to grow with the dataset. Non-parametric models make fewer assumptions and can represent more complex functions, but they tend to be memory-intensive and slow at inference.
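The storage difference shows up directly if you count what each model has to keep after fitting. The two classes below are toy one-feature implementations written for this comparison, with a hypothetical `stored_numbers` helper; they are not any library's API.

```python
class LinearModel:
    """Parametric: after fitting, only the slope and intercept are kept."""
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.w = sxy / sxx
        self.b = my - self.w * mx
        return self

    def predict(self, x):
        return self.w * x + self.b

    def stored_numbers(self):
        return 2                        # w and b, whatever the dataset size

class OneNearestNeighbour:
    """Non-parametric: the training set itself is the model."""
    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Scans every stored point, so cost grows with the training set.
        return min(self.points, key=lambda p: abs(p[0] - x))[1]

    def stored_numbers(self):
        return 2 * len(self.points)

def make_data(n):
    xs = [float(i) for i in range(n)]
    return xs, [3.0 * x + 2.0 for x in xs]   # synthetic: y = 3x + 2

for n in (10, 1000):
    xs, ys = make_data(n)
    lin = LinearModel().fit(xs, ys)
    knn = OneNearestNeighbour().fit(xs, ys)
    print(n, lin.stored_numbers(), knn.stored_numbers())
```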
Linear regression — the subject of this series — is supervised, parametric, and regression-typed. The next post derives exactly what those parameters mean geometrically.
The General ML Workflow
Every ML project, regardless of algorithm, follows the same nine stages:
1. Collect and understand data: EDA reveals distributions, missing values, outliers, and target imbalance before any model is built.
2. Preprocess: handle missing values, encode categorical variables, scale numerical features so no single feature dominates by magnitude.
3. Feature engineering: create or transform inputs to make the signal easier for the model to extract (log transforms, interaction terms, embeddings).
4. Split into train / validation / test: the test set is held out until final evaluation; validation guides hyperparameter choices.
5. Choose and train a model: select an algorithm appropriate for the task type and fit it on the training set.
6. Evaluate on the validation set: measure performance using the task's metric (RMSE for regression, accuracy or F1 for classification).
7. Tune hyperparameters: adjust regularization strength, learning rate, tree depth, or other knobs based on validation performance.
8. Final evaluation on the held-out test set: one shot at an unbiased performance estimate; touching this set earlier invalidates it.
9. Deploy and monitor: serve the model in production, track data drift and performance degradation, retrain when needed.
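The split, train, validate, tune, and test stages can be sketched end to end on synthetic data. Everything here is an assumption made for the demonstration: the data generator, the 60/20/20 split, the one-feature ridge model, and the candidate penalty values.

```python
import math
import random

def fit_ridge_1d(xs, ys, lam):
    """Least-squares line with an L2 penalty `lam` on the slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)
    return w, my - w * mx

def rmse(model, xs, ys):
    w, b = model
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def columns(rows):
    return [r[0] for r in rows], [r[1] for r in rows]

# Stand-in for data collection: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 0.5)))

# Split: shuffle once, then carve out train / validation / test.
rng.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

# Train, evaluate, tune: pick the penalty with the lowest validation RMSE.
candidates = [0.0, 1.0, 10.0, 100.0]
val_scores = {lam: rmse(fit_ridge_1d(*columns(train), lam), *columns(val))
              for lam in candidates}
best_lam = min(val_scores, key=val_scores.get)

# Final evaluation: the test set is used exactly once, after all choices are made.
final_model = fit_ridge_1d(*columns(train), best_lam)
test_rmse = rmse(final_model, *columns(test))
print(best_lam, round(test_rmse, 3))
```

The test rows never influence `best_lam`. That separation is what makes the final number an honest estimate.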
[Figure: taxonomy of machine learning paradigms. Supervised splits into Regression (house price, temp) and Classification (spam, digit). Unsupervised splits into Clustering (customer segments), Dim. Reduction (PCA, t-SNE), and Anomaly Detect. (fraud, intrusion). Semi-Supervised (medical imaging) and RL (games, robotics) have no sub-types shown.]
[Figure: ML workflow diagram. 1. Data (EDA) → 2. Preprocess (clean, scale) → 3. Feature Eng. (transform) → 4. Split (train/val/test) → 5. Train (fit model) → 6. Evaluate (RMSE, F1) → Good enough? If no: 7. Tune hyperparameters, then back to 5. Train. If yes: 8. Test Set (final eval) → 9. Deploy (monitor).]
Learning Paradigm Comparison
| Paradigm | Labels Needed | Goal | Example Algorithm | Example Task |
|---|---|---|---|---|
| Supervised | Yes (input + output) | Learn $f: X \to y$ | Linear Regression | Predict house price |
| Unsupervised | No | Find structure in $X$ | K-Means | Customer segmentation |
| Semi-Supervised | Partial (few labels) | Leverage unlabelled data | Label Propagation | Medical image classification |
| Reinforcement | Reward signal | Learn optimal policy | Q-Learning | Game playing |
Parametric vs Non-Parametric
| Property | Parametric | Non-Parametric |
|---|---|---|
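| Parameters | Fixed count ($d + 1$ for linear regression on $d$ features) | Grows with training data |
| Training complexity | Optimize a fixed set of parameters | Varies: trivial for KNN (it only stores the data), expensive for kernel SVMs and Gaussian processes |
| Inference speed | Fast: cost is independent of training set size (a dot product for linear models) | Slow: search or kernel evaluation over the training set |
| Memory usage | Low: parameters only | High: must retain training data |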
| Flexibility | Constrained to model family | Arbitrary decision boundaries |
| Example models | Linear Regression, Logistic Regression, Neural Networks | KNN, Kernel SVM, Gaussian Processes |
Related Concepts and Honest Limitations
The taxonomy above is a teaching convenience, not a rigid partition; the boundaries blur in practice. Semi-supervised learning can be framed as supervised learning in which the labels of the unlabelled examples are treated as missing data. Self-supervised pretraining followed by supervised fine-tuning doesn't fit neatly into any single box. Representation learning cuts across all paradigms: you can learn representations supervised (ResNet), unsupervised (VAE), or self-supervised (BERT).
The honest limitation of this framing: knowing which paradigm a problem belongs to doesn't tell you which algorithm to pick. A regression problem with 50 features and 500 samples calls for a different model than one with 50 features and 50 million samples — even though both are "supervised regression." The workflow diagram above is where that decision actually happens: step 5 requires understanding your data size, feature types, interpretability requirements, and inference latency constraints.
Test Your Understanding
- A model predicts tomorrow's stock price from today's volume, price, and 20-day moving average. Which paradigm is it, and why? Which sub-type of that paradigm?
- You have 10,000 customer transactions with no fraud labels, but you know fraud accounts for roughly 0.1% of transactions. What learning paradigm would you start with, and what performance measure would you use?
- K-Nearest Neighbors stores all training points and predicts by majority vote among the $k$ closest. Is KNN parametric or non-parametric? What happens to its inference time as training set size doubles?
- Self-supervised learning is sometimes called "unsupervised" in older papers. Based on the distinction made above, why is that label inaccurate?
- In the ML workflow, why is it a mistake to look at the test set before step 8, even just to check the distribution?