Types of Machine Learning

Jun 25, 2026 · 10 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Every ML algorithm in this series fits into a broader taxonomy. Before touching a single equation, you need a clear map of what kind of problem you're solving — because choosing the wrong paradigm (clustering when you have labels, regression when your target is categorical) wastes weeks before you realize the mistake.

What Machine Learning Actually Is

Tom Mitchell's 1997 definition remains the most precise: a computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at T, as measured by P, improves with experience E.

That triplet of task, experience, and performance is what distinguishes ML from rule-based programming. In a rule-based spam filter, a human writes a rule such as: if "free money" in subject: mark_as_spam. The rule is explicit and brittle. In an ML spam filter, an algorithm processes thousands of labelled emails (experience), learns to classify new emails as spam or not (task), and is judged by its misclassification rate (performance). The rules emerge from data; no human encodes them.
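The contrast between a hand-written rule and rules that emerge from data can be sketched in a few lines. This is a minimal illustration, not a real spam filter: the training subjects, the function names, and the word-count scoring scheme are all assumptions made for the example.

```python
from collections import Counter

def rule_based_is_spam(subject: str) -> bool:
    # Hand-written rule: explicit, and brittle to any rephrasing.
    return "free money" in subject.lower()

def train_word_scores(emails):
    """Count how often each word appears in spam vs. non-spam subjects."""
    spam_counts, ham_counts = Counter(), Counter()
    for subject, is_spam in emails:
        target = spam_counts if is_spam else ham_counts
        target.update(subject.lower().split())
    return spam_counts, ham_counts

def learned_is_spam(subject, spam_counts, ham_counts):
    # Score = spam occurrences minus non-spam occurrences, summed over words.
    score = sum(spam_counts[w] - ham_counts[w] for w in subject.lower().split())
    return score > 0

# Toy labelled data (hypothetical): this is the "experience" E.
training_emails = [
    ("free money now", True),
    ("claim your cash prize", True),
    ("win cash now", True),
    ("meeting agenda for monday", False),
    ("your invoice for march", False),
    ("lunch on monday", False),
]
spam_counts, ham_counts = train_word_scores(training_emails)

subject = "cash prize waiting"
print(rule_based_is_spam(subject))                        # the fixed rule misses this phrasing
print(learned_is_spam(subject, spam_counts, ham_counts))  # the learned scores flag it
```

The hand-written rule only fires on the exact phrase it was given, while the learned scorer generalizes to any subject that shares words with the spam it has seen.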

Three ingredients are always present: data (the experience), a learning algorithm (the mechanism), and a performance measure (what "better" means). Remove any one and you don't have machine learning — you have either a lookup table, a random process, or an undefined objective.

Supervised Learning

Supervised learning is the paradigm where training data consists of labelled input-output pairs $(x_i, y_i)$. The goal is to learn a function $f: X \to Y$ that maps inputs to outputs. At test time, you hand the model an unseen $x$ and it predicts $\hat{y} = f(x)$.

The data flow is: raw data → feature matrix → model → prediction → compare to true label → compute error → update model. Every supervised algorithm repeats this loop.
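The loop described above can be made concrete with a one-feature linear model trained by gradient descent. This is a sketch under simple assumptions: the toy data, the learning rate, and the choice of mean squared error are illustrative, not prescribed by the series.

```python
# Toy data (hypothetical): generated from y = 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # model parameters, starting from zero
learning_rate = 0.05

for epoch in range(2000):
    # 1. Predict with the current model.
    preds = [w * x + b for x in xs]
    # 2. Compare predictions to the true labels.
    errors = [p - y for p, y in zip(preds, ys)]
    # 3. Compute the error (mean squared error) and its gradients.
    mse = sum(e * e for e in errors) / len(xs)
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # 4. Update the model in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3), round(mse, 6))
```

Other supervised algorithms swap out the model, the error function, or the update rule, but the predict, compare, update cycle stays the same.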

Regression: when $y$ is continuous. Example: predicting the sale price of a house from its square footage, number of bedrooms, and neighborhood. The model outputs a real number, and the error is measured in original units (dollars off, degrees off).

Classification: when $y$ is a discrete class. Examples: classifying an email as spam or not-spam, or recognizing a handwritten digit (0–9). The model outputs a class label or a probability distribution over classes.
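To see the difference in output types, the sketch below computes a regression error in dollars and turns raw classifier scores into a probability distribution. The prices and scores are made-up numbers, and softmax is one common way (assumed here) to obtain class probabilities.

```python
import math

# Regression: outputs are real numbers; error is in the target's units.
true_prices = [250_000.0, 310_000.0, 180_000.0]
pred_prices = [240_000.0, 325_000.0, 185_000.0]
mae = sum(abs(p - t) for p, t in zip(pred_prices, true_prices)) / len(true_prices)
print(f"mean absolute error: ${mae:,.0f}")

# Classification: raw scores become a probability distribution over classes.
def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not-spam", "spam"]
probs = softmax([0.3, 2.1])                      # hypothetical model scores
predicted = classes[probs.index(max(probs))]
print(probs, predicted)
```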

Key algorithms in supervised learning: Linear Regression, Logistic Regression, Decision Trees, SVMs, Neural Networks, KNN. Each appears later in this series with full derivations.

Unsupervised Learning

Unsupervised learning removes the labels. You have a feature matrix $X$ but no corresponding $y$. The goal shifts from "learn to predict $y$" to "find structure in $X$."

Clustering groups similar data points together without being told what the groups should be. K-Means applied to customer purchase histories might discover three clusters — frequent buyers, seasonal shoppers, and one-time purchasers — without anyone defining those categories in advance.
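A minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data shows how groups emerge without labels. The purchase counts and the initial centroids are hypothetical; real implementations pick starting centroids more carefully and work in many dimensions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied initial centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical purchases per year for nine customers.
purchases = [1, 1, 2, 5, 6, 6, 24, 26, 30]
centroids, clusters = kmeans_1d(purchases, centroids=[0.0, 8.0, 20.0])
print(centroids)
print(clusters)
```

Nobody told the algorithm that one-time, seasonal, and frequent buyers exist; the three groups fall out of the distances between points.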

Dimensionality reduction compresses a high-dimensional dataset into fewer dimensions while preserving as much structure as possible. PCA might reduce 100 gene-expression features to 2 principal components that capture most of the variance. t-SNE and UMAP are non-linear alternatives used mainly for visualization; they aim to preserve local neighborhood structure rather than variance.
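The idea behind PCA can be sketched in the smallest possible case: projecting 2-D points onto the single direction of greatest variance. This toy version uses the closed-form eigenvector of a 2x2 covariance matrix, which is an assumption that only works in two dimensions; the sample points are made up.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form angle of the top eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # Fraction of total variance kept by the 1-D projection.
    top_eigenvalue = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    explained = top_eigenvalue / (sxx + syy)

    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return projected, direction, explained

# Hypothetical points lying close to the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
projected, direction, explained = pca_2d_to_1d(points)
print(direction, round(explained, 4))
```

Because the points lie almost on a line, one coordinate along that line retains nearly all of the variance, which is what "preserving structure" means for PCA.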

Density estimation and anomaly detection — learn the distribution of normal data, then flag points that don't fit. A transaction that looks nothing like the rest of your history is anomalous.
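One simple way to act on this idea is to fit a mean and standard deviation to past data and flag values that sit far from it. The transaction amounts and the threshold of three standard deviations are illustrative assumptions, and a single Gaussian is only a rough stand-in for a real density model.

```python
import statistics

# Hypothetical past transaction amounts, treated as "normal" behaviour.
history = [42.0, 38.5, 51.0, 45.2, 39.9, 47.3, 44.1, 40.8, 49.5, 43.7]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    z_score = abs(amount - mean) / stdev
    return z_score > threshold

print(is_anomalous(46.0))    # a typical amount
print(is_anomalous(900.0))   # far outside the fitted distribution
```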

The fundamental difference from supervised learning: there is no ground truth to compare against. "Better" is measured by internal criteria (cluster cohesion, reconstruction error, likelihood) rather than a human-provided label.

Semi-Supervised and Self-Supervised Learning

Semi-supervised learning uses a small labelled set alongside a large unlabelled set. In medical imaging, labelling an MRI scan requires a radiologist — expensive and slow. You might have 200 labelled scans and 50,000 unlabelled ones. Semi-supervised methods use the unlabelled scans to improve the model trained on the 200.
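Self-training (also called pseudo-labelling) is one simple semi-supervised method, sketched here with a nearest-centroid classifier on a single made-up feature. The data, the class names, and the confidence margin of 3.0 are all assumptions for illustration; this is not the only semi-supervised approach.

```python
def fit_centroids(samples):
    """Nearest-centroid classifier: one mean per class."""
    by_class = {}
    for value, label in samples:
        by_class.setdefault(label, []).append(value)
    return {label: sum(v) / len(v) for label, v in by_class.items()}

def predict_with_margin(centroids, value):
    # Margin = how much closer the best centroid is than the runner-up.
    ranked = sorted(centroids, key=lambda label: abs(value - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin

# Hypothetical 1-D feature: a few labelled scans, more unlabelled ones.
labelled = [(1.0, "healthy"), (2.0, "healthy"), (8.0, "lesion"), (9.0, "lesion")]
unlabelled = [2.5, 3.0, 3.2, 4.9, 5.2, 6.8, 7.0, 7.5]

centroids = fit_centroids(labelled)

# Self-training: adopt only confident predictions as pseudo-labels, then refit.
pseudo = []
for value in unlabelled:
    label, margin = predict_with_margin(centroids, value)
    if margin >= 3.0:
        pseudo.append((value, label))

centroids_after = fit_centroids(labelled + pseudo)
print(centroids, centroids_after)
```

The two ambiguous points near the middle are left unlabelled, while the confident ones shift the class centroids toward where the unlabelled data actually lies.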

Self-supervised learning is subtler: labels are generated automatically from the data itself, without human annotation. BERT predicts masked tokens in a sentence; the "label" for each masked word is the word that was there. SimCLR learns image representations contrastively: it pulls together the embeddings of two augmented views of the same image and pushes apart the embeddings of views from different images. The key insight is that the data contains its own supervisory signal.
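The masked-token idea can be shown without any model at all: the sketch below only builds the training pairs. It masks each position in turn for clarity, which is a simplification; BERT-style pretraining masks a random subset of tokens, and the whitespace tokenizer and `[MASK]` string here are assumptions.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabelled sentence into (masked_input, target) pairs."""
    tokens = sentence.split()
    examples = []
    for position, target in enumerate(tokens):
        masked = tokens.copy()
        masked[position] = mask_token
        examples.append((" ".join(masked), target))
    return examples

for masked_input, target in make_masked_examples("the cat sat on the mat"):
    print(f"{masked_input!r} -> {target!r}")
```

Every pair is a supervised training example, yet no human wrote a single label.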

This distinction matters because most real-world data is unlabelled. Self-supervised pretraining is what makes large language models and vision transformers work at scale.

Reinforcement Learning

Reinforcement learning operates on a different structure entirely. An agent acts in an environment, receives a reward (or penalty) signal, and learns a policy — a mapping from states to actions — that maximizes cumulative reward.

Key terms: state (what the environment looks like right now), action (what the agent can do), reward (feedback after each action), policy (the strategy being learned). AlphaGo combined supervised learning on human expert games with reinforcement learning through self-play, and its successor AlphaGo Zero learned from self-play alone. Robot locomotion controllers learn to walk by receiving positive reward for forward progress and negative reward for falling.
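To ground the four terms, here is a small sketch of tabular Q-learning on a five-cell corridor where the agent is rewarded for reaching the right-hand end. The environment, the learning rate, the discount factor, and the purely random exploration are all assumptions chosen to keep the example short.

```python
import random

N_STATES = 5              # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)        # move left or move right
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (arbitrary choices)

def step(state, action):
    """Environment: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

rng = random.Random(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore uniformly at random; Q-learning is off-policy, so it can
        # still learn the values of the greedy policy.
        action = rng.choice(ACTIONS)
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Temporal-difference update toward reward + discounted future value.
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# The learned policy: the highest-valued action in each non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The policy is read off the value table at the end: in each cell, take the action with the highest estimated cumulative reward.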

RL is not covered in this series — it requires dedicated treatment of Markov decision processes and temporal-difference learning. For a rigorous introduction, Sutton and Barto's Reinforcement Learning: An Introduction (freely available online) is the standard reference.

Parametric vs Non-Parametric Models

This distinction cuts across all paradigms and matters for choosing between algorithms.

Parametric models have a fixed number of parameters regardless of how much training data you have. Linear regression on $d$ features has exactly $d + 1$ parameters ($d$ weights plus one intercept); feeding it 100 samples or 10 million samples doesn't change that count. Parameters are learned during training; at inference, you only need the parameters, not the training data. Such models are fast at inference and often interpretable, but constrained to whatever family of functions the parameters can express.

Non-parametric models let complexity grow with data. KNN stores every training point: with 1 million training examples, you carry 1 million data points to inference time. Kernel SVMs are similar in that they retain their support vectors, a subset of the training points whose size typically grows with the dataset. Non-parametric models make fewer assumptions and can represent more complex functions, but they tend to be memory-intensive and slow at inference.
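The difference in what each kind of model has to keep around can be checked directly. In this sketch, `fit_line` and `NearestNeighbourRegressor` are hypothetical helper names, and the data is synthetic; the point is only to compare how much state each model carries as the training set grows.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b; returns just two numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - w * mean_x
    return w, b

class NearestNeighbourRegressor:
    """1-NN: 'training' only memorises the data."""

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Inference scans every stored point.
        nearest = min(self.points, key=lambda p: abs(p[0] - x))
        return nearest[1]

for n in (10, 1000):
    xs = [float(i) for i in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]
    params = fit_line(xs, ys)
    knn = NearestNeighbourRegressor().fit(xs, ys)
    # Parameter count stays at 2; stored points grow with n.
    print(n, len(params), len(knn.points))
```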

Linear regression, the subject of the next post, is supervised, parametric, and regression-typed. That post derives exactly what those parameters mean geometrically.

The General ML Workflow

Every ML project, regardless of algorithm, follows the same nine stages:

  1. Collect and understand data — EDA reveals distributions, missing values, outliers, and target imbalance before any model is built.
  2. Preprocess — handle missing values, encode categorical variables, scale numerical features so no single feature dominates by magnitude.
  3. Feature engineering — create or transform inputs to make the signal easier for the model to extract (log transforms, interaction terms, embeddings).
  4. Split into train / validation / test — the test set is held out until final evaluation; validation guides hyperparameter choices.
  5. Choose and train a model — select an algorithm appropriate for the task type and fit it on the training set.
  6. Evaluate on the validation set — measure performance using the task's metric (RMSE for regression, accuracy or F1 for classification).
  7. Tune hyperparameters — adjust regularization strength, learning rate, tree depth, or other knobs based on validation performance.
  8. Final evaluation on the held-out test set — one shot at an unbiased performance estimate; touching this set earlier invalidates it.
  9. Deploy and monitor — serve the model in production, track data drift and performance degradation, retrain when needed.
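Stages 4 through 8 can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions: the data is generated rather than collected, the model is a simple KNN regressor, the only hyperparameter is k, and preprocessing, feature engineering, and deployment are skipped.

```python
import random

rng = random.Random(42)
# Hypothetical data: y = 3x plus Gaussian noise.
inputs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0, 1.0)) for x in inputs]

# Stage 4: shuffle, then split 60/20/20 into train / validation / test.
rng.shuffle(data)
train, val, test = data[:120], data[120:160], data[160:]

def knn_predict(train_set, x, k):
    """Average the targets of the k training points nearest to x."""
    neighbours = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

def rmse(train_set, eval_set, k):
    squared = [(knn_predict(train_set, x, k) - y) ** 2 for x, y in eval_set]
    return (sum(squared) / len(squared)) ** 0.5

# Stages 5-7: fit each candidate on train, compare on validation only.
candidates = [1, 3, 5, 9, 15]
val_scores = {k: rmse(train, val, k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# Stage 8: touch the test set exactly once, with the chosen hyperparameter.
test_rmse = rmse(train, test, best_k)
print(best_k, round(test_rmse, 3))
```

The test split never influences the choice of k, which is what keeps the final number an honest estimate.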

[Figure: Taxonomy of machine learning. Supervised: regression (house price, temperature) and classification (spam, digit). Unsupervised: clustering (customer segments), dimensionality reduction (PCA, t-SNE), anomaly detection (fraud, intrusion). Semi-supervised (medical imaging). Reinforcement learning (games, robotics).]

[Figure: ML workflow. 1. Data (EDA) → 2. Preprocess (clean, scale) → 3. Feature engineering (transform) → 4. Split (train/val/test) → 5. Train (fit model) → 6. Evaluate (RMSE, F1) → Good enough? If yes: 8. Test set (final eval) → 9. Deploy (monitor). If no: 7. Tune hyperparameters, then return to 5.]

Learning Paradigm Comparison

| Paradigm | Labels Needed | Goal | Example Algorithm | Example Task |
| --- | --- | --- | --- | --- |
| Supervised | Yes (input + output) | Learn $f: X \to Y$ | Linear Regression | Predict house price |
| Unsupervised | No | Find structure in $X$ | K-Means | Customer segmentation |
| Semi-Supervised | Partial (few labels) | Leverage unlabelled data | Label Propagation | Medical image classification |
| Reinforcement | Reward signal | Learn optimal policy | Q-Learning | Game playing |

Parametric vs Non-Parametric

| Property | Parametric | Non-Parametric |
| --- | --- | --- |
| Parameters | Fixed count ($d + 1$ for linear regression on $d$ features) | Grows with training data |
| Training complexity | Lower: optimize a fixed set | Depends on method: KNN only stores the data, kernel methods consider all points |
| Inference speed | Fast: for linear models, just a dot product | Slow: search or kernel over training set |
| Memory usage | Low: parameters only | High: must retain training data |
| Flexibility | Constrained to model family | Arbitrary decision boundaries |
| Example models | Linear Regression, Logistic Regression, Neural Networks | KNN, Kernel SVM, Gaussian Processes |

The taxonomy above is a teaching convenience, not a rigid classification; the boundaries blur in practice. Semi-supervised learning can be viewed as supervised learning with a missing-data model for unlabelled examples. Self-supervised pretraining followed by supervised fine-tuning doesn't fit neatly into any single box. Representation learning cuts across all paradigms: you can learn representations supervised (ResNet), unsupervised (VAE), or self-supervised (BERT).

The honest limitation of this framing: knowing which paradigm a problem belongs to doesn't tell you which algorithm to pick. A regression problem with 50 features and 500 samples calls for a different model than one with 50 features and 50 million samples — even though both are "supervised regression." The workflow diagram above is where that decision actually happens: step 5 requires understanding your data size, feature types, interpretability requirements, and inference latency constraints.

Test Your Understanding

  1. A model for predicting tomorrow's stock price from today's volume, price, and 20-day moving average — which paradigm is it, and why? Which sub-type of that paradigm?

  2. You have 10,000 customer transactions with no fraud labels, but you know fraud accounts for roughly 0.1% of transactions. What learning paradigm would you start with, and what performance measure would you use?

  3. K-Nearest Neighbors stores all training points and predicts by majority vote among the k closest. Is KNN parametric or non-parametric? What happens to its inference time as training set size doubles?

  4. Self-supervised learning is sometimes called "unsupervised" in older papers. Based on the distinction made above, why is that label inaccurate?

  5. In the ML workflow, why is it a mistake to look at the test set before step 8 — even just to check the distribution?
