Perceptron Intuition
The perceptron is the atom of deep learning. Every multi-billion-parameter transformer you've heard about is, at its core, a vast arrangement of units that trace their lineage directly to the perceptron Frank Rosenblatt described in 1958. Before you can reason about why modern networks learn, you need to understand how a single unit decides, and how it corrects itself when it's wrong.
We'll use one dataset throughout: predicting loan approval from annual income and credit score.

    X = [[45000, 700], [28000, 580], [72000, 750], [19000, 520], [55000, 680]]
    y = [1, 0, 1, 0, 1]  # 1 = approved, 0 = rejected

Raw income and credit scores live on completely different scales. We normalize before use (income divided by 100,000, credit score divided by 1,000) so both features sit in roughly the same [0, 1] range:

    X_norm = [[0.45, 0.70], [0.28, 0.58], [0.72, 0.75], [0.19, 0.52], [0.55, 0.68]]

The Biological Inspiration
A biological neuron receives electrical signals through its dendrites — branching filaments that act like antennae picking up signals from neighboring neurons. Those signals travel toward the soma (cell body), which accumulates them. When the accumulated signal exceeds a threshold, the soma fires an electrical pulse down the axon, which carries the signal onward to the next neuron. Below threshold, nothing fires.
The mathematical perceptron is a loose abstraction of this process. Inputs arrive weighted by how important each one is (analogous to dendrite connection strength). The unit sums them and adds a bias offset. If the sum is at or above zero, the unit outputs 1: it "fires." Otherwise it outputs 0.
The Perceptron Model
The perceptron computes a weighted sum of inputs, adds a bias, and passes the result through a step function:
ŷ = step(w₁x₁ + w₂x₂ + b)
where the step function is:
step(z) = 1 if z ≥ 0, else 0
The weight w₁ controls how much income matters. The weight w₂ controls how much credit score matters. The bias b shifts the decision threshold — without it, the boundary would always pass through the origin, which is rarely where the data needs it.
Forward Pass — Sample 1
Use initial weights w₁ = 0.6, w₂ = 0.4, b = −0.5 on the first applicant (income = 0.45, credit = 0.70):
z = w₁x₁ + w₂x₂ + b
z = 0.6 × 0.45 + 0.4 × 0.70 − 0.5
z = 0.27 + 0.28 − 0.5
z = 0.05
step(0.05) = 1 — loan approved. The true label is 1. Correct.
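The same forward pass can be written as a few lines of Python. This is a minimal sketch: the helper names `step` and `forward` are introduced here for illustration and are not part of any library.

```python
def step(z):
    """Heaviside step as defined above: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def forward(x, w, b):
    """Return (z, y_hat) for one sample x, given weights w and bias b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return z, step(z)


# Initial weights from the text, applied to the first applicant.
w, b = [0.6, 0.4], -0.5
z, y_hat = forward([0.45, 0.70], w, b)
print(f"z = {z:.2f}, y_hat = {y_hat}")
```

Note that `step(0)` returns 1, so a sample lying exactly on the boundary counts as approved under this definition.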
Decision Boundary
A perceptron draws a straight line in feature space. On one side of the line: output 1. On the other: output 0. The line is called the decision boundary.
Setting z = 0 gives the boundary equation:
w₁x₁ + w₂x₂ + b = 0
Solving for x₂ (credit score on the y-axis):
x₂ = (−w₁x₁ − b) / w₂
With our anchor weights (w₁ = 0.6, w₂ = 0.4, b = −0.5):
x₂ = (−0.6x₁ + 0.5) / 0.4
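At x₁ = 0: x₂ = 0.5/0.4 = 1.25
At x₁ = 1: x₂ = (−0.6 + 0.5)/0.4 = −0.1/0.4 = −0.25
So the boundary runs from (0, 1.25) to (1, −0.25). Because w₂ is positive, samples on or above this line get approved; those below it are rejected.
Sample 2 (income = 0.28, credit = 0.58) sits below the line (at x₁ = 0.28 the boundary is at x₂ = 0.83), so these weights already reject it correctly. To see the learning rule in action we need weights that get a sample wrong, which the update example further down sets up by shifting the bias.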
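To check the geometry numerically, the sketch below evaluates the boundary at each applicant's income and reports which side of the line the applicant falls on. The function name `boundary_x2` is an illustrative choice, not something defined elsewhere in this post.

```python
w1, w2, b = 0.6, 0.4, -0.5


def boundary_x2(x1):
    """Credit score (x2) on the decision boundary at a given income (x1)."""
    return (-w1 * x1 - b) / w2


X_norm = [[0.45, 0.70], [0.28, 0.58], [0.72, 0.75], [0.19, 0.52], [0.55, 0.68]]

sides = []
for x1, x2 in X_norm:
    # With w2 > 0, "on or above the line" is the same condition as z >= 0.
    side = "above" if x2 >= boundary_x2(x1) else "below"
    sides.append(side)
    print(f"x1={x1:.2f} x2={x2:.2f} boundary={boundary_x2(x1):.3f} -> {side}")
```

With these weights the approved applicants land above the line and the rejected ones below it.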
The Learning Rule
The perceptron doesn't adjust weights with calculus. It uses a simpler rule: if you got it right, do nothing. If you got it wrong, nudge the weights toward the correct answer.
wⱼ ← wⱼ + η(y − ŷ)xⱼ
b ← b + η(y − ŷ)
Here η (eta) is the learning rate — how big a step to take on each correction. The term (y − ŷ) is the error: +1 if we said 0 but should have said 1, −1 if we said 1 but should have said 0, and 0 if we were right.
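One way to see why this rule pushes in the right direction is to re-evaluate the same sample immediately after an update. The short derivation below uses only the symbols already defined, plus e as shorthand (introduced here) for the error y − ŷ, and z′ for the weighted sum after the update:

```latex
\begin{aligned}
z' &= \sum_j \bigl(w_j + \eta e\, x_j\bigr)\, x_j + (b + \eta e) \\
   &= z + \eta e \Bigl(\sum_j x_j^2 + 1\Bigr), \qquad e = y - \hat{y}
\end{aligned}
```

The bracketed term is always positive, so z moves up when e = +1 and down when e = −1. A single update is not guaranteed to flip the prediction; it only moves z toward the correct side.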
Update — Sample 2
Sample 2: x₁ = 0.28, x₂ = 0.58, true label y = 0.
Forward pass with the weights used so far (w₁ = 0.6, w₂ = 0.4, b = −0.5):
z = 0.6 × 0.28 + 0.4 × 0.58 − 0.5 = 0.168 + 0.232 − 0.5 = −0.1 → step(−0.1) = 0
That gives ŷ = 0, which is already correct for sample 2, so no update is triggered. To demonstrate an error, suppose instead that the weights start at w₁ = 0.6, w₂ = 0.4, b = −0.3 (a slightly higher bias), and take η = 0.1:
z = 0.6×0.28 + 0.4×0.58 − 0.3 = 0.168 + 0.232 − 0.3 = 0.1 → ŷ = 1
True label y = 0. Error = y − ŷ = 0 − 1 = −1
Δw₁ = η × error × x₁ = 0.1 × (−1) × 0.28 = −0.028 → w₁ = 0.6 − 0.028 = 0.572
Δw₂ = η × error × x₂ = 0.1 × (−1) × 0.58 = −0.058 → w₂ = 0.4 − 0.058 = 0.342
Δb = η × error = 0.1 × (−1) = −0.1 → b = −0.3 − 0.1 = −0.4
Verification — Does Sample 2 Now Classify Correctly?
Re-run sample 2 with updated weights (w₁ = 0.572, w₂ = 0.342, b = −0.4):
z = 0.572 × 0.28 + 0.342 × 0.58 − 0.4
z = 0.16016 + 0.19836 − 0.4
z = −0.04148 → step(−0.04148) = 0 ✓
Prediction is now 0, matching the true label. The boundary shifted enough to correctly reject this applicant.
The update follows a clean logic: when the error is −1 (predicted 1, should be 0), each weight decreases in proportion to its input. The feature with the larger input value (here credit score, 0.58 versus 0.28) therefore receives the larger correction.
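The worked update can be replayed in code. This is a sketch under the same assumptions as the example (starting weights 0.6, 0.4, bias −0.3, η = 0.1); `perceptron_update` is a hypothetical helper name chosen for this illustration.

```python
def perceptron_update(w, b, x, y, lr=0.1):
    """Apply one perceptron update for sample (x, y); return new w, new b, error."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    y_hat = 1 if z >= 0 else 0
    error = y - y_hat  # +1, -1, or 0
    new_w = [wj + lr * error * xj for wj, xj in zip(w, x)]
    new_b = b + lr * error
    return new_w, new_b, error


x, y = [0.28, 0.58], 0  # sample 2
w, b, error = perceptron_update([0.6, 0.4], -0.3, x, y, lr=0.1)

# Re-run the forward pass with the updated parameters.
z_after = sum(wj * xj for wj, xj in zip(w, x)) + b
print(error, [round(v, 3) for v in w], round(b, 1), round(z_after, 5))
```

The printed values should match the hand calculation up to floating-point rounding.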
Convergence — When Does It Stop?
The perceptron converges in a finite number of updates if and only if the data is linearly separable; the "if" direction is the Perceptron Convergence Theorem. If some straight line (or hyperplane in higher dimensions) perfectly separates the two classes, the algorithm is guaranteed to find a separating line, though not any particular one.
The problem arises when no such line exists. The classic example is XOR:
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single straight line can separate the 1s from the 0s. The perceptron will keep updating forever, cycling through errors without converging.
This is the fundamental weakness of the single-layer perceptron, and it's why researchers introduced hidden layers — multi-layer networks can learn non-linear decision boundaries by composing multiple hyperplanes.
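A small experiment makes the XOR failure concrete. The sketch below runs the update rule on the four XOR points and counts mistakes per pass; the 50-epoch cap and the learning rate of 0.1 are arbitrary choices made here so that the loop terminates.

```python
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

w, b, lr = np.zeros(2), 0.0, 0.1
errors_per_epoch = []
for epoch in range(50):  # arbitrary cap so the loop terminates
    errors = 0
    for xi, yi in zip(X_xor, y_xor):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
        if y_hat != yi:
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
            errors += 1
    errors_per_epoch.append(errors)
    if errors == 0:  # an error-free pass would mean convergence
        break

print("fewest mistakes in any epoch:", min(errors_per_epoch))
```

An error-free pass would mean the current weights separate XOR with a single line, which is impossible, so the early exit never triggers and every epoch records at least one mistake.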
Learning Trace
Walking through the first four samples with η = 0.1, starting weights w₁ = 0.6, w₂ = 0.4, b = −0.5:
| Step | Sample | x₁ | x₂ | z | ŷ | y | Error | Δw₁ | Δw₂ | Δb |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S1 (income=0.45, credit=0.70) | 0.45 | 0.70 | 0.6×0.45+0.4×0.70−0.5 = 0.05 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | S2 (income=0.28, credit=0.58) | 0.28 | 0.58 | 0.6×0.28+0.4×0.58−0.5 = −0.10 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | S3 (income=0.72, credit=0.75) | 0.72 | 0.75 | 0.6×0.72+0.4×0.75−0.5 = 0.232 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | S4 (income=0.19, credit=0.52) | 0.19 | 0.52 | 0.6×0.19+0.4×0.52−0.5 = −0.178 | 0 | 0 | 0 | 0 | 0 | 0 |
With these particular initial weights, all four samples are already correct — demonstrating that good initialization can mean few or no updates needed. The trace shows how z is computed and how the error drives (or doesn't drive) weight changes.
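The trace can be regenerated programmatically, which is a handy check on the arithmetic in the z column. This is a sketch that assumes the same starting weights and learning rate as the table.

```python
samples = [
    ([0.45, 0.70], 1),
    ([0.28, 0.58], 0),
    ([0.72, 0.75], 1),
    ([0.19, 0.52], 0),
]
w, b, lr = [0.6, 0.4], -0.5, 0.1

trace = []
for step_no, (x, y) in enumerate(samples, start=1):
    z = w[0] * x[0] + w[1] * x[1] + b
    y_hat = 1 if z >= 0 else 0
    error = y - y_hat
    # Apply the update; with error == 0 the weights stay unchanged.
    w = [w[0] + lr * error * x[0], w[1] + lr * error * x[1]]
    b = b + lr * error
    trace.append((step_no, round(z, 3), y_hat, y, error))
    print(f"step {step_no}: z={z:+.3f} y_hat={y_hat} y={y} error={error}")
```

Every row has zero error, so the weights at the end are the same as at the start.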
Implementation

    import numpy as np


    class Perceptron:
        def __init__(self, lr=0.1, n_iter=100):
            self.lr = lr          # learning rate (eta)
            self.n_iter = n_iter  # number of passes over the data

        def fit(self, X, y):
            # Start from zero weights and zero bias.
            self.w = np.zeros(X.shape[1])
            self.b = 0.0
            for _ in range(self.n_iter):
                for xi, yi in zip(X, y):
                    y_hat = self.predict_one(xi)
                    # (yi - y_hat) is 0 for a correct prediction, so nothing changes.
                    self.w += self.lr * (yi - y_hat) * xi
                    self.b += self.lr * (yi - y_hat)

        def predict_one(self, x):
            return 1 if np.dot(self.w, x) + self.b >= 0 else 0

        def predict(self, X):
            return [self.predict_one(x) for x in X]


    X_norm = np.array([
        [0.45, 0.70],
        [0.28, 0.58],
        [0.72, 0.75],
        [0.19, 0.52],
        [0.55, 0.68],
    ])
    y = np.array([1, 0, 1, 0, 1])

    p = Perceptron(lr=0.1, n_iter=100)
    p.fit(X_norm, y)
    preds = p.predict(X_norm)
    accuracy = sum(pred == true for pred, true in zip(preds, y)) / len(y)
    print(f"Final weights: w1={p.w[0]:.4f}, w2={p.w[1]:.4f}, b={p.b:.4f}")
    print(f"Predictions: {preds}")
    print(f"True labels: {y.tolist()}")
    print(f"Accuracy: {accuracy * 100:.1f}%")

Running this prints:

    Final weights: w1=0.1860, w2=0.0280, b=-0.1000
    Predictions: [1, 0, 1, 0, 1]
    True labels: [1, 0, 1, 0, 1]
    Accuracy: 100.0%

The perceptron converges quickly on this linearly separable dataset. Both learned weights are positive, so higher income and a higher credit score each push toward approval, with the bias acting as the threshold offset. The exact values are simply the first separating boundary the updates happened to reach; many other weight settings classify this data equally well.
Learning Rate Sensitivity

    for lr in [0.001, 0.1, 1.0, 10.0]:
        p = Perceptron(lr=lr, n_iter=100)
        p.fit(X_norm, y)
        preds = p.predict(X_norm)
        acc = sum(a == b for a, b in zip(preds, y)) / len(y)
        print(f"lr={lr:5.3f} w1={p.w[0]:7.4f} w2={p.w[1]:7.4f} b={p.b:7.4f} acc={acc*100:.1f}%")

Running this prints:

    lr=0.001 w1= 0.0019 w2= 0.0003 b=-0.0010 acc=100.0%
    lr=0.100 w1= 0.1860 w2= 0.0280 b=-0.1000 acc=100.0%
    lr=1.000 w1= 1.8600 w2= 0.2800 b=-1.0000 acc=100.0%
    lr=10.000 w1=18.6000 w2= 2.8000 b=-10.0000 acc=100.0%

Accuracy stays at 100% across all learning rates, and the weights scale in proportion to the learning rate. That is a consequence of starting from zero weights: every z is multiplied by the same positive factor η, step(z) only looks at the sign, and so each run makes the same sequence of mistakes and updates. For this implementation the learning rate changes the size of the weights but not the predictions or the number of updates needed. It starts to matter when the initial weights are nonzero, and in gradient-based models such as logistic regression, where a step that is too large can oscillate and one that is too small can need far more iterations.
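The proportional scaling can be checked directly rather than read off the printed table. The sketch below uses a standalone `train` function (a hypothetical helper written for this check, equivalent in behavior to `fit` above when starting from zero weights) and compares two learning rates.

```python
import numpy as np


def train(X, y, lr, n_iter=100):
    """Perceptron training from zero weights; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b


X_norm = np.array([[0.45, 0.70], [0.28, 0.58], [0.72, 0.75], [0.19, 0.52], [0.55, 0.68]])
y = np.array([1, 0, 1, 0, 1])

w_small, b_small = train(X_norm, y, lr=0.1)
w_big, b_big = train(X_norm, y, lr=1.0)

# A 10x larger learning rate should give 10x larger parameters.
print(np.allclose(w_big, 10 * w_small), np.isclose(b_big, 10 * b_small))
```

If the initial weights were nonzero, this scaling would no longer hold, because the starting point would not rescale along with η.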
Honest Limitations
Cannot learn non-linear boundaries. With XOR or any dataset where no straight line separates the classes, the perceptron update loop runs indefinitely. The weights oscillate without ever reaching zero error. This isn't a tuning problem — it's a fundamental structural limit.
Hard threshold gives no probability. The step function outputs exactly 0 or 1. There's no way to express confidence — a perceptron cannot tell you "72% likely approved." This matters for real loan decisions where a borderline case (z = 0.001) should be treated differently than a clear approval (z = 0.8). Logistic regression solves this by replacing the step with a sigmoid.
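To make the contrast concrete, the sketch below feeds the two z values mentioned above through both the step function and a sigmoid. The sigmoid here is the standard logistic function; reading its output as an approval probability is an interpretation borrowed from logistic regression, not something the perceptron itself provides.

```python
import math


def step(z):
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (0.001, 0.8):
    print(f"z={z:<6} step={step(z)} sigmoid={sigmoid(z):.3f}")
```

The step function returns 1 for both inputs, while the sigmoid separates the borderline case (close to 0.5) from the clearer one.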
Sensitive to feature scale. Without normalization, income (45,000) has a magnitude 64× larger than credit score (700). The weight on income will dominate every update, and the weight on credit score barely moves. Even a single large-magnitude feature can make the learning rule effectively blind to the other features until weights are astronomically large or tiny.
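The effect of scale shows up in a single update. This sketch compares the step the learning rule would take on sample 2 with raw versus normalized features, assuming η = 0.1 and an error of −1 (the values used in the update example earlier).

```python
import numpy as np

X = np.array(
    [[45000, 700], [28000, 580], [72000, 750], [19000, 520], [55000, 680]],
    dtype=float,
)
scale = np.array([100_000, 1_000])  # divisors used in this post
X_norm = X / scale                  # broadcasting divides each column

lr, error = 0.1, -1
raw_step = lr * error * X[1]        # update on raw sample 2
norm_step = lr * error * X_norm[1]  # update on normalized sample 2

print("raw update:       ", raw_step)
print("normalized update:", norm_step)
```

On raw features the income weight moves by thousands while the credit weight moves by tens; after normalization the two steps have comparable size.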
Related Concepts
Where this builds from:
- Biological neurons — the direct inspiration; the perceptron formalizes the threshold-firing model
- Linear algebra (dot products) — the weighted sum w·x is a dot product; understanding geometry of dot products explains why the decision boundary is always a hyperplane
Where this leads:
- Artificial Neural Networks (ANN) — stack multiple perceptron-like units in layers; hidden layers let the network compose linear boundaries into non-linear ones
- Logistic regression — replace the hard step function with sigmoid; you get a probabilistic output and a smooth, differentiable loss that enables gradient descent
- Multilayer networks and backpropagation — the perceptron update rule is a degenerate case of gradient descent; once you have multiple layers, you need the chain rule to propagate error signals backward through each layer
Test Your Understanding
- Conceptual. The perceptron update rule is wⱼ ← wⱼ + η(y − ŷ)xⱼ. When the model predicts correctly (y = ŷ), what happens to the weights, and why does this make sense geometrically?
- Applied. A new applicant has income = 60,000 and credit score = 640. Normalize these and compute the perceptron output using the weights w₁ = 0.3, w₂ = 0.3, b = −0.3, which also separate the five training samples. Show all steps.
- Applied. If you did not normalize the input features and used raw values (income = 45,000, credit = 700), what would happen to w₁ and w₂ after one update on a misclassified sample with η = 0.1? Why is this problematic?
- Edge case. Suppose a dataset has two classes that are perfectly separated but the decision boundary must pass through the origin (b = 0 is optimal). You initialize b = 0 and the algorithm never encounters a misclassified sample. Does b stay 0 forever, and is that a problem? What if the optimal boundary does not pass through the origin but you freeze b at 0?
- Edge case. You train a perceptron on a dataset that is linearly separable. The algorithm converges to 100% accuracy. Then you add one new sample that makes the dataset non-linearly separable. Describe precisely what the training loop will do indefinitely, and how you would detect this programmatically without running forever.