ANN Intuition and Learning
A perceptron can draw exactly one straight line. Churn prediction, fraud detection, and handwriting recognition all require curved, non-linear decision boundaries — patterns no single neuron can express. An artificial neural network (ANN) breaks that ceiling by stacking neurons across layers, where each layer transforms the representation from the previous one until the composed result can fit nearly any continuous pattern the data contains.
The dataset used throughout is a 5-sample customer churn problem. Features are normalized tenure and normalized monthly charges; the label is 1 if the customer churned.
X = [[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]]
# features: [tenure_norm, monthly_charges_norm]
y = [0, 1, 0, 1, 0]  # 1 = churned
What an ANN Is
An ANN is a directed acyclic graph of neurons organized into layers. Data flows in one direction: input → hidden layers → output. No cycles, no feedback during a single forward pass.
Every neuron in a hidden or output layer computes three things in sequence:
Weighted sum: multiply each incoming value by a learnable weight, sum them, and add a learnable bias:
z = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b
Activation: pass z through a non-linear function. Without this step, stacked layers would collapse into a single linear (affine) transformation, no matter how many layers you add. The non-linearity is what makes depth useful.
Output: send the activated value forward to every neuron in the next layer.
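The three steps above can be sketched as a small function. This is a minimal illustration, not the churn network itself: the names `neuron` and `relu` and the input values are assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    """ReLU activation: keep positive values, zero out negatives."""
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    """One neuron: weighted sum of the inputs plus a bias, then a non-linearity."""
    z = np.dot(w, x) + b   # step 1: weighted sum
    a = activation(z)      # step 2: activation
    return z, a            # step 3: `a` is the value sent to the next layer

# Illustrative values only (not taken from the churn network).
x = np.array([1.0, 0.25])
w = np.array([0.5, -1.0])
b = 0.25

z, a = neuron(x, w, b, relu)
print(z, a)  # z = 0.5*1.0 + (-1.0)*0.25 + 0.25 = 0.5, and ReLU leaves it at 0.5
```

A layer is this same computation repeated for several neurons that share the input `x` but each have their own `w` and `b`.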
For the churn network, the architecture is 2 inputs → 2 hidden neurons (ReLU) → 1 output neuron (sigmoid). The weights are:
- W1 = [[0.5, −0.2], [0.3, 0.8]], b1 = [0.1, −0.1]
- W2 = [[0.6], [−0.4]], b2 = [0.0]
Bias is per-neuron, not shared across the network: b1 holds one value for each of the two hidden neurons, and b2 holds the single value for the output neuron. Row i of W1 holds the incoming weights of hidden neuron i.
Universal Approximation Theorem
A neural network with one hidden layer, a non-polynomial activation such as ReLU or sigmoid, and enough neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary precision. This is the Universal Approximation Theorem (UAT), and it explains why even shallow networks are so powerful in theory.
The qualifier matters: "enough neurons" can mean exponentially many for complex functions. The theorem guarantees that a solution exists in the weight space — it says nothing about whether gradient descent can find it, or how long training takes.
A practical reading: one hidden layer with a modest neuron count (32–256) handles most smooth, bounded functions well. What requires depth isn't expressiveness per se, but efficiency — a deep network can represent the same function with far fewer parameters than a wide shallow one.
With 2 neurons the network can only bend its output twice — the resulting curve misses the peaks and troughs badly. At 20 neurons the approximation is indistinguishable from the true function at this resolution.
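The effect of width can be reproduced with a rough sketch. To stay short, it does not train a network with gradient descent: it assumes random hidden weights and fits only the output weights by least squares, a shortcut that will not match a fully trained network. The target function `sin(2x)`, the seed, and the helper `fit_error` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth 1-D function on a bounded interval (chosen for illustration).
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(2 * x)

# One hidden layer of 20 ReLU units with random weights and biases.
# Only the output weights are fitted, standing in for real training.
w_hidden = rng.normal(size=20)
b_hidden = rng.normal(size=20)
hidden = np.maximum(0.0, np.outer(x, w_hidden) + b_hidden)  # shape (200, 20)

def fit_error(features):
    """Least-squares fit of output weights plus bias; returns mean squared error."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((design @ coef - target) ** 2))

err_2 = fit_error(hidden[:, :2])   # only the first 2 hidden units
err_20 = fit_error(hidden)         # all 20 hidden units
print(f"MSE with 2 units:  {err_2:.4f}")
print(f"MSE with 20 units: {err_20:.4f}")
```

Because the 2-unit model uses a subset of the 20-unit model's features, the wider fit can never be worse on this data. How much better it gets depends on the random draw.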
Forward Pass — Sample 1: x = [0.5, 0.1], y = 0
The forward pass is the calculation that converts raw inputs into a prediction. Every training iteration begins here. The first sample is a customer who did not churn (y=0), so a well-trained network should output a value close to 0.
Phase 1 — Hidden Layer
Each hidden neuron computes z = W·x + b, then applies ReLU: a = max(0, z).
Neuron h₁ uses row 0 of W1 = [0.5, −0.2]:
z1₁ = 0.5 × 0.5 + (−0.2) × 0.1 + 0.1
= 0.25 − 0.02 + 0.1
= 0.33
a1₁ = ReLU(0.33) = 0.33
Neuron h₂ uses row 1 of W1 = [0.3, 0.8]:
z1₂ = 0.3 × 0.5 + 0.8 × 0.1 + (−0.1)
= 0.15 + 0.08 − 0.1
= 0.13
a1₂ = ReLU(0.13) = 0.13
Both pre-activations are positive, so ReLU passes them through unchanged. Had either been negative, ReLU would have zeroed it, creating a sparse, efficient representation.
The hidden layer is now computed. Its outputs [0.33, 0.13] are the learned internal representation of this customer — a compressed signal the output layer will use to produce a churn probability.
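The hidden-layer arithmetic above, including the case where ReLU does zero a value, can be checked with a short sketch. The helper name `hidden_layer` and the second input `[0.0, 1.0]` are assumptions: that input is not in the dataset and was picked only so that one pre-activation goes negative.

```python
import numpy as np

W1 = np.array([[0.5, -0.2], [0.3, 0.8]])  # row i = incoming weights of hidden neuron i
b1 = np.array([0.1, -0.1])

def hidden_layer(x):
    """Return pre-activations z1 and ReLU activations a1 for one input vector."""
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    return z1, a1

# Sample 1 from the walkthrough: both pre-activations are positive.
z1, a1 = hidden_layer(np.array([0.5, 0.1]))
print(np.round(z1, 2), np.round(a1, 2))  # z1 and a1 are both about [0.33, 0.13]

# Hypothetical input (not in the dataset) chosen so that h1 goes negative.
z1_neg, a1_neg = hidden_layer(np.array([0.0, 1.0]))
print(np.round(z1_neg, 2), np.round(a1_neg, 2))  # h1: z = -0.1 -> a = 0; h2: z = 0.7 -> a = 0.7
```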
Phase 2 — Output Layer
The output neuron uses W2 = [[0.6], [−0.4]] and b2 = 0.0. It takes the hidden activations as inputs:
z2 = 0.6 × 0.33 + (−0.4) × 0.13 + 0.0
= 0.198 − 0.052
= 0.146
ŷ = σ(0.146) = 1 / (1 + e^−0.146)
= 1 / (1 + 0.8641)
≈ 0.536
The sigmoid squashes z2 into (0, 1), making it interpretable as a probability. A prediction of 0.536 means the network gives this customer a 53.6% chance of churning: barely above random, and badly wrong since the true label is y=0.
Phase 3 — Loss
The prediction ŷ=0.536 will be compared against the true label y=0 using Binary Cross-Entropy (BCE):
BCE = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)]
= −[0 · log(0.536) + 1 · log(1 − 0.536)]
= −log(0.464)
≈ 0.767
A loss of 0.767 is high; a perfect prediction would give BCE ≈ 0. The network has assigned 53.6% probability to churn when the correct answer is 0% (no churn). This error is what backpropagation will use to compute weight updates.
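To get a feel for the scale of BCE, the sketch below evaluates it for a non-churner (y = 0) across a range of predictions. The function name `bce` and its `eps` clipping parameter are assumptions for this example. Clipping is one common way to keep `log` away from zero, and differs slightly from adding a small constant inside the logarithm.

```python
import numpy as np

def bce(y, y_hat, eps=1e-8):
    """Binary cross-entropy for one label/prediction pair (natural log)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() away from 0
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# True label y = 0 (no churn); predictions run from confidently right to confidently wrong.
for y_hat in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"y_hat = {y_hat:<4}  BCE = {bce(0, y_hat):.3f}")
```

The loss grows slowly while the prediction is on the correct side of 0.5 and steeply once it is confidently wrong, which is why BCE punishes overconfident mistakes so heavily.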
Forward Pass at a Glance
| Phase | Computation | Values | Result |
|---|---|---|---|
| Hidden z1₁ | w·x+b | 0.5×0.5 + (−0.2)×0.1 + 0.1 | 0.33 |
| Hidden a1₁ | ReLU(z1₁) | ReLU(0.33) | 0.33 |
| Hidden z1₂ | w·x+b | 0.3×0.5 + 0.8×0.1 − 0.1 | 0.13 |
| Hidden a1₂ | ReLU(z1₂) | ReLU(0.13) | 0.13 |
| Output z2 | w·a+b | 0.6×0.33 + (−0.4)×0.13 + 0.0 | 0.146 |
| Output ŷ | σ(z2) | σ(0.146) = 1/(1+e^−0.146) | 0.536 |
| Loss | BCE | −log(1 − 0.536) = −log(0.464) | 0.767 |
The Code
import numpy as np
def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6], [-0.4]])
b2 = np.array([0.0])
# Forward pass
Z1 = X @ W1.T + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
y_hat = sigmoid(Z2).flatten()
loss = -np.mean(y * np.log(y_hat + 1e-8) + (1-y) * np.log(1-y_hat + 1e-8))
print("Predictions:", np.round(y_hat, 3))
print("BCE Loss:", round(loss, 4))

Predictions: [0.536 0.478 0.501 0.491 0.521]
BCE Loss: 0.7302
Every prediction sits near 0.5, so the network is essentially guessing. The second sample (x = [0.9, 0.8], y=1) churned and should score high but gets 0.478; the first sample (x = [0.5, 0.1], y=0) did not churn and should score low but gets 0.536. The overall BCE of 0.7302 is worse than always predicting 0.5, which gives BCE = ln 2 ≈ 0.693 regardless of the labels. The network has not learned yet: these are initial weights, not trained ones.
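The comparison against a constant predictor can be made concrete with a short sketch. The helper `mean_bce` and the second baseline (always predicting the observed churn rate) are assumptions added for illustration.

```python
import numpy as np

y = np.array([0, 1, 0, 1, 0])

def mean_bce(y, y_hat):
    """Average binary cross-entropy over all samples (natural log)."""
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Baseline 1: always predict 0.5. The BCE is ln(2) whatever the labels are.
bce_half = mean_bce(y, np.full(len(y), 0.5))

# Baseline 2: always predict the churn rate observed in the labels (2 of 5).
base_rate = y.mean()
bce_rate = mean_bce(y, np.full(len(y), base_rate))

print(f"constant 0.5:       {bce_half:.4f}")
print(f"constant base rate: {bce_rate:.4f}")
```

Both baselines score below the 0.7302 reported for the untrained network, so a useful first milestone for training is simply to beat them.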
What Learning Means
Training is the search for values of W1, b1, W2, b2 that minimize the average BCE loss across all five training samples. The parameter space for this network has 2×2 + 2 + 2×1 + 1 = 9 dimensions, so the loss surface cannot be drawn directly. Plotting the loss against one or two weights while holding the rest fixed gives a low-dimensional slice, and near a minimum that slice typically looks like a bowl: high where the weights are wrong, low near the minimum.
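One slice of the loss surface can be traced by sweeping a single weight. The sketch below counts the parameters and then varies W2[0,0] with everything else held at the initial values. The helper `loss_with_w2_00` and the sweep range of -2 to 2 are assumptions made for illustration.

```python
import numpy as np

X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6], [-0.4]])
b2 = np.array([0.0])

# Parameter count: every entry of every weight matrix and bias vector.
n_params = W1.size + b1.size + W2.size + b2.size
print("parameters:", n_params)  # 4 + 2 + 2 + 1 = 9

def loss_with_w2_00(value):
    """Mean BCE when W2[0, 0] is set to `value` and all other weights stay fixed."""
    W2_mod = W2.copy()
    W2_mod[0, 0] = value
    a1 = np.maximum(0.0, X @ W1.T + b1)
    y_hat = (1 / (1 + np.exp(-(a1 @ W2_mod + b2)))).flatten()
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Sweep one weight to trace a 1-D slice through the 9-dimensional loss surface.
for value in np.linspace(-2.0, 2.0, 9):
    print(f"W2[0,0] = {value:+.1f}  loss = {loss_with_w2_00(value):.4f}")
```

A slice like this only shows how the loss responds to one weight at the current values of the other eight. The shape changes as soon as any of them moves.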
Each training iteration moves the current position one step in the direction that reduces loss most quickly — downhill on this surface. The step size is the learning rate. Repeat for enough iterations and the weights converge to values that give good predictions.
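The lowest loss reachable on this surface is the best the network can do given its architecture. For this dataset that is very good: the two classes are linearly separable (both churners satisfy tenure_norm + monthly_charges_norm > 1 and no non-churner does), so the 9-parameter network can push the training loss arbitrarily close to zero. It never reaches exactly zero, because a sigmoid output only approaches 0 and 1 without attaining them.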
How the Network Adapts
After the forward pass computes a loss, the network needs to know: how much does each weight contribute to that loss, and in which direction should it change?
The answer flows backward through the network as an error signal — this is backpropagation. For every weight w, the chain rule gives ∂L/∂w: the rate at which the loss changes if w is nudged. A large positive value means "decrease w to reduce loss"; a large negative value means "increase w."
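The meaning of the gradient's sign can be checked numerically. The sketch below estimates ∂L/∂W2[0,0] with a central finite difference, which is a stand-in for backpropagation and not a replacement for it. The helper names and the step size `h` are assumptions.

```python
import numpy as np

X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6], [-0.4]])
b2 = np.array([0.0])

def mean_bce(W1, b1, W2, b2):
    """Forward pass over all samples, then average binary cross-entropy."""
    a1 = np.maximum(0.0, X @ W1.T + b1)
    y_hat = (1 / (1 + np.exp(-(a1 @ W2 + b2)))).flatten()
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def numerical_grad_w2_00(h=1e-5):
    """Central-difference estimate of dL/dW2[0,0]: nudge the weight both ways."""
    W2_plus, W2_minus = W2.copy(), W2.copy()
    W2_plus[0, 0] += h
    W2_minus[0, 0] -= h
    return (mean_bce(W1, b1, W2_plus, b2) - mean_bce(W1, b1, W2_minus, b2)) / (2 * h)

grad = numerical_grad_w2_00()
direction = "decrease" if grad > 0 else "increase"
print(f"dL/dW2[0,0] ~= {grad:.5f} -> {direction} W2[0,0] to reduce loss")
```

Finite differences need two forward passes per weight, so they only make sense as a check on tiny networks. Backpropagation gets every gradient from a single backward pass.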
Once those gradients are computed, each weight gets updated by a small step:
w_new = w_old − learning_rate × ∂L/∂w
For example, after one update step with learning rate 0.1, W2[0,0] would move from 0.6 to approximately 0.6 − 0.1 × (∂L/∂W2[0,0]). The exact gradient requires the backpropagation calculation, covered in full in 04-backpropagation.mdx. The conceptual picture here is enough to see why the forward pass matters: every number computed during it (z1₁, a1₁, z1₂, a1₂, z2, ŷ) will be reused during the backward pass to compute the corresponding gradient.
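To see the update rule applied repeatedly, the sketch below runs plain gradient descent on all nine parameters. It swaps in finite-difference gradients for backpropagation, purely so the example stays self-contained. The flat parameter layout, the helper names, the step size `h`, and the choice of 500 iterations are assumptions.

```python
import numpy as np

X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0])

# All 9 parameters in one flat vector: W1 (4), b1 (2), W2 (2), b2 (1).
theta0 = np.array([0.5, -0.2, 0.3, 0.8, 0.1, -0.1, 0.6, -0.4, 0.0])

def loss(theta):
    """Mean BCE of the 2-2-1 network for the flat parameter vector `theta`."""
    W1 = theta[0:4].reshape(2, 2)
    b1 = theta[4:6]
    W2 = theta[6:8].reshape(2, 1)
    b2 = theta[8:9]
    a1 = np.maximum(0.0, X @ W1.T + b1)
    y_hat = (1 / (1 + np.exp(-(a1 @ W2 + b2)))).flatten()
    y_hat = np.clip(y_hat, 1e-8, 1 - 1e-8)  # keep log() finite
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def numerical_gradient(theta, h=1e-5):
    """Central-difference estimate of dL/dtheta, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * h)
    return grad

learning_rate = 0.1
theta = theta0.copy()
for _ in range(500):
    # w_new = w_old - learning_rate * dL/dw, applied to every parameter at once
    theta = theta - learning_rate * numerical_gradient(theta)

print(f"loss before: {loss(theta0):.4f}")
print(f"loss after:  {loss(theta):.4f}")
```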
Related Concepts
The perceptron is the degenerate case of an ANN: one neuron, one layer, no hidden representation. Dot products and the way activation functions (ReLU, sigmoid) transform their inputs are the prerequisites for the forward pass computation above. From here, the natural next steps are backpropagation (how gradients flow backward), loss functions (other choices beyond BCE and when to use them), gradient descent (step-size strategy), and regularization (preventing overfitting when training data is small).
Honest Limitations
With fewer than approximately 500 training samples, a 2-hidden-layer ANN tends to overfit the training set while performing poorly on new data. At 5 samples, the network in this post cannot generalize at all — use logistic regression or add dropout for small datasets.
The Universal Approximation Theorem guarantees that a solution exists in weight space, not that gradient descent will find it. Local minima, saddle points, and poor weight initialization can all trap training at a suboptimal loss value — especially in networks deeper than two hidden layers.
The forward pass cost scales as O(L·h²) per sample, or O(L·n·h²) for a batch of n samples, where L is the number of layers and h is the hidden dimension. For a large network (h=4096, L=96) that is roughly 1.6×10⁹ multiply-adds per sample and about 1.6×10¹⁵ over a dataset of n=10⁶ samples, which is why such workloads run as batched matrix multiplications on GPUs rather than sequentially on a CPU.
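The scaling estimate can be turned into a back-of-the-envelope calculator. This is a rough sketch: the function name `forward_multiply_adds` is an assumption, and it treats every layer as an h × h matrix multiply while ignoring biases, activations, and the input and output layers.

```python
def forward_multiply_adds(num_layers, hidden_dim, num_samples=1):
    """Rough multiply-add count for a stack of dense layers of equal width.

    Assumes each layer is a hidden_dim x hidden_dim matrix multiply and
    ignores biases, activations, and the input/output layers.
    """
    per_sample = num_layers * hidden_dim * hidden_dim
    return per_sample * num_samples

per_sample = forward_multiply_adds(num_layers=96, hidden_dim=4096)
full_dataset = forward_multiply_adds(num_layers=96, hidden_dim=4096, num_samples=10**6)
print(f"per sample:    {per_sample:.2e}")    # about 1.6e9
print(f"whole dataset: {full_dataset:.2e}")  # about 1.6e15
```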
Test Your Understanding
- Conceptual: Why does removing all activation functions from an ANN reduce it to a linear model, regardless of how many layers it has? What algebraic property causes this?
- Applied: Compute the hidden activations for the third sample in the dataset, x = [0.2, 0.3], using the same weights W1 = [[0.5, −0.2], [0.3, 0.8]], b1 = [0.1, −0.1]. What are z1₁, a1₁, z1₂, and a1₂?
- Applied: The output neuron produced ŷ = 0.536 for x = [0.5, 0.1]. If you increase W2[0,0] from 0.6 to 0.8 (holding everything else fixed), will ŷ increase or decrease? By how much, approximately? Work through the output layer calculation.
- Conceptual: The Universal Approximation Theorem says one hidden layer suffices. Why do modern architectures use 10, 50, or 100 layers instead of one very wide hidden layer?
- Edge case: Suppose all neurons in the hidden layer produce z < 0 for every training sample (so all ReLU outputs are 0). What happens to the gradient with respect to W1 during backpropagation, and why can the network never escape this state without intervention?