ReLU Activation Function
Sigmoid and tanh both saturate — for large |z|, their derivatives collapse toward zero and gradients die. ReLU sidesteps this entirely for positive values: the gradient is exactly 1. No compression, no attenuation. For z > 0, whatever gradient arrives from the output passes through unchanged.
This one property — constant gradient for positive activations — is why deep networks became trainable.
Anchor z values: {−3, −1, 0, 0.5, 1, 3}.
The Formula
ReLU(z) = max(0, z)
That is the entire function. For positive z, output is z. For negative z, output is 0.
| z | ReLU(z) | ReLU'(z) |
|---|---|---|
| −3 | 0.0000 | 0 |
| −1 | 0.0000 | 0 |
| 0 | 0.0000 | 0* |
| 0.5 | 0.5000 | 1 |
| 1 | 1.0000 | 1 |
| 3 | 3.0000 | 1 |
*Technically undefined at z=0, but set to 0 in practice with no training impact.
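Written out piecewise, the function and the derivative values in the table look like this. The value at z = 0 is the convention from the footnote, not a true derivative.

```latex
\[
\operatorname{ReLU}(z) = \max(0, z) =
\begin{cases}
z & \text{if } z > 0 \\
0 & \text{if } z \le 0
\end{cases}
\qquad
\operatorname{ReLU}'(z) =
\begin{cases}
1 & \text{if } z > 0 \\
0 & \text{if } z < 0 \\
0 \ \text{(by convention)} & \text{if } z = 0
\end{cases}
\]
```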
Why ReLU Fixed the Vanishing Gradient Problem
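Recall from the vanishing gradient post: with sigmoid (max derivative 0.25) and w = 0.1, a gradient shrinks by a factor of at most 0.025 per layer (0.1 × 0.25). After the three backward steps in that example it is 0.025³ ≈ 1.56 × 10⁻⁵ of its original size, which is 64,000× smaller.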
With ReLU, for any active neuron (z > 0), the derivative is exactly 1. The gradient is multiplied by 1 — not by 0.25, not by anything less than 1. Gradients can flow through active ReLU neurons without shrinking at all.
Using the same 5-layer network from the vanishing gradient post (w = 0.1 per connection), but replacing sigmoid with ReLU:
If every hidden neuron is active (z > 0), each layer contributes a factor of (w × ReLU'(z)) = 0.1 × 1.0 = 0.1.
δ₃ = 1.0 × 0.1 × 1.0 = 0.1
δ₂ = 0.1 × 0.1 × 1.0 = 0.01
δ₁ = 0.01 × 0.1 × 1.0 = 0.001
First-layer gradient: 0.001 vs sigmoid's 1.56 × 10⁻⁵, a 64× improvement from just swapping the activation. With larger weights (w = 0.5, closer to the scale He initialization gives for small layers), the gradient at layer 1 would be 0.5³ = 0.125. The only attenuation left comes from the weights, not from the activation.
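The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the post's simplified one-weight-per-layer chain, not a real backward pass; the helper name `backprop_chain` and its parameters are made up for illustration.

```python
def backprop_chain(weight, activation_grad, steps, upstream=1.0):
    """Return the gradient after each backward step.

    Each step multiplies the incoming gradient by weight * activation_grad,
    which is the simplified scalar model used in this post.
    """
    deltas = []
    delta = upstream
    for _ in range(steps):
        delta = delta * weight * activation_grad
        deltas.append(delta)
    return deltas


relu_deltas = backprop_chain(weight=0.1, activation_grad=1.0, steps=3)
sigmoid_deltas = backprop_chain(weight=0.1, activation_grad=0.25, steps=3)

print("ReLU deltas   :", relu_deltas)
print("sigmoid deltas:", sigmoid_deltas)
print("ReLU / sigmoid at layer 1:", relu_deltas[-1] / sigmoid_deltas[-1])
print("ReLU with w = 0.5:", backprop_chain(0.5, 1.0, 3)[-1])
```

The ratio comes out at about 64 because the only difference between the two chains is the factor 0.25 applied three times.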
The improvement comes entirely from removing the 0.25 multiplier. In a 20-layer network, sigmoid gives 0.25²⁰ = 9 × 10⁻¹³. ReLU gives 1.0²⁰ = 1.0 (times whatever weight product accumulates). That is the difference between learning and not learning.
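In the same one-weight-per-layer chain, the layer-1 gradient splits into a weight product and an activation-derivative product. The notation here (f for the activation, L for the number of backward steps, δ_out for the gradient arriving from the output) is an assumption chosen to match the δ values used earlier.

```latex
\[
\delta_1 = \delta_{\text{out}} \prod_{l=1}^{L} w_l \, f'(z_l),
\qquad
\prod_{l=1}^{L} \sigma'(z_l) \le 0.25^{L},
\qquad
\prod_{l=1}^{L} \operatorname{ReLU}'(z_l) = 1 \quad \text{if every } z_l > 0 .
\]
\[
L = 20: \quad 0.25^{20} \approx 9 \times 10^{-13}
\]
```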
The Dead ReLU Problem
ReLU's gradient is 1 for z > 0 and exactly 0 for z < 0. A neuron where z is always negative never produces output and never receives a gradient. Its weights never update. The neuron is "dead."
Concrete example: A neuron with weights w = [0.01, 0.01] and bias b = −10. For any input x in the typical [0, 1] range:
z = 0.01 × x₁ + 0.01 × x₂ − 10 ≈ −10 for all reasonable inputs.
ReLU(−10) = 0. ReLU'(−10) = 0. No gradient flows. The weights never move. The bias never moves. The neuron is dead from initialization.
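To see that the weights really do not move, here is a minimal sketch that runs plain gradient descent on this one neuron. The learning rate, the number of steps, and the upstream gradient of 1 for every sample are all assumptions made for the demo.

```python
import numpy as np


def relu_grad(z):
    return (z > 0).astype(float)


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 2))      # inputs in [0, 1], as in the example
w = np.array([0.01, 0.01])
b = -10.0
lr = 0.1                             # hypothetical learning rate
upstream = np.ones(len(X))           # hypothetical dL/da = 1 for every sample

w_start, b_start = w.copy(), b
for _ in range(1000):
    z = X @ w + b
    local = upstream * relu_grad(z)  # zero wherever z <= 0
    grad_w = X.T @ local / len(X)    # dL/dw, averaged over the batch
    grad_b = local.mean()            # dL/db
    w = w - lr * grad_w
    b = b - lr * grad_b

print("weights unchanged:", np.array_equal(w, w_start))
print("bias unchanged:", b == b_start)
```

Because ReLU'(z) is zero for every sample, both gradients are exactly zero at every step, no matter how large the upstream gradient is.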
In practice, anywhere from 10% to 40% of neurons can die during training when the learning rate is too high or weights are poorly initialized; the exact share depends on the architecture and the data. With a learning rate of 0.1 and random normal initialization, it is common to see training plateau mid-epoch as neurons die progressively.
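A toy illustration of how a large learning rate kills a healthy neuron in a single update. Everything here is assumed for the demo: the starting weights, an upstream gradient of 1 for every sample, and a deliberately exaggerated learning rate of 2.0. The helper `active_fraction_after_step` is hypothetical.

```python
import numpy as np


def active_fraction_after_step(lr, X, upstream=1.0):
    """Take one SGD step on a single ReLU neuron; return how often it still fires."""
    w = np.array([0.5, 0.5])         # hypothetical healthy starting weights
    b = 0.1
    z = X @ w + b
    local = upstream * (z > 0)       # dL/dz for each sample
    w = w - lr * (X.T @ local) / len(X)
    b = b - lr * local.mean()
    return float(((X @ w + b) > 0).mean())


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 2))

print("lr = 0.1 -> active on", active_fraction_after_step(0.1, X), "of inputs")
print("lr = 2.0 -> active on", active_fraction_after_step(2.0, X), "of inputs")
```

With the small step the neuron keeps firing. With the large step the bias lands at −1.9 and the weights turn negative, so z < 0 for every input in [0, 1] and the neuron cannot recover.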
Sparse Activation — A Feature, Not Just a Side Effect
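For any given input, roughly half the neurons in a ReLU layer output zero (those with z ≤ 0), assuming pre-activations are roughly symmetric around zero, as they are at initialization. This sparse activation has two benefits:
Efficiency: Sparse outputs can be stored and multiplied more efficiently. Matrix multiplications involving many zeros can be faster than dense ones when the kernel or hardware exploits the sparsity.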
Interpretability: Different inputs activate different subsets of neurons. This creates a sparse code where the active neurons form a compressed signature for that input — analogous to how the brain uses sparse firing patterns to represent concepts.
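The "roughly half" figure is easy to check on a random layer. This is a sketch under stated assumptions: standard normal inputs, zero biases, He-style weight scaling, and layer sizes picked arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, fan_in, n_neurons = 1000, 64, 256    # hypothetical sizes

X = rng.standard_normal((n_inputs, fan_in))
W = rng.standard_normal((fan_in, n_neurons)) * np.sqrt(2.0 / fan_in)
A = np.maximum(0, X @ W)                       # ReLU activations, zero bias

zero_fraction = (A == 0).mean()
print(f"fraction of zero activations: {zero_fraction:.3f}")

# Different inputs switch on different subsets of neurons.
active_sets = A > 0
overlap = (active_sets[0] & active_sets[1]).sum() / n_neurons
print(f"neurons active for both input 0 and input 1: {overlap:.3f}")
```

In a trained network the zero fraction can drift well away from one half, since biases and learned weights shift the pre-activation distribution.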
Code
```python
import numpy as np


def relu(z):
    """ReLU: pass positive values through, clamp everything else to 0."""
    return np.maximum(0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, else 0 (including z = 0 by convention)."""
    return (z > 0).astype(float)


z_vals = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])

# Label kept outside the f-string: a backslash inside an f-string
# expression is a syntax error before Python 3.12.
grad_label = "ReLU'(z)"
print(f"{'z':>5} | {'ReLU(z)':>8} | {grad_label:>9}")
print("-" * 32)
for z in z_vals:
    print(f"{z:>5.1f} | {relu(z):>8.4f} | {relu_grad(z):>9.1f}")

# Dead neuron demo: one neuron evaluated on 100 random inputs
print("\n--- Dead Neuron Demo ---")
rng = np.random.default_rng(42)
w = np.array([0.01, 0.01])
b = -10.0
X_test = rng.uniform(0, 1, (100, 2))
z_dead = X_test @ w + b
active_pct = (z_dead > 0).mean() * 100  # share of inputs with z > 0
print(f"Weights: {w}, Bias: {b}")
print(f"Min z: {z_dead.min():.2f}, Max z: {z_dead.max():.2f}")
print(f"Active neurons: {active_pct:.1f}% (0% = fully dead)")
```

```
    z |  ReLU(z) |  ReLU'(z)
--------------------------------
 -3.0 |   0.0000 |       0.0
 -1.0 |   0.0000 |       0.0
  0.0 |   0.0000 |       0.0
  0.5 |   0.5000 |       1.0
  1.0 |   1.0000 |       1.0
  3.0 |   3.0000 |       1.0

--- Dead Neuron Demo ---
Weights: [0.01 0.01], Bias: -10.0
Min z: -10.00, Max z: -9.98
Active neurons: 0.0% (0% = fully dead)
```

Zero percent active. The neuron is dead for every input in the test range, so its weights never update.
When to Use ReLU
Use ReLU when:
- You need a default for hidden layers in feedforward networks and CNNs (try it first before considering alternatives)
- Computational budget is tight: max(0, z) is among the cheapest nonlinearities available
Do not use ReLU when:
- You observe the dead neuron problem: training loss stalls and stays stuck (switch to Leaky ReLU)
- You are choosing an output-layer activation: use sigmoid for binary classification, softmax for multiclass, linear for regression
- Input features have large negative values and your learning rate is high: this combination kills neurons early
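One way to check for the dead neuron problem from the list above is to count the neurons that never fire on a batch. The helper `dead_fraction` and the toy layer below are hypothetical; in a real network you would pass in pre-activations captured from the layer you suspect.

```python
import numpy as np


def dead_fraction(pre_activations):
    """Share of neurons whose pre-activation is <= 0 for every sample.

    pre_activations: array of shape (n_samples, n_neurons).
    """
    never_fires = (pre_activations <= 0).all(axis=0)
    return float(never_fires.mean())


# Hypothetical layer: 4 neurons, the last two pushed far negative by their biases.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 3))
W = rng.uniform(0.1, 1.0, (3, 4))
b = np.array([0.0, 0.0, -10.0, -10.0])

print("dead fraction:", dead_fraction(X @ W + b))
```

A neuron that is silent on one batch is not necessarily dead for good, so it is worth measuring over a large, representative sample.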
Honest Limitations
Dead neurons. When z ≤ 0 for all inputs a neuron will ever see, it outputs zero and receives zero gradient, so its weights will never move again. As rough figures, a learning rate of 0.01 can kill about 10% of neurons and 0.1 up to 40%; the actual share depends on architecture, initialization, and data. The Leaky ReLU variant (next post) prevents this by letting a small gradient pass for z < 0.
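As a preview, here is a minimal sketch of the Leaky ReLU idea. The slope `alpha = 0.01` is an assumed, commonly used value, not something fixed by this post.

```python
import numpy as np


def leaky_relu(z, alpha=0.01):
    """Like ReLU for z > 0, but scales negative inputs by alpha instead of zeroing them."""
    return np.where(z > 0, z, alpha * z)


def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for z > 0 and alpha otherwise, so it is never exactly zero."""
    return np.where(z > 0, 1.0, alpha)


z = np.array([-10.0, -1.0, 0.5, 3.0])
print("leaky_relu     :", leaky_relu(z))
print("leaky_relu_grad:", leaky_relu_grad(z))
```

Even at z = −10 the gradient is alpha rather than 0, so a neuron stuck in the negative region still receives a small update.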
Not zero-centered. All ReLU outputs are ≥ 0. Like sigmoid, this creates the same-sign gradient problem for upstream weights. Batch normalization compensates by re-centering the distribution of activations before the next layer.
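The shift is easy to see numerically. The sketch below uses random pre-activations and a simplified stand-in for batch normalization: per-feature standardization only, with no learnable scale or shift and no running statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 8))   # hypothetical pre-activations
a = np.maximum(0, z)                 # ReLU outputs are all >= 0

# Simplified batch normalization: standardize each feature over the batch.
eps = 1e-5
a_norm = (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

print("mean before:", a.mean(axis=0).round(3))
print("mean after :", a_norm.mean(axis=0).round(3))
```

Before normalization every feature has a clearly positive mean; afterwards each one is centered at zero.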
Unbounded output. ReLU has no upper limit. A neuron with z = 1000 outputs 1000. In a deep network without batch normalization, this can cause activations to explode layer by layer. He initialization is designed specifically for ReLU to prevent this at the start of training.
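A rough experiment shows the effect of the weight scale on activation size. This is a sketch with assumed settings (10 layers, width 256, no biases, random inputs), comparing He initialization, which uses a standard deviation of sqrt(2 / fan_in), against unit-variance weights. The helper `final_rms` is hypothetical.

```python
import numpy as np


def final_rms(weight_std_fn, n_layers=10, width=256, seed=0):
    """Push a random batch through n_layers of ReLU; return the RMS of the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0, a @ W)
    return float(np.sqrt((a ** 2).mean()))


he_rms = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))   # He initialization
naive_rms = final_rms(lambda fan_in: 1.0)                  # unit-variance weights

print(f"He init   : RMS activation {he_rms:.3g}")
print(f"std = 1.0 : RMS activation {naive_rms:.3g}")
```

With He scaling the activations stay at roughly the same order of magnitude as the input, while unit-variance weights make them grow by orders of magnitude over the same ten layers.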
Related Concepts
Where this builds from: The vanishing gradient post showed that sigmoid's derivative (max 0.25) causes exponential decay in deep networks. ReLU's derivative = 1 for z > 0 is the direct fix. Tanh improved on sigmoid but still saturated — ReLU eliminates saturation for positive activations.
Where this leads: Leaky ReLU and PReLU address the dead neuron problem. He initialization is designed specifically for ReLU — its scaling factor accounts for ReLU zeroing out half the neurons. Batch normalization is often used with ReLU in very deep networks to address the unbounded output and not-zero-centered limitations.
Test Your Understanding
- In a 10-layer ReLU network with w = 0.3 at every connection, and assuming all neurons are active (z > 0), compute the gradient magnitude at layer 1 if the output gradient is 1.0. How does this compare to the same calculation with sigmoid (max derivative 0.25)?
- A neuron in layer 3 has weights w = [0.5, −0.8, 0.3] and bias b = 0.0. The input is x = [0.4, 0.6, 0.1]. Compute z and ReLU(z). Is this neuron active? What would its gradient be during backpropagation?
- You train a ReLU network with learning rate 0.5 and observe training loss drop sharply for 3 epochs then completely plateau. What is the most likely cause? What would you check first, and what would you change?
- ReLU(z) = max(0, z) is not differentiable at z = 0. In practice, all deep learning frameworks set the gradient to 0 at z = 0. Argue why this choice (rather than 1 or 0.5) is harmless in practice, even though it is technically arbitrary.
- A network alternates ReLU and sigmoid layers: input → ReLU → sigmoid → ReLU → sigmoid → output. Trace the gradient through the sigmoid layers only. Does the vanishing gradient problem reappear? What does this tell you about mixing activation functions in a single network?