Leaky ReLU and Parametric ReLU
ReLU kills neurons. When z ≤ 0 for every input a neuron encounters, its output is zero and its gradient is zero — the weight never moves again. The fix is simple: instead of zero for negative z, allow a small nonzero slope. The neuron might not contribute much to the forward pass, but it still receives a gradient signal and can recover if the input distribution shifts.
Leaky ReLU applies a fixed small slope α = 0.01 for negative z. Parametric ReLU (PReLU) goes further: it learns the slope α during training, one value per channel.
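To make the dead-neuron mechanism concrete, here is a minimal sketch of a single neuron whose pre-activation is negative for every input in a toy batch. The inputs, weight, bias, and upstream gradient are made-up values for illustration, not taken from any network in this post. Setting `alpha=0.0` reproduces plain ReLU.

```python
import numpy as np

# Hypothetical single neuron: z = x * w + b, with a bias negative enough
# that z < 0 for every input in this toy batch (illustrative values only).
x = np.array([0.2, 0.5, 0.9, 1.4])
w, b = 0.5, -2.0
z = x * w + b

# Assumed gradient arriving from the next layer, one value per sample.
upstream = np.ones_like(z)

def weight_grad(z, x, upstream, alpha):
    """dL/dw for a = z if z > 0 else alpha * z (alpha = 0 gives plain ReLU)."""
    local_slope = np.where(z > 0, 1.0, alpha)
    return np.sum(upstream * local_slope * x)

relu_grad_w = weight_grad(z, x, upstream, alpha=0.0)
leaky_grad_w = weight_grad(z, x, upstream, alpha=0.01)

print("all z negative:", bool(np.all(z < 0)))
print("ReLU dL/dw:", relu_grad_w)
print("Leaky ReLU dL/dw:", leaky_grad_w)
```

With ReLU the weight gradient is exactly zero, so no update can ever move this neuron. With the small negative slope the gradient is tiny but nonzero.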
Anchor z values: {−3, −1, 0, 0.5, 1, 3}.
Leaky ReLU
LeakyReLU(z) = z if z > 0 else αz, with α = 0.01 by default.
The gradient is:
LeakyReLU'(z) = 1 if z > 0 else α
For z = −1: gradient = 0.01. Not 0. The neuron is not dead — it receives a small update each step.
| z | ReLU(z) | LeakyReLU(z, α=0.01) | LeakyReLU'(z) |
|---|---|---|---|
| −3 | 0.0000 | −0.0300 | 0.01 |
| −1 | 0.0000 | −0.0100 | 0.01 |
| 0 | 0.0000 | 0.0000 | 0.01* |
| 0.5 | 0.5000 | 0.5000 | 1.00 |
| 1 | 1.0000 | 1.0000 | 1.00 |
| 3 | 3.0000 | 3.0000 | 1.00 |
*The derivative is undefined at z = 0; in practice it is set to α at the boundary.
Gradient Recovery in the 5-Layer Network
Using the same 5-layer anchor from the vanishing gradient post, but with all neurons having z < 0 (worst case for ReLU) and Leaky ReLU with α = 0.01, w = 0.1:
ReLU at each dead layer: δ₁ = 1.0 × 0 × 0 × 0 × 0 = 0 — completely dead.
Leaky ReLU: Each layer contributes 0.1 × 0.01 = 0.001:
δ₃ = 1.0 × 0.001 = 0.001
δ₂ = 0.001 × 0.001 = 1.0 × 10⁻⁶
δ₁ = 1.0 × 10⁻⁶ × 0.001 = 1.0 × 10⁻⁹
Still very small — but not zero. The gradient is 1.0 × 10⁻⁹ instead of 0. The neuron can wake up if the input distribution shifts (e.g., another neuron's weight change causes z to eventually become positive for some inputs). With ReLU dead neurons, recovery is impossible.
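The arithmetic above can be reproduced in a few lines. This is a sketch under the same simplification as the text: every layer multiplies the incoming gradient by the weight (0.1) times the local activation slope, and the helper name `backprop_chain` is made up for this example.

```python
alpha = 0.01   # Leaky ReLU negative slope
w = 0.1        # weight on every connection, as in the text

def backprop_chain(delta, w, slope, n_layers):
    """Return the gradient after each of n_layers backward steps,
    where each step multiplies by w * slope."""
    history = []
    for _ in range(n_layers):
        delta = delta * w * slope
        history.append(delta)
    return history

# Slope alpha for Leaky ReLU, slope 0 for ReLU (all neurons have z < 0).
leaky = backprop_chain(1.0, w, alpha, n_layers=3)
relu = backprop_chain(1.0, w, 0.0, n_layers=3)

for name, l_val, r_val in zip(["delta_3", "delta_2", "delta_1"], leaky, relu):
    print(f"{name}: leaky = {l_val:.1e}, relu = {r_val:.1e}")
```

The Leaky ReLU column shrinks by a factor of 0.001 per layer, while the ReLU column is zero from the first step onward.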
Parametric ReLU (PReLU)
PReLU uses the same formula but treats α as a learnable parameter:
PReLU(z) = z if z > 0 else αz, with α initialized to 0.25 and updated by backpropagation.
The gradient with respect to z is the same as Leaky ReLU: 1 for z > 0, α for z ≤ 0.
The gradient with respect to α is new. For a sample where z ≤ 0:
∂L/∂α = ∂L/∂a × ∂a/∂α = δ × z
where δ is the gradient arriving from the next layer and z is the pre-activation for this neuron. α is updated by gradient descent: α ← α − η × δ × z.
This means the network learns how much gradient to allow through negative activations — different channels can learn different α values depending on what works best for that feature.
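A minimal NumPy sketch of the PReLU forward and backward pass follows. It assumes inputs of shape (batch, channels) with one α per channel, and it sums the per-sample term δ × z over the batch to get each channel's gradient. The summation, the function names, and the toy values are assumptions for illustration.

```python
import numpy as np

def prelu_forward(z, alpha):
    """z: (batch, channels); alpha: (channels,), one slope per channel."""
    return np.where(z > 0, z, alpha * z)

def prelu_backward(z, alpha, delta):
    """Return (dL/dz, dL/dalpha) given the upstream gradient delta = dL/da."""
    dz = delta * np.where(z > 0, 1.0, alpha)
    # Only samples with z <= 0 contribute delta * z; sum over the batch.
    dalpha = np.sum(delta * np.where(z > 0, 0.0, z), axis=0)
    return dz, dalpha

# Toy batch: channel 0 is always negative, channel 1 is always positive.
z = np.array([[-3.0, 0.5],
              [-1.0, 3.0]])
alpha = np.array([0.25, 0.25])   # initial value from the text
delta = np.ones_like(z)          # assumed upstream gradient
lr = 0.1                         # assumed learning rate

a = prelu_forward(z, alpha)
dz, dalpha = prelu_backward(z, alpha, delta)
alpha_new = alpha - lr * dalpha

print("a =", a)
print("dL/dalpha =", dalpha)
print("updated alpha =", alpha_new)
```

Channel 1 never sees a negative pre-activation, so its α receives no gradient and stays at 0.25. Channel 0 gets a gradient of −4 and its α moves to 0.65 after one step.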
| z | LeakyReLU(α=0.01) | PReLU(α=0.25) | PReLU(α=0.5) |
|---|---|---|---|
| −3 | −0.0300 | −0.750 | −1.500 |
| −1 | −0.0100 | −0.250 | −0.500 |
| 0 | 0.0000 | 0.000 | 0.000 |
| 0.5 | 0.5000 | 0.500 | 0.500 |
| 1 | 1.0000 | 1.000 | 1.000 |
| 3 | 3.0000 | 3.000 | 3.000 |
For positive z, all three are identical. The difference is entirely in how negative activations are handled.
PReLU tends to outperform Leaky ReLU on large datasets, where there is enough signal to tune the extra parameter meaningfully. He et al. (2015) reported that PReLU reduced top-1 error on ImageNet by about 1.2 percentage points compared to ReLU on a 14-layer convolutional model (this result predates ResNet). On small datasets, PReLU can overfit: the learned α may be little more than noise.
Hyperparameter Sensitivity: Leaky ReLU α
```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Identity for z > 0, slope alpha otherwise
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative: 1 for z > 0, alpha otherwise (including z == 0)
    return np.where(z > 0, 1.0, alpha)

# Churn network anchor
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def train_network(alpha, epochs=200, lr=0.1, seed=42):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (2, 2))
    b1 = np.zeros(2)
    W2 = rng.normal(0, 0.1, (2, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass
        Z1 = X @ W1 + b1
        A1 = leaky_relu(Z1, alpha)
        Z2 = A1 @ W2 + b2
        A2 = sigmoid(Z2).flatten()
        loss = -np.mean(y * np.log(A2 + 1e-8) + (1 - y) * np.log(1 - A2 + 1e-8))
        losses.append(loss)
        # Backward pass (dA2 is the gradient w.r.t. Z2 for sigmoid + cross-entropy)
        dA2 = (A2 - y) / len(y)
        dW2 = A1.T @ dA2.reshape(-1, 1)
        db2 = dA2.sum()
        dA1 = dA2.reshape(-1, 1) @ W2.T
        dZ1 = dA1 * leaky_relu_grad(Z1, alpha)
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0)
        # Gradient descent step
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return losses

print(f"{'α':>8} | {'Loss@10':>8} | {'Loss@50':>8} | {'Loss@200':>9} | Notes")
print("-" * 65)
for alpha in [0.001, 0.01, 0.1, 0.3]:
    losses = train_network(alpha)
    if alpha < 0.005:
        note = "still nearly dead for large neg z"
    elif alpha > 0.2:
        note = "nearly linear — loses non-linearity"
    else:
        note = "stable"
    print(f"{alpha:>8.3f} | {losses[9]:>8.4f} | {losses[49]:>8.4f} | {losses[199]:>9.4f} | {note}")
```

```
       α |  Loss@10 |  Loss@50 |  Loss@200 | Notes
-----------------------------------------------------------------
   0.001 |   0.7021 |   0.6847 |    0.6732 | still nearly dead for large neg z
   0.010 |   0.6894 |   0.5821 |    0.4103 | stable
   0.100 |   0.6831 |   0.5634 |    0.3897 | stable
   0.300 |   0.6745 |   0.5501 |    0.4218 | nearly linear — loses non-linearity
```

α = 0.001 barely helps: the gradient for any negative z (z = −3, for example) is only 0.001, and the neuron contributes almost nothing. α = 0.3 trains but loses some of the non-linear representational capacity that makes ReLU useful in the first place. α = 0.01 is the standard default and performs well.
Side-by-Side Comparison
| Property | ReLU | Leaky ReLU | PReLU |
|---|---|---|---|
| Dead neurons | Yes | No | No |
| Extra parameters | 0 | 0 | 1 per channel |
| Gradient at z < 0 | 0 | α = 0.01 (fixed) | α (learned) |
| Compute cost | Fastest | +1 multiply | +backprop through α |
| When to use | Default start | Dead ReLU observed | Large datasets, can tune |
Code: ReLU vs Leaky ReLU Forward Values
```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z_vals = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])

# Header label kept outside the f-string: a backslash-escaped quote inside
# an f-string expression is a syntax error before Python 3.12.
grad_label = "Leaky'"

print(f"{'z':>5} | {'ReLU':>6} | {'Leaky(0.01)':>12} | {grad_label:>8}")
print("-" * 42)
for z in z_vals:
    relu_v = max(0.0, z)
    lrelu_v = float(leaky_relu(z))       # np.where returns a 0-d array for scalars
    lrelu_g = float(leaky_relu_grad(z))
    print(f"{z:>5.1f} | {relu_v:>6.4f} | {lrelu_v:>12.4f} | {lrelu_g:>8.2f}")
```

```
    z |   ReLU |  Leaky(0.01) |   Leaky'
------------------------------------------
 -3.0 | 0.0000 |      -0.0300 |     0.01
 -1.0 | 0.0000 |      -0.0100 |     0.01
  0.0 | 0.0000 |       0.0000 |     0.01
  0.5 | 0.5000 |       0.5000 |     1.00
  1.0 | 1.0000 |       1.0000 |     1.00
  3.0 | 3.0000 |       3.0000 |     1.00
```

Related Concepts
Where this builds from: ReLU's dead neuron problem is the direct motivation. When z ≤ 0 permanently, ReLU gradient = 0 and the neuron never recovers. Leaky ReLU is the minimal fix — change that 0 to α.
Where this leads: ELU is the smooth alternative — instead of a kink at z = 0 and a linear negative region, ELU uses an exponential curve that asymptotes to −α. This gives better optimization behavior at the cost of computing exp(). He initialization is also designed to work with these ReLU variants.
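For a quick feel of how the two negative regions differ, here is a small sketch using the standard ELU definition, α(eᶻ − 1) for z ≤ 0. Note that ELU's α is a separate parameter from the Leaky ReLU slope. The default of 1.0 and the sample z values are assumptions for illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    # For z <= 0: alpha * (exp(z) - 1), which flattens out toward -alpha.
    # np.minimum keeps exp() from overflowing on large positive z.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, -3.0, -1.0, 0.0, 0.5, 3.0])
elu_vals = elu(z)
leaky_vals = leaky_relu(z)

for zi, e, l in zip(z, elu_vals, leaky_vals):
    print(f"z = {zi:>5.1f} | ELU = {e:>8.4f} | Leaky = {l:>8.4f}")
```

ELU saturates near −1 for very negative z, while Leaky ReLU keeps decreasing linearly. For positive z the two are identical.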
Honest Limitations
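Leaky ReLU still has a kink at z = 0. The derivative is discontinuous at the boundary: it jumps from α to 1. The function is therefore not continuously differentiable, a property that some optimization methods assume. ELU with α = 1 removes the jump, because its slope approaches 1 from both sides of zero.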
PReLU adds parameters that can overfit. On a dataset with 1,000 examples and a network with 512 channels, PReLU adds 512 extra parameters — one per channel. Each one needs enough gradient signal to learn a meaningful value. On small datasets, α converges to noise.
Large α loses non-linearity. As α approaches 1.0, Leaky ReLU becomes approximately linear (output ≈ z for all z). A network whose activations are all linear is equivalent to one with no hidden layers at all, because the composition of linear functions is linear. Keep α small.
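The collapse to a linear model can be checked numerically. The sketch below uses a bias-free two-layer network with random weights (shapes and seed chosen arbitrarily for illustration) and compares it to the single matrix `W1 @ W2`.

```python
import numpy as np

def leaky_relu(z, alpha):
    return np.where(z > 0, z, alpha * z)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # toy inputs
W1 = rng.normal(size=(3, 4))   # hidden layer weights (no bias, for simplicity)
W2 = rng.normal(size=(4, 2))   # output layer weights

def two_layer(X, alpha):
    return leaky_relu(X @ W1, alpha) @ W2

# A single linear map with the two weight matrices multiplied together.
collapsed = X @ (W1 @ W2)

gap_alpha_1 = np.max(np.abs(two_layer(X, 1.0) - collapsed))
gap_alpha_001 = np.max(np.abs(two_layer(X, 0.01) - collapsed))

print(f"alpha = 1.00: max difference from linear model = {gap_alpha_1:.2e}")
print(f"alpha = 0.01: max difference from linear model = {gap_alpha_001:.2e}")
```

At α = 1 the difference is floating-point rounding only, so the hidden layer adds nothing. At α = 0.01 the outputs differ substantially, which is the non-linearity doing its job.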
Test Your Understanding
- A Leaky ReLU neuron with α = 0.01 has z = −5 for a given input. Its incoming gradient from the next layer is δ = 0.8. Compute the gradient that flows backward through this neuron and the weight update Δw for a weight w connected to input x = 0.3, with learning rate 0.01.
- PReLU learns α per channel. In layer 2 of a CNN with 64 channels, PReLU adds 64 extra parameters. After training, you inspect the learned α values and find that 50 of the 64 channels have α ≈ 0.01 while 14 have α ≈ 0.4. What does this tell you about the data distribution those 14 channels are responding to?
- Leaky ReLU with α = 0.5 and standard ReLU both process the same 4-layer network with w = 0.5 at every connection. Compute the gradient at layer 1 for a sample where every neuron has z < 0. For ReLU it is 0; for Leaky ReLU at α = 0.5, what is it?
- A team trains a network with Leaky ReLU and observes that after 50 epochs, 80% of neurons in layer 3 have z < 0 for all training inputs. Is this a problem with Leaky ReLU, or is something else wrong? What would you investigate?
- Consider a Leaky ReLU with α → 1. The activation approaches linear: output = z for all z. A 3-hidden-layer network with all linear activations is equivalent to a single linear layer (composition of linear functions = linear). At what value of α does this become a practical concern, and how would you detect it in practice without inspecting α directly?