ELU Activation Function
Leaky ReLU fixed the dead neuron problem by allowing a small gradient for z < 0, but it kept ReLU's kink at z = 0: the derivative jumps from α to 1 at that point. ELU (Exponential Linear Unit) addresses both problems: no dead neurons, and, with the default α = 1, a curve whose slope is continuous everywhere, including at z = 0.
The tradeoff is computation: ELU requires an exponential for negative z. When accuracy matters more than speed, ELU is often the better choice.
Anchor z values: {−3, −1, 0, 0.5, 1, 3} with α = 1.0 (default).
The Formula
ELU(z) = z if z > 0 else α(eᶻ − 1)
For positive z, identical to ReLU. For negative z, an exponential curve that approaches −α asymptotically.
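To make the asymptote concrete, here is a minimal NumPy sketch that evaluates the formula at increasingly negative z. The helper name `elu` and the probe values are illustrative choices for this post, not a library API.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU as defined above: z for z > 0, alpha * (exp(z) - 1) otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

# Probe values chosen for illustration: outputs approach -alpha without reaching it.
probes = np.array([-1.0, -5.0, -10.0, -20.0])
for alpha in (1.0, 2.0):
    out = elu(probes, alpha)
    gap = out + alpha  # distance above the asymptote, equal to alpha * exp(z)
    print(f"alpha={alpha}: outputs={np.round(out, 6)}, gap to -alpha={gap}")
```

The gap to −α shrinks by a factor of e for every unit decrease in z, which is why the negative side flattens out so quickly.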
Computing ELU for the Six Anchor Values (α = 1.0)
| z | Region | Computation | ELU(z) |
|---|---|---|---|
| −3 | z < 0 | 1.0 × (e⁻³ − 1) = 1.0 × (0.0498 − 1) | −0.9502 |
| −1 | z < 0 | 1.0 × (e⁻¹ − 1) = 1.0 × (0.3679 − 1) | −0.6321 |
| 0 | boundary | 1.0 × (e⁰ − 1) = 1.0 × 0 | 0.0000 |
| 0.5 | z > 0 | identity | 0.5000 |
| 1 | z > 0 | identity | 1.0000 |
| 3 | z > 0 | identity | 3.0000 |
At z = 0, both branches agree: the linear branch gives 0, and α(e⁰ − 1) = 0, so the function is continuous there. ReLU and Leaky ReLU are continuous at z = 0 as well; what jumps for them is the slope (from 0 to 1 for ReLU, from α to 1 for Leaky ReLU). With α = 1, ELU's slope does not jump.
The Gradient
ELU'(z) = 1 if z > 0 else ELU(z) + α = αeᶻ
For z > 0: gradient = 1 (same as ReLU). For z < 0: gradient = α × eᶻ.
| z | ELU(z) | ELU'(z) |
|---|---|---|
| −3 | −0.9502 | e⁻³ = 0.0498 |
| −1 | −0.6321 | e⁻¹ = 0.3679 |
| 0 | 0.0000 | e⁰ = 1.0000 |
| 0.5 | 0.5000 | 1.0000 |
| 1 | 1.0000 | 1.0000 |
| 3 | 3.0000 | 1.0000 |
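The gradient values in the table can be cross-checked against a finite-difference estimate. The sketch below is one way to do that; `central_diff` and the step size `h` are helper choices made for this example, and z = 0 is left out because a symmetric difference straddles the boundary.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def elu_grad(z, alpha=1.0):
    # Reuses the forward value on the negative side: ELU(z) + alpha == alpha * exp(z).
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

def central_diff(f, z, h=1e-6):
    """Symmetric finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-3.0, -1.0, 0.5, 1.0, 3.0])  # anchor values away from the z = 0 boundary
alpha = 1.0
analytic = elu_grad(z, alpha)
numeric = central_diff(lambda t: elu(t, alpha), z)
max_err = np.max(np.abs(analytic - numeric))

print("analytic:", np.round(analytic, 4))
print("numeric: ", np.round(numeric, 4))
print("max abs error:", max_err)
```

Computing the negative-side gradient as ELU(z) + α means a backward pass can reuse the stored forward activation instead of calling exp() a second time.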
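At z = 0: ELU'(0) = α × e⁰ = α, which equals 1 for the default α = 1. The gradient is then continuous at the boundary, with no kink. (For α ≠ 1 the slope still jumps from α to 1.) A smooth activation gives a smoother loss surface, which can help gradient-based optimizers, although the size of the effect depends on the model.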
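A quick numerical check of the boundary behaviour: the sketch below estimates the slope just left and just right of z = 0 for Leaky ReLU and for ELU. The helper `one_sided_slopes` and the step size are choices made for this example, and the α = 2.0 case is included to show that the match at zero depends on α = 1.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def one_sided_slopes(f, h=1e-6):
    """Finite-difference slopes just left and just right of z = 0."""
    left = (f(0.0) - f(-h)) / h
    right = (f(h) - f(0.0)) / h
    return float(left), float(right)

cases = {
    "Leaky ReLU (slope 0.01)": lambda z: leaky_relu(z, 0.01),
    "ELU (alpha = 1.0)": lambda z: elu(z, 1.0),
    "ELU (alpha = 2.0)": lambda z: elu(z, 2.0),
}
for name, f in cases.items():
    left, right = one_sided_slopes(f)
    print(f"{name:<24} left={left:.4f} right={right:.4f} jump={right - left:.4f}")
```

Only the α = 1.0 row should show a jump close to zero; the other two rows differ between the left and right slopes.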
For z = −3: gradient = 0.0498. Still small, but about five times Leaky ReLU's 0.01 at the same z. The advantage reverses for z below ln(0.01) ≈ −4.6, where αeᶻ drops under 0.01.
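The comparison depends on how negative z is. As a small worked example (the variable names and the extra probe values are illustrative), the snippet below finds the z at which ELU's gradient αeᶻ equals Leaky ReLU's constant 0.01 and prints the ratio on either side of it.

```python
import math

alpha = 1.0          # ELU alpha used throughout this post
leaky_slope = 0.01   # Leaky ReLU negative-side slope quoted above

# alpha * exp(z) equals leaky_slope at z = ln(leaky_slope / alpha).
crossover = math.log(leaky_slope / alpha)
print(f"crossover z = {crossover:.3f}")

for z in (-1.0, -3.0, -5.0, -7.0):
    elu_g = alpha * math.exp(z)
    ratio = elu_g / leaky_slope
    print(f"z={z:>5.1f}  ELU'={elu_g:.4f}  Leaky'={leaky_slope:.4f}  ratio={ratio:.2f}")
```

Above the crossover ELU passes back more gradient; below it, Leaky ReLU's constant slope is the larger of the two.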
Why ELU is Better Than Leaky ReLU
1. Smooth at z = 0 (continuous derivative). Leaky ReLU's derivative jumps from α to 1 at z = 0. With α = 1, ELU's derivative approaches 1 from both sides: e⁰ = 1, so the slope on the negative side is also 1 at z = 0. Gradients therefore change gradually as a pre-activation crosses zero, which may make training with Adam or RMSProp somewhat more stable.
2. Negative saturation pushes mean activation toward zero. ELU's output asymptotes to −α for large negative z, so negative outputs are bounded below by −α while positive outputs are unbounded. The mean activation of a ReLU layer is positive because its outputs are never negative. The mean of an ELU layer, whose negative outputs offset some of the positive ones, is closer to zero. This resembles the zero-centering benefit of tanh, without saturation on the positive side.
3. Self-normalizing property (SELU). With a specific α (≈ 1.6733) and a scaling factor λ (≈ 1.0507), ELU networks with proper weight initialization converge toward mean = 0 and standard deviation = 1 per layer automatically, without batch normalization. This variant is called SELU (Scaled ELU), introduced by Klambauer et al. (2017). For most networks, standard ELU with α = 1.0 is sufficient.
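Points 2 and 3 can be checked numerically. The sketch below assumes unit-Gaussian pre-activations, which is a simplification rather than a property of any real network, and compares the mean output of ReLU, ELU, and SELU. It then pushes random inputs through a stack of SELU layers with LeCun-normal weights (standard deviation 1/√fan_in). The function `propagate_selu` and the layer sizes are made up for this illustration.

```python
import numpy as np

# SELU constants from Klambauer et al. (2017), to more digits than quoted above.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    # Clamp before expm1 so large positive z never reaches the exponential.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def selu(z):
    return SELU_LAMBDA * elu(z, SELU_ALPHA)

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # assumption: unit-Gaussian pre-activations

for name, f in [("ReLU", relu), ("ELU", elu), ("SELU", selu)]:
    a = f(z)
    print(f"{name:<5} mean={a.mean():+.3f} std={a.std():.3f}")

def propagate_selu(depth=16, width=256, batch=512, seed=0):
    """Push Gaussian inputs through `depth` random SELU layers; return (mean, std)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        # LeCun-normal weights (std = 1/sqrt(fan_in)), zero biases.
        w = rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
        x = selu(x @ w)
    return x.mean(), x.std()

mean, std = propagate_selu()
print(f"after 16 SELU layers: mean={mean:+.3f} std={std:.3f}")
```

Under the unit-Gaussian assumption the means have closed forms: 1/√(2π) ≈ 0.40 for ReLU and 1/√(2π) + e^{1/2}Φ(−1) − 1/2 ≈ 0.16 for ELU with α = 1, so the sampled values should land near those numbers.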
When ELU Outperforms ReLU
Clevert et al. (2015) reported that ELU converges faster and achieves lower test error than ReLU on CIFAR-10. The gain is more pronounced in deeper networks, where ReLU's positive mean activation acts as a bias shift for the next layer and slows training.
ELU is the better choice when:
- Training a deep network (> 10 layers) without batch normalization
- You are seeing slow convergence with ReLU and batch norm is not an option
- Accuracy matters more than inference speed
ReLU wins when:
- Compute budget is tight: ELU's exp() is roughly 3–5× slower than max(0, z), though the exact ratio depends on hardware and implementation
- Network is shallow enough that activation smoothness doesn't matter
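The speed gap is easy to measure on your own hardware. Here is a rough micro-benchmark sketch; the array size and repeat counts are arbitrary choices for this example. This NumPy `elu` evaluates exp() on every element because `np.where` computes both branches, so it likely overstates the cost compared with a fused kernel in a deep learning framework.

```python
import timeit
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000).astype(np.float32)  # arbitrary benchmark size

number = 5  # calls per timing run
# Best of several repeats reduces noise from other processes.
t_relu = min(timeit.repeat(lambda: relu(z), number=number, repeat=3)) / number
t_elu = min(timeit.repeat(lambda: elu(z), number=number, repeat=3)) / number

print(f"ReLU: {t_relu * 1e3:.3f} ms per call")
print(f"ELU:  {t_elu * 1e3:.3f} ms per call")
print(f"ELU / ReLU time ratio: {t_elu / t_relu:.1f}x")
```

Treat the printed ratio as specific to the machine and library version it ran on.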
α Hyperparameter Sensitivity
The script below trains a tiny 2-2-1 network on five points with three values of α and reports the loss at epochs 10, 100, and 200.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def elu_grad(z, alpha=1.0):
    # For z <= 0: ELU(z) + alpha == alpha * exp(z)
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Toy dataset: 5 samples, 2 features, binary labels
X = np.array([[0.5, 0.1], [0.9, 0.8], [0.2, 0.3], [0.7, 0.6], [0.4, 0.2]])
y = np.array([0, 1, 0, 1, 0], dtype=float)

def train(alpha, epochs=200, lr=0.05, seed=42):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (2, 2)); b1 = np.zeros(2)
    W2 = rng.normal(0, 0.1, (2, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: ELU hidden layer, sigmoid output
        Z1 = X @ W1 + b1; A1 = elu(Z1, alpha)
        Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2).flatten()
        # Binary cross-entropy; 1e-8 guards against log(0)
        loss = -np.mean(y * np.log(A2 + 1e-8) + (1-y) * np.log(1-A2 + 1e-8))
        losses.append(loss)
        # Backward pass. For sigmoid + cross-entropy the gradient with respect
        # to Z2 is (A2 - y) / n; the original name dA2 is kept.
        dA2 = (A2 - y) / len(y)
        dW2 = A1.T @ dA2.reshape(-1, 1); db2 = dA2.sum()
        dA1 = dA2.reshape(-1, 1) @ W2.T
        dZ1 = dA1 * elu_grad(Z1, alpha)
        dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)
        # Plain gradient-descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return losses

print(f"{'α':>5} | {'Loss@10':>8} | {'Loss@100':>9} | {'Loss@200':>9} | Notes")
print("-" * 60)
for alpha in [0.1, 1.0, 2.0]:
    losses = train(alpha)
    note = {0.1: "shallow negative (closer to ReLU)", 1.0: "default, balanced", 2.0: "aggressive negative region"}.get(alpha, "")
    print(f"{alpha:>5.1f} | {losses[9]:>8.4f} | {losses[99]:>9.4f} | {losses[199]:>9.4f} | {note}")
```

```
    α |  Loss@10 |  Loss@100 |  Loss@200 | Notes
------------------------------------------------------------
  0.1 |   0.6934 |    0.5712 |    0.4601 | shallow negative (closer to ReLU)
  1.0 |   0.6891 |    0.5438 |    0.4103 | default, balanced
  2.0 |   0.6855 |    0.5521 |    0.4289 | aggressive negative region
```

α = 1.0 is the default and reaches the lowest loss at epochs 100 and 200 in this run. α = 0.1 produces a shallow negative region: the activation behaves almost like ReLU, losing some of the mean-shift benefit. α = 2.0 makes the negative outputs larger in magnitude (down to −2 instead of −1), which can destabilize training when combined with a high learning rate.
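The effect of α on the activation itself can be separated from the training run. The sketch below again assumes unit-Gaussian pre-activations (a simplification) and reports, for each α in the sweep, the lower bound of the output, the slope just left of z = 0, and the mean activation. The `stats` dictionary is only a container for this example.

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)  # assumption: unit-Gaussian pre-activations

stats = {}
for alpha in (0.1, 1.0, 2.0):
    a = elu(z, alpha)
    stats[alpha] = {
        "lower_bound": -alpha,        # limit of ELU(z) as z -> -infinity
        "left_slope_at_0": alpha,     # alpha * exp(0); the right slope is always 1
        "mean_activation": float(a.mean()),
    }
    s = stats[alpha]
    print(f"alpha={alpha:.1f}  bound={s['lower_bound']:+.1f}  "
          f"left slope={s['left_slope_at_0']:.1f}  mean={s['mean_activation']:+.3f}")
```

Under this assumption a small α leaves the mean close to ReLU's, α = 1 pulls it much nearer zero, and α = 2 overshoots slightly into negative territory. Only α = 1 gives matching slopes on both sides of z = 0.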
Comparison Table
| Property | ReLU | Leaky ReLU | ELU |
|---|---|---|---|
| Dead neurons | Yes | No | No |
| Smooth at z=0 | No (kink) | No (kink) | Yes (continuous derivative when α = 1) |
| Negative region | 0 (flat) | αz (linear) | α(eᶻ−1) (smooth curve) |
| Zero-mean push | No | Partial | Yes (bounded below at −α) |
| Compute cost | Lowest | Low (+1 mult) | Medium (exp) |
Code: ELU Forward and Gradient
```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def elu_grad(z, alpha=1.0):
    # For z <= 0: ELU(z) + alpha == alpha * exp(z)
    return np.where(z > 0, 1.0, elu(z, alpha) + alpha)

z_vals = np.array([-3.0, -1.0, 0.0, 0.5, 1.0, 3.0])

# The label lives in a variable because a backslash-escaped quote inside an
# f-string expression is a SyntaxError before Python 3.12.
grad_label = "ELU'(z)"
print(f"{'z':>5} | {'ELU(z)':>8} | {grad_label:>9}")
print("-" * 30)
for z in z_vals:
    print(f"{z:>5.1f} | {float(elu(z)):>8.4f} | {float(elu_grad(z)):>9.4f}")
```

```
    z |   ELU(z) |   ELU'(z)
------------------------------
 -3.0 |  -0.9502 |    0.0498
 -1.0 |  -0.6321 |    0.3679
  0.0 |   0.0000 |    1.0000
  0.5 |   0.5000 |    1.0000
  1.0 |   1.0000 |    1.0000
  3.0 |   3.0000 |    1.0000
```

Note the gradient at z = 0 is 1.0, matching the gradient from the positive side. That continuity (which holds for α = 1) is the core difference from Leaky ReLU's kink.
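One practical caveat with the `np.where` formulation: NumPy evaluates both branches for every element, so `np.exp(z)` also runs on large positive z and can trigger overflow warnings. A possible hardening, sketched below with the hypothetical names `elu_stable` and `elu_grad_stable`, clamps the input to the exponential at zero and uses `np.expm1` for better precision near z = 0.

```python
import numpy as np

def elu_stable(z, alpha=1.0):
    """ELU that avoids overflow warnings and keeps precision for tiny |z|."""
    z = np.asarray(z, dtype=float)
    # Clamp before exponentiating: np.where evaluates both branches, so without
    # the clamp exp() would also run on large positive z and overflow.
    neg = alpha * np.expm1(np.minimum(z, 0.0))
    return np.where(z > 0, z, neg)

def elu_grad_stable(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-1e-12, -3.0, 0.0, 1000.0])
with np.errstate(over="raise"):  # any overflow would raise instead of warn
    out = elu_stable(z)
    grad = elu_grad_stable(z)

print(out)
print(grad)
```

The results match the simpler version wherever that version is well behaved; the difference only shows up at the extremes.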
Honest Limitations
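exp() is computationally expensive. For every negative activation, ELU requires a floating-point exponentiation. The cost scales with the number of activations rather than the number of parameters: a network like ResNet-50 (about 25 million parameters) evaluates millions of activations per image, and if roughly half of them are negative, the extra exponentials add up. ReLU is faster in practice and is preferred when throughput matters.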
α is one more hyperparameter. The default α = 1.0 is not always optimal; depending on the data, a smaller α (0.1–0.3) may work better, at the cost of reintroducing a slope jump at z = 0. ReLU has no such parameter to tune.
SELU is rarely used outside specific conditions. The self-normalizing property only holds with specific α, λ, and weight initialization. If you deviate from any of these conditions — dropout, non-standard initialization, batch normalization — the self-normalization breaks. ELU with α = 1.0 is practical; SELU requires careful setup.
Related Concepts
Where this builds from: Leaky ReLU solved the dead neuron problem but kept the kink at z = 0. ELU smooths that kink using an exponential curve for negative z. The negative saturation limit (ELU → −α) provides a zero-mean-push that Leaky ReLU only partially achieves.
Where this leads: SELU is the self-normalizing variant of ELU — it adds a scale factor λ that makes activations converge to mean=0, std=1 automatically. The activation guide post synthesizes all seven activations into a decision framework.
Test Your Understanding
- ELU'(z) = αeᶻ for z < 0. At z = −3, ELU'(−3) = e⁻³ ≈ 0.0498. Leaky ReLU'(−3) = 0.01. Which provides a larger gradient signal? In a 5-layer network where all neurons have z ≈ −3, compute the gradient at layer 1 for both activations (use w = 0.5, starting gradient δ = 1.0).
- The "smooth at z = 0" property means ELU has a continuous first derivative at z = 0. Show numerically that Leaky ReLU (α = 0.01) does NOT have a continuous derivative at z = 0 by computing the derivative from the left (z → 0⁻) and from the right (z → 0⁺).
- ELU with α = 1.0 has negative outputs in (−1, 0). In a hidden layer where 60% of neurons are active (z > 0) and 40% have z < 0, estimate the mean activation assuming active neurons have mean z = 0.8 and inactive neurons have ELU values averaging −0.5. How does this compare to a ReLU layer with the same statistics?
- The SELU variant uses α ≈ 1.6733 and a scale factor λ ≈ 1.0507. With these values, SELU networks self-normalize without batch normalization. If you add dropout (set neurons to 0 randomly), explain why the self-normalization property breaks. What assumption does SELU make that dropout violates?
- A practitioner replaces all ReLU layers with ELU in a production model. Inference speed drops by 30%. They ask whether there is a way to use ELU during training (for better optimization) and ReLU during inference (for speed). Is this possible? What would need to be true for the switch to not hurt accuracy?