Which Activation Function to Use When
You have now seen seven activation functions. Each one solved a specific problem: sigmoid added a probability interpretation, tanh added zero-centering, ReLU eliminated saturation for positive inputs, Leaky ReLU and PReLU fixed dead neurons, ELU smoothed the kink and pushed the mean toward zero, and softmax converted logits to a distribution over K classes. The question now is which one to reach for first, and why.
This post is a synthesis, not a tutorial. No new math. Clear, opinionated recommendations backed by the reasons already established in the series.
The Master Comparison Table
| Activation | Range | Zero-centered | Dead neurons | Vanishing gradient | Speed | Best use |
|---|---|---|---|---|---|---|
| Sigmoid | (0, 1) | ✗ | No | Severe (max σ'=0.25) | Slow (exp) | Binary output, LSTM gates |
| Tanh | (−1, 1) | ✓ | No | Moderate (max=1.0) | Slow (2×exp) | RNN hidden states |
| ReLU | [0, ∞) | ✗ | Yes (10–40%) | None for z>0 | Fastest (max op) | Default hidden layer |
| Leaky ReLU | (−∞, ∞) | ✗ | No | None | Fast (+1 mult) | Dead ReLU observed |
| PReLU | (−∞, ∞) | ✗ | No | None | Fast (+backprop) | Large datasets, tunable |
| ELU | (−α, ∞) | Partial | No | None | Medium (exp) | Deep nets, accuracy priority |
| Softmax | (0,1) sum=1 | N/A | No | N/A | Slow (K×exp) | Multiclass output layer |
Decision Flowchart
Layer-Type Recommendations
| Network type | Hidden layers | Output layer |
|---|---|---|
| Feedforward (classification) | ReLU | Sigmoid (binary) or Softmax (multiclass) |
| Feedforward (regression) | ReLU | None (linear) |
| CNN | ReLU | Softmax or Sigmoid |
| RNN (vanilla) | Tanh | Task-dependent |
| LSTM | Tanh + Sigmoid (built into gates) | Task-dependent |
| Deep network (>20 layers) | ELU or ReLU + BatchNorm | Task-dependent |
The LSTM row is not a choice: the gates use sigmoid (its output lies in (0, 1), so each gate acts as a soft switch) and the candidate memory uses tanh (zero-centered update). These are hard-coded into the LSTM architecture, not hyperparameters.
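To make those fixed roles concrete, here is a minimal NumPy sketch of one LSTM step in the standard formulation. The function name `lstm_step`, the stacked weight layout, and the toy sizes are assumptions chosen for illustration, not code from this series.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate, each entry in (0, 1)
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate memory, each entry in (-1, 1)
    c = f * c_prev + i * g         # gates scale, tanh supplies the signed update
    h = o * np.tanh(c)
    return h, c, (i, f, o, g)

rng = np.random.default_rng(0)
D, H = 3, 4                        # toy input and hidden sizes
W = rng.normal(scale=0.5, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c, (i, f, o, g) = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print("gates in (0, 1):", bool(np.all((i > 0) & (i < 1))))
print("candidate in (-1, 1):", bool(np.all(np.abs(g) < 1)))
```

Swapping the gate activation for something unbounded would break the "fraction to keep" meaning of `f`, `i`, and `o`, which is why these are not treated as tunable.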
Common Mistakes
Mistake 1: Sigmoid in hidden layers.
The max derivative of sigmoid is 0.25. After 5 layers: 0.25⁵ ≈ 0.001. Ignoring the weight matrices, the gradient reaching the first layer is at least 1000× smaller than at the output, so first-layer weights learn roughly 1000× slower than output-layer weights. Diagnosis: training loss stalls and the early layers' weight magnitudes barely change after 100 epochs. Fix: replace hidden sigmoid with ReLU. Keep sigmoid only at the binary output neuron.
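The arithmetic behind this estimate is easy to check. The sketch below computes only the product of activation derivatives in the best case (every unit sitting at z = 0); it deliberately ignores the weight matrices, which is a simplifying assumption.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)       # the derivative peaks at z = 0

# Upper bound on the product of sigmoid derivatives across `depth` layers
for depth in (1, 3, 5, 10):
    bound = max_grad ** depth
    print(f"{depth:>2} sigmoid layers: derivative product <= {bound:.2e}")
```

Any unit away from z = 0 contributes a factor smaller than 0.25, so real networks sit below this bound.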
Mistake 2: ReLU with training stuck.
If training loss drops for 2–3 epochs and then flatlines completely, check for dead neurons. Use a hook to log the fraction of neurons outputting zero on each mini-batch. If >50% of a layer's neurons are always zero, dead neurons are the cause. Fix: lower the learning rate, switch to Leaky ReLU (α = 0.01), or use He initialization to prevent neurons from dying at initialization.
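A minimal sketch of the dead-neuron check, written in plain NumPy rather than against any particular framework's hook API. The helper name `dead_fraction` and the toy layer are hypothetical; three units are forced dead with a large negative bias purely for the demo.

```python
import numpy as np

def dead_fraction(activations):
    """Fraction of units whose post-ReLU output is zero for every sample.

    activations: array of shape (batch_size, n_units).
    """
    always_zero = np.all(activations == 0, axis=0)
    return float(always_zero.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 20))     # toy mini-batch
W = rng.normal(size=(20, 10))
b = np.zeros(10)
b[:3] = -100.0                     # force three of ten units dead for the demo
a = np.maximum(0, x @ W + b)       # ReLU layer output

print(f"dead units in this batch: {dead_fraction(a):.0%}")
```

A unit that is zero on one mini-batch is not necessarily dead, so in practice you would track this number across many batches before acting on it.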
Mistake 3: Softmax for multi-label classification.
Softmax forces all outputs to sum to 1, so a sample cannot be 80% cat and 70% dog simultaneously. Logits that would give those two probabilities under independent sigmoids (about 1.39 and 0.85) come out of softmax as roughly 63% and 37%. For multi-label tasks (an image can contain both a cat and a dog), apply sigmoid independently to each output neuron and threshold at 0.5 per class.
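The difference shows up directly when the same two logits go through both functions. In this sketch the logits are chosen (as an assumption, for illustration) so that independent sigmoids give exactly 0.8 for cat and 0.7 for dog.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

# Inverse sigmoid of 0.8 and 0.7: log(p / (1 - p))
logits = np.log(np.array([0.8 / 0.2, 0.7 / 0.3]))

per_label = sigmoid(logits)        # multi-label: each class judged on its own
exclusive = softmax(logits)        # multiclass: classes compete for probability mass

print("sigmoid:", per_label.round(3), "predicted labels:", per_label > 0.5)
print("softmax:", exclusive.round(3), "sum:", round(float(exclusive.sum()), 6))
```

With sigmoid both labels clear the 0.5 threshold; with softmax the two outputs must share a total of 1, so at most one of them can exceed 0.5.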
Mistake 4: No input normalization with tanh.
Tanh saturates when |z| > 2. Without normalizing inputs to mean≈0, standard deviation≈1, the weighted sum z = w·x + b will be large in magnitude and tanh will operate in its flat region. The gradient will be near zero from the first forward pass. Fix: normalize inputs before training (StandardScaler or BatchNorm at the input).
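To see the effect of normalization on tanh's gradient, the sketch below compares the average derivative for raw and standardized inputs. The feature scale (mean 50, standard deviation 20), the random weights, and the omitted bias are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
x_raw = rng.normal(loc=50.0, scale=20.0, size=(1000, 8))    # unnormalized features
x_std = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)    # mean 0, std 1 per feature

w = rng.normal(scale=1.0 / np.sqrt(8), size=8)   # hypothetical weights, bias omitted

def mean_tanh_grad(x):
    z = x @ w
    return float(np.mean(1.0 - np.tanh(z) ** 2))   # d/dz tanh(z) = 1 - tanh(z)^2

g_raw, g_std = mean_tanh_grad(x_raw), mean_tanh_grad(x_std)
print(f"mean tanh'(z), raw inputs:          {g_raw:.4f}")
print(f"mean tanh'(z), standardized inputs: {g_std:.4f}")
```

The manual standardization here does the same job as StandardScaler: it keeps z near the steep part of the curve, where the derivative is close to 1.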
Code: Activation Comparison
import numpy as np

def sigmoid(z): return 1/(1+np.exp(-np.clip(z, -500, 500)))  # clip so exp() cannot overflow
def tanh_act(z): return np.tanh(z)
def relu(z): return np.maximum(0, z)
def leaky(z): return np.where(z > 0, z, 0.01*z)  # negative-side slope α = 0.01
def elu(z): return np.where(z > 0, z, np.exp(z) - 1)  # ELU with α = 1

activations = {
    'sigmoid': sigmoid,
    'tanh': tanh_act,
    'relu': relu,
    'leaky': leaky,
    'elu': elu,
}

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
header = f"{'z':>5} | " + " | ".join(f"{k:>8}" for k in activations)
print(header)
print("-" * (len(header) + 2))
for zi in z:
    row = f"{zi:>5.1f} | "
    row += " | ".join(f"{fn(zi):>8.4f}" for fn in activations.values())
    print(row)

Output:

    z |  sigmoid |     tanh |     relu |    leaky |      elu
--------------------------------------------------------------
 -3.0 |   0.0474 |  -0.9951 |   0.0000 |  -0.0300 |  -0.9502
 -1.0 |   0.2689 |  -0.7616 |   0.0000 |  -0.0100 |  -0.6321
  0.0 |   0.5000 |   0.0000 |   0.0000 |   0.0000 |   0.0000
  1.0 |   0.7311 |   0.7616 |   1.0000 |   1.0000 |   1.0000
  3.0 |   0.9526 |   0.9951 |   3.0000 |   3.0000 |   3.0000

For z = −1: ReLU kills the activation entirely (0.0). Leaky passes a tiny −0.01 signal. ELU passes −0.632, a meaningful output that lets the next layer distinguish z=−1 from z=−3. For z > 0, all three are identical in behavior.
Every Recommendation Backed by a Reason
ReLU as default: max(0, z) is the cheapest nonlinearity (single comparison), gradient = 1 for z > 0 (no vanishing), and it trains faster than sigmoid/tanh in practice. Start here unless you have evidence of dead neurons.
Leaky ReLU when ReLU fails: If 20%+ of neurons in a layer output zero consistently, the dead neuron problem is active. Leaky ReLU's gradient of 0.01 for z < 0 is enough to keep neurons alive without changing the architecture.
ELU for deep networks without batch norm: ELU's smooth derivative and bounded negative region push activations closer to zero-mean, which reduces internal covariate shift in networks where batch normalization is not used.
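The zero-mean claim can be checked numerically. This sketch assumes pre-activations drawn from a standard normal distribution and α = 1 for ELU; under those assumptions the mean ReLU output is about 0.40 and the mean ELU output about 0.16.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive z
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0)) - 1))

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)       # assumed zero-mean, unit-variance pre-activations

relu_mean, elu_mean = relu(z).mean(), elu(z).mean()
print(f"mean ReLU output: {relu_mean:.3f}")
print(f"mean ELU output:  {elu_mean:.3f}")
```

ELU's negative outputs cancel part of the positive mass, which is what pulls the layer's mean activation toward zero without a normalization layer.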
Tanh for RNNs: Recurrent networks multiply activations through the same weight matrix at every timestep. ReLU's output is non-negative and unbounded, so any growth compounds multiplicatively over timesteps; tanh's bounded, symmetric range keeps the hidden state inside (−1, 1).
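A toy recurrence makes the compounding visible. The setup is an assumption chosen to isolate the effect: no external input, and a recurrent matrix of 1.5 times the identity so that every step applies a gain above 1.

```python
import numpy as np

def run_rnn(activation, W, h0, steps):
    """Iterate h_t = activation(W @ h_{t-1}) and record the norm of h_t."""
    h = h0.copy()
    norms = []
    for _ in range(steps):
        h = activation(W @ h)
        norms.append(float(np.linalg.norm(h)))
    return norms

W = 1.5 * np.eye(4)                # toy recurrent weights with gain > 1
h0 = np.full(4, 0.5)

relu_norms = run_rnn(lambda z: np.maximum(0, z), W, h0, steps=20)
tanh_norms = run_rnn(np.tanh, W, h0, steps=20)

print(f"ReLU |h| after 20 steps: {relu_norms[-1]:.1f}")
print(f"tanh |h| after 20 steps: {tanh_norms[-1]:.3f}")
```

With ReLU the hidden state grows by the same factor every step; with tanh each component stays below 1 in magnitude no matter how many steps are unrolled.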
Sigmoid only at output: The max gradient of 0.25 makes it unsuitable in hidden layers, where one such factor is multiplied in per layer. At the output layer the gradient passes through a single sigmoid, and when that sigmoid is paired with binary cross-entropy the σ′ factor cancels: the gradient with respect to the logit is simply ŷ − y. Sigmoid's probability interpretation is its only remaining advantage, and that advantage only applies at the output.
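A quick numerical check of how the gradient behaves at a sigmoid output trained with binary cross-entropy. With respect to the logit z, the derivative of the loss simplifies to ŷ − y, so a saturated sigmoid does not shrink it. The example values (z = 4, y = 0) are assumptions picked to show a confidently wrong prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy on the logit z for a label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 4.0, 0.0, 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # central difference
analytic = sigmoid(z) - y                                    # closed form: y_hat - y
sigma_prime = sigmoid(z) * (1 - sigmoid(z))                  # sigmoid's own derivative

print(f"dL/dz numeric:  {numeric:.4f}")
print(f"dL/dz analytic: {analytic:.4f}")
print(f"sigmoid'(z):    {sigma_prime:.4f}")
```

The sigmoid derivative at z = 4 is tiny, yet the loss gradient stays close to 1. That cancellation is specific to pairing sigmoid with cross-entropy; with a squared-error loss the small σ′ factor would remain.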
Softmax only for mutually exclusive multiclass: The sum-to-1 constraint is only meaningful when exactly one class is correct. For multi-label or open-set problems, it is an incorrect constraint that distorts the probabilities.
Related Concepts
Where this builds from: Every activation post in this section — sigmoid (02), tanh (03), ReLU (04), Leaky ReLU and PReLU (05), ELU (06), softmax (07) — established the specific trade-off for each function. This post synthesizes those trade-offs into a decision framework.
Where this leads: Loss function selection (the next section) has the same structure — different loss functions have different properties that suit different tasks. The architecture posts (CNN, RNN, LSTM) will revisit activation choices in the context of specific layer types.
Test Your Understanding
- A colleague proposes using ELU everywhere — hidden layers and output layer — to avoid the vanishing gradient problem. Explain specifically why ELU is a wrong choice for the output layer of a binary classifier and a multiclass classifier.
- You inherit a 6-layer feedforward network that uses tanh throughout. Training is slow and the first two layers have near-zero weight magnitudes after 50 epochs. Without changing the architecture, what single change would most likely fix the slow training? What if the architecture change is allowed?
- The decision flowchart places "dead neurons?" as the first check for hidden layers. If you detect dead neurons at epoch 10 and switch from ReLU to Leaky ReLU at that point, will the previously dead neurons revive? Explain what happens to those neurons' weights during the switch.
- PReLU learns one α per channel. In a CNN layer with 256 channels, PReLU adds 256 parameters. For a dataset with 500 training samples, should you use PReLU? Justify your answer numerically — how many samples per parameter does this add, and is that ratio favorable?
- A network using ReLU in all hidden layers and sigmoid at the output is trained to 98% accuracy on a binary classification task. A colleague suggests switching to ELU to "improve gradient flow." What would you check before making this change, and under what conditions would ELU actually improve the result?