Which Activation Function to Use When

Jul 1, 2026 · 8 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

You have now seen seven activation functions. Each one solved a specific problem: sigmoid added probability interpretation, tanh added zero-centering, ReLU eliminated saturation for positive activations, Leaky ReLU and PReLU fixed dead neurons, ELU smoothed the kink and pushed the mean toward zero, softmax converted logits to a distribution over K classes. The question now is which one to reach for first — and why.

This post is a synthesis, not a tutorial. No new math. Clear, opinionated recommendations backed by the reasons already established in the series.


The Master Comparison Table

| Activation | Range | Zero-centered | Dead neurons | Vanishing gradient | Speed | Best use |
|---|---|---|---|---|---|---|
| Sigmoid | (0, 1) | No | No | Severe (max σ' = 0.25) | Slow (exp) | Binary output, LSTM gates |
| Tanh | (−1, 1) | Yes | No | Moderate (max = 1.0) | Slow (2×exp) | RNN hidden states |
| ReLU | [0, ∞) | No | Yes (10–40%) | None for z > 0 | Fastest (max op) | Default hidden layer |
| Leaky ReLU | (−∞, ∞) | No | No | None | Fast (+1 mult) | Dead ReLU observed |
| PReLU | (−∞, ∞) | No | No | None | Fast (+backprop) | Large datasets, tunable |
| ELU | (−α, ∞) | Partial | No | None | Medium (exp) | Deep nets, accuracy priority |
| Softmax | (0, 1), sum = 1 | N/A | No | N/A | Slow (K×exp) | Multiclass output layer |

Decision Flowchart

Activation Function Selection Flowchart

- Is this an output layer?
  - Yes: what kind of output?
    - binary → Sigmoid
    - multiclass → Softmax
    - regression → Linear (none)
  - No (hidden layer): seeing dead neurons?
    - Yes: large dataset / tunable?
      - No → Leaky ReLU
      - Yes → PReLU / ELU
    - No dead neurons: accuracy over speed?
      - Yes → ELU
      - No → ReLU (default)
- RNN/sequence model? → Tanh for hidden states
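The same branches can be sketched as a small function. This is only an illustration of the flowchart: the name `choose_activation` and all of its parameters are hypothetical, and checking the recurrent case before the dead-neuron case is an assumption about ordering that the flowchart does not spell out.

```python
def choose_activation(layer, output_kind=None, recurrent=False,
                      dead_neurons=False, large_dataset=False,
                      accuracy_over_speed=False):
    """Return the activation the flowchart suggests (illustrative sketch only)."""
    if layer == "output":
        # Output layers are decided by the kind of prediction.
        return {"binary": "sigmoid",
                "multiclass": "softmax",
                "regression": "linear"}[output_kind]
    if recurrent:
        # RNN / sequence model: tanh for hidden states (ordering is an assumption).
        return "tanh"
    if dead_neurons:
        # Large dataset or a tunable slope -> PReLU / ELU, otherwise Leaky ReLU.
        return "prelu or elu" if large_dataset else "leaky_relu"
    # No dead neurons: trade accuracy against speed.
    return "elu" if accuracy_over_speed else "relu"


print(choose_activation("output", output_kind="multiclass"))  # softmax
print(choose_activation("hidden", dead_neurons=True))          # leaky_relu
print(choose_activation("hidden"))                             # relu
```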

Layer-Type Recommendations

| Network type | Hidden layers | Output layer |
|---|---|---|
| Feedforward (classification) | ReLU | Sigmoid (binary) or Softmax (multiclass) |
| Feedforward (regression) | ReLU | None (linear) |
| CNN | ReLU | Softmax or Sigmoid |
| RNN (vanilla) | Tanh | Task-dependent |
| LSTM | Tanh + Sigmoid (built into gates) | Task-dependent |
| Deep network (>20 layers) | ELU or ReLU + BatchNorm | Task-dependent |

The LSTM row is not a choice — the gates use sigmoid (must output in [0,1]) and the candidate memory uses tanh (zero-centered update). These are hard-coded into the LSTM architecture, not hyperparameters.
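To make the fixed roles concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the stacked weight layout, and the gate ordering (input, forget, output, candidate) are assumptions made for illustration; frameworks order and store these differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to four stacked pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate, values in (0, 1)
    f = sigmoid(z[1 * n:2 * n])   # forget gate, values in (0, 1)
    o = sigmoid(z[2 * n:3 * n])   # output gate, values in (0, 1)
    g = np.tanh(z[3 * n:4 * n])   # candidate memory, values in (-1, 1)
    c = f * c_prev + i * g        # gates scale, the candidate updates
    h = o * np.tanh(c)            # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_in = 3, 2             # toy sizes, chosen arbitrarily
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print(h, c)
```

Swapping the sigmoid in a gate for an unbounded function would let a gate scale the memory by more than 1 or by a negative amount, which is why these choices are not exposed as hyperparameters.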


Common Mistakes

Mistake 1: Sigmoid in hidden layers.

The max derivative of sigmoid is 0.25. After 5 layers: 0.25⁵ ≈ 0.001. Even in the best case, the activation derivatives alone shrink the first-layer gradient by about 1000× relative to the output layer, so first-layer weights learn roughly 1000× slower. Diagnosis: training loss stalls and the early layers' weights barely change after 100 epochs. Fix: replace hidden sigmoid with ReLU. Keep sigmoid only at the binary output neuron.
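The arithmetic is easy to check for other depths. The sketch below assumes every layer sits at sigmoid's best case (a derivative of exactly 0.25) and ignores the weight matrices, which also scale the gradient; the helper name `sigmoid_gradient_bound` is made up for this example.

```python
# Upper bound on the product of sigmoid derivatives across n layers.
# Assumption: each layer contributes its maximum derivative, 0.25.
MAX_SIGMOID_GRAD = 0.25

def sigmoid_gradient_bound(n_layers):
    return MAX_SIGMOID_GRAD ** n_layers

for n in (1, 3, 5, 10):
    print(f"{n:>2} layers: {sigmoid_gradient_bound(n):.2e}")
```

Five layers give about 9.8e-4, and ten layers give about 9.5e-7, so the problem gets worse quickly with depth.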

Mistake 2: ReLU with training stuck.

If training loss drops for 2–3 epochs and then flatlines completely, check for dead neurons. Use a hook to log the fraction of neurons outputting zero on each mini-batch. If >50% of a layer's neurons are always zero, dead neurons are the cause. Fix: lower the learning rate, switch to Leaky ReLU (α = 0.01), or use He initialization to prevent neurons from dying at initialization.
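The measurement itself is framework-independent. The sketch below assumes you already collect one layer's post-ReLU activations per mini-batch (for example from a forward hook, which is not shown here); `dead_fraction` is a hypothetical helper name and the toy activations are made up.

```python
import numpy as np

def dead_fraction(batches):
    """Fraction of units that output zero on every sample of every batch.

    `batches` is an iterable of arrays shaped (batch_size, n_units)
    holding the post-ReLU activations of a single layer.
    """
    ever_active = None
    for acts in batches:
        active = (acts > 0).any(axis=0)   # unit fired at least once in this batch
        ever_active = active if ever_active is None else (ever_active | active)
    return 1.0 - ever_active.mean()

# Toy data: 4 units, of which the second and fourth never fire.
batch_a = np.array([[0.5, 0.0, 1.2, 0.0],
                    [0.0, 0.0, 0.3, 0.0]])
batch_b = np.array([[0.1, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 2.0, 0.0]])
print(dead_fraction([batch_a, batch_b]))  # 0.5
```

Tracking units across batches matters: a unit that is zero on one mini-batch but fires on another is sparse, not dead.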

Mistake 3: Softmax for multi-label classification.

Softmax forces all outputs to sum to 1. Under that constraint a sample cannot be 80% cat and 70% dog simultaneously: rescaling those two scores so they sum to 1 gives roughly 53% and 47%. For multi-label tasks (an image can contain both a cat and a dog), apply sigmoid independently to each output neuron and threshold at 0.5 per class.
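A small NumPy comparison shows the difference on one set of logits. The logits and the three class names are hypothetical, chosen so that the independent sigmoid outputs for cat and dog are 0.8 and 0.7.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical logits for [cat, dog, car].
logits = np.array([np.log(0.8 / 0.2), np.log(0.7 / 0.3), -3.0])

per_class = sigmoid(logits)     # independent per-class probabilities
labels = per_class >= 0.5       # threshold each class separately
competing = softmax(logits)     # classes compete, outputs sum to 1

print("sigmoid:", np.round(per_class, 3), "labels:", labels)
print("softmax:", np.round(competing, 3), "sum:", competing.sum())
```

With sigmoid both cat and dog clear the 0.5 threshold. With softmax the same logits are forced to share a total of 1, so the dog probability falls below 0.5 even though the evidence for it has not changed.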

Mistake 4: No input normalization with tanh.

Tanh is close to saturated once |z| > 2 (its derivative is already below 0.1 there). Without normalizing inputs to mean≈0, standard deviation≈1, the weighted sum z = w·x + b is often large in magnitude and tanh operates in its flat region. The gradient is then near zero from the first forward pass. Fix: normalize inputs before training (StandardScaler or BatchNorm at the input).
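The effect is easy to reproduce with synthetic data. Everything in this sketch is made up for illustration: the feature scale (values around 50), the weight vector, and the sample count. The standardization line mirrors what StandardScaler computes per feature.

```python
import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
x_raw = rng.normal(loc=50.0, scale=20.0, size=(1000, 8))  # unnormalized features
w = rng.normal(scale=0.5, size=8)                         # hypothetical weights
b = 0.0

# Standardize each feature to mean 0 and standard deviation 1.
x_norm = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

stats = {}
for name, x in (("raw", x_raw), ("normalized", x_norm)):
    z = x @ w + b
    stats[name] = tanh_grad(z).mean()
    print(f"{name:>10}: mean |z| = {np.abs(z).mean():7.2f}, "
          f"mean tanh'(z) = {stats[name]:.4f}")
```

On the raw features most pre-activations land far outside the |z| < 2 window, so the average derivative is a small fraction of what the normalized inputs produce.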


Code: Activation Comparison

```python
import numpy as np

# Element-wise activations; each accepts a scalar or a NumPy array.
def sigmoid(z):    return 1/(1+np.exp(-np.clip(z, -500, 500)))  # clip avoids overflow in exp
def tanh_act(z):   return np.tanh(z)
def relu(z):       return np.maximum(0, z)
def leaky(z):      return np.where(z > 0, z, 0.01*z)            # negative slope α = 0.01
def elu(z):        return np.where(z > 0, z, np.exp(z) - 1)     # ELU with α = 1

activations = {
    'sigmoid': sigmoid,
    'tanh':    tanh_act,
    'relu':    relu,
    'leaky':   leaky,
    'elu':     elu,
}

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# Print one row per input value and one column per activation.
header = f"{'z':>5} | " + " | ".join(f"{k:>8}" for k in activations)
print(header)
print("-" * (len(header) + 2))
for zi in z:
    row = f"{zi:>5.1f} | "
    row += " | ".join(f"{fn(zi):>8.4f}" for fn in activations.values())
    print(row)
```

```text
    z |  sigmoid |     tanh |     relu |    leaky |      elu
--------------------------------------------------------------
 -3.0 |   0.0474 |  -0.9951 |   0.0000 |  -0.0300 |  -0.9502
 -1.0 |   0.2689 |  -0.7616 |   0.0000 |  -0.0100 |  -0.6321
  0.0 |   0.5000 |   0.0000 |   0.0000 |   0.0000 |   0.0000
  1.0 |   0.7311 |   0.7616 |   1.0000 |   1.0000 |   1.0000
  3.0 |   0.9526 |   0.9951 |   3.0000 |   3.0000 |   3.0000
```

For z = −1: ReLU kills the activation entirely (0.0). Leaky passes a tiny −0.01 signal. ELU passes −0.632 — meaningful output that lets the next layer distinguish z=−1 from z=−3. For z > 0, all three are identical in behavior.


Every Recommendation Backed by a Reason

ReLU as default: max(0, z) is the cheapest nonlinearity (single comparison), gradient = 1 for z > 0 (no vanishing), and it trains faster than sigmoid/tanh in practice. Start here unless you have evidence of dead neurons.

Leaky ReLU when ReLU fails: If 20%+ of neurons in a layer output zero consistently, the dead neuron problem is active. Leaky ReLU's gradient of 0.01 for z < 0 is enough to keep neurons alive without changing the architecture.
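The difference shows up directly in the derivatives. In this sketch the four pre-activation values are arbitrary, and the derivative at exactly z = 0 is set by convention to the negative-side value.

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

# Pre-activations of one unit on four samples (arbitrary values).
z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(z).tolist())        # [0.0, 0.0, 1.0, 1.0]
print(leaky_relu_grad(z).tolist())  # [0.01, 0.01, 1.0, 1.0]
```

A ReLU unit whose pre-activation is negative on every sample receives a zero gradient everywhere and its weights stop updating; the Leaky ReLU unit still receives a small gradient on those samples.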

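ELU for deep networks without batch norm: ELU's smooth derivative and bounded negative region push mean activations closer to zero, which reduces the shift in the inputs each layer passes to the next. That is the drift batch normalization would otherwise correct, so ELU helps most in networks where batch normalization is not used.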
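The mean shift can be measured directly. This sketch assumes zero-mean, unit-variance pre-activations and α = 1, neither of which is guaranteed in a real network.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # Clamp the exp argument so the unused positive branch cannot overflow.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)   # assumed pre-activation distribution

print(f"mean ReLU output: {relu(z).mean():.3f}")
print(f"mean ELU  output: {elu(z).mean():.3f}")
```

Under these assumptions the ReLU mean comes out near 0.40 and the ELU mean near 0.16: closer to zero, but not zero, which is consistent with calling ELU only partially zero-centered.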

Tanh for RNNs: Recurrent networks multiply activations through the same weight matrix at every timestep. ReLU's outputs are non-negative and unbounded, so any growth compounds multiplicatively over timesteps; tanh's bounded, symmetric range keeps hidden states in (−1, 1) and avoids this.

Sigmoid only at output: The max gradient of 0.25 makes it unsuitable in hidden layers, where the derivative factors from many layers multiply together. At the output layer the gradient passes through a single sigmoid, and when that sigmoid is paired with binary cross-entropy the σ' factor cancels, leaving ∂L/∂z = ŷ − y. Sigmoid's probability interpretation is its only remaining advantage, and that advantage only applies at the output.
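One way to see why the 0.25 ceiling does little harm at the output is to check the gradient numerically. The sketch assumes the sigmoid output is trained with binary cross-entropy, which this post does not state explicitly; with that pairing the gradient with respect to the logit z is ŷ − y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy of a sigmoid output, as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = -4.0, 1.0, 1e-6   # a confidently wrong prediction

numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # central difference
analytic = sigmoid(z) - y                                    # y_hat - y
sigma_prime = sigmoid(z) * (1 - sigmoid(z))                  # tiny in the saturated region

print(f"dL/dz numeric:  {numeric:.4f}")
print(f"dL/dz analytic: {analytic:.4f}")
print(f"sigmoid'(z):    {sigma_prime:.4f}")
```

Even though sigmoid's own derivative is tiny at z = −4, the loss gradient with respect to the logit stays close to −1, so a confidently wrong output still produces a strong learning signal.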

Softmax only for mutually exclusive multiclass: The sum-to-1 constraint is only meaningful when exactly one class is correct. For multi-label or open-set problems, it is an incorrect constraint that distorts the probabilities.


Where this builds from: Every activation post in this section — sigmoid (02), tanh (03), ReLU (04), Leaky ReLU and PReLU (05), ELU (06), softmax (07) — established the specific trade-off for each function. This post synthesizes those trade-offs into a decision framework.

Where this leads: Loss function selection (the next section) has the same structure — different loss functions have different properties that suit different tasks. The architecture posts (CNN, RNN, LSTM) will revisit activation choices in the context of specific layer types.


Test Your Understanding

  1. A colleague proposes using ELU everywhere — hidden layers and output layer — to avoid the vanishing gradient problem. Explain specifically why ELU is a wrong choice for the output layer of a binary classifier and a multiclass classifier.

  2. You inherit a 6-layer feedforward network that uses tanh throughout. Training is slow and the first two layers have near-zero weight updates after 50 epochs. Without changing the architecture, what single change would most likely fix the slow training? What if an architecture change is allowed?

  3. The decision flowchart places "dead neurons?" as the first check for hidden layers. If you detect dead neurons at epoch 10 and switch from ReLU to Leaky ReLU at that point, will the previously dead neurons revive? Explain what happens to those neurons' weights during the switch.

  4. PReLU learns one α per channel. In a CNN layer with 256 channels, PReLU adds 256 parameters. For a dataset with 500 training samples, should you use PReLU? Justify your answer numerically — how many samples per parameter does this add, and is that ratio favorable?

  5. A network using ReLU in all hidden layers and sigmoid at the output is trained to 98% accuracy on a binary classification task. A colleague suggests switching to ELU to "improve gradient flow." What would you check before making this change, and under what conditions would ELU actually improve the result?
