Tanh Activation Function
Sigmoid solves one problem, mapping z to a probability, but introduces another: every output is positive, which forces the gradients of all weights feeding a downstream neuron to share the same sign, so those weights all move in the same direction at each step. Tanh fixes this by centering its output at zero. It has the same S-shape, the same saturation, and the same basic behavior, just rescaled and shifted so that zero input produces zero output and the gradient problem from sigmoid's asymmetry disappears.
This post uses the same six anchor z values as the sigmoid post: z ∈ {−5, −2, 0, 0.5, 2, 5}.
The Formula
tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
It is algebraically equivalent to the bipolar sigmoid from the previous post. The connection to sigmoid:
Step 1: Start with σ(2z) = 1/(1 + e⁻²ᶻ)
Step 2: Multiply numerator and denominator by eᶻ: σ(2z) = eᶻ/(eᶻ + e⁻ᶻ)
Step 3: Then 2σ(2z) − 1 = 2eᶻ/(eᶻ + e⁻ᶻ) − (eᶻ + e⁻ᶻ)/(eᶻ + e⁻ᶻ) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ) = tanh(z) ✓
So tanh(z) = 2σ(2z) − 1. If you know sigmoid, you know tanh — it is a rescaled and shifted version.
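The identity can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `sigmoid` and the variable names are illustrative choices, not something defined earlier in the post.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid, 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# The six anchor values used throughout this post.
z_vals = np.array([-5.0, -2.0, 0.0, 0.5, 2.0, 5.0])

tanh_direct = np.tanh(z_vals)                          # library tanh
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * z_vals) - 1.0   # rescaled, shifted sigmoid

# Largest absolute disagreement between the two forms.
max_diff = np.max(np.abs(tanh_direct - tanh_via_sigmoid))
print(f"max |tanh(z) - (2*sigmoid(2z) - 1)| = {max_diff:.2e}")
```

The two forms should agree up to floating-point rounding, so the printed difference is expected to be many orders of magnitude below the four-decimal precision used in the tables.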
Computing tanh(z) for the Six Anchor Values
| z | eᶻ | e⁻ᶻ | eᶻ−e⁻ᶻ | eᶻ+e⁻ᶻ | tanh(z) |
|---|---|---|---|---|---|
| −5 | 0.0067 | 148.41 | −148.40 | 148.42 | −0.9999 |
| −2 | 0.1353 | 7.389 | −7.254 | 7.524 | −0.9640 |
| 0 | 1.0000 | 1.000 | 0.000 | 2.000 | 0.0000 |
| 0.5 | 1.6487 | 0.607 | 1.042 | 2.255 | 0.4621 |
| 2 | 7.389 | 0.1353 | 7.254 | 7.524 | 0.9640 |
| 5 | 148.41 | 0.0067 | 148.40 | 148.42 | 0.9999 |
The key difference from sigmoid: tanh(0) = 0, not 0.5. The output is symmetric around zero.
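The table columns can be rebuilt straight from the definition. This is a small sketch, assuming NumPy; the names `e_pos`, `e_neg`, and `tanh_manual` are made up for this example.

```python
import numpy as np

z_vals = np.array([-5.0, -2.0, 0.0, 0.5, 2.0, 5.0])

e_pos = np.exp(z_vals)    # e^z
e_neg = np.exp(-z_vals)   # e^-z
tanh_manual = (e_pos - e_neg) / (e_pos + e_neg)

for z, ep, en, t in zip(z_vals, e_pos, e_neg, tanh_manual):
    print(f"z={z:>4.1f}  e^z={ep:>9.4f}  e^-z={en:>9.4f}  tanh={t:>8.4f}")

# The hand-built values should agree with NumPy's tanh.
agrees = np.allclose(tanh_manual, np.tanh(z_vals))
print("matches np.tanh:", agrees)
```

The direct formula is only for illustration. For very large |z| the exponentials overflow in floating point (for example, `np.exp(1000.0)` evaluates to `inf`), whereas `np.tanh` handles such inputs.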
Why Zero-Centering Matters
Here is the gradient problem with sigmoid: during backpropagation, the gradient for weight wⱼ in a layer feeding into sigmoid is proportional to:
∂L/∂wⱼ = δ · xⱼ
where xⱼ is the activation from the previous layer and δ is the error signal at the neuron receiving those activations. If the previous layer used sigmoid, then xⱼ ∈ (0, 1), which is always positive. This means the gradient ∂L/∂wⱼ always has the same sign as δ. All incoming weights of that neuron update in the same direction: they either all increase or all decrease together.
If the true optimum requires w₁ to increase and w₂ to decrease simultaneously, sigmoid cannot do this in a single step. It zig-zags: first increases both (overshoots w₂), then decreases both (undershoots w₁), converging in a staircase path instead of moving directly to the minimum.
With tanh, the previous layer's activations are in (−1, 1) — they can be positive or negative. The gradient ∂L/∂wⱼ = δ · xⱼ can now be positive or negative independently for each wⱼ, depending on the sign of xⱼ. Different weights update in different directions in the same step. The optimizer takes a more direct path.
The practical difference is measured in training time and final loss. Networks with tanh hidden layers converge in fewer steps than sigmoid networks when the learning task requires different weights to move in different directions simultaneously — which is almost always.
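The sign argument can be made concrete with a toy example. This is a sketch under stated assumptions: the pre-activations `z_prev` and the error signal `delta` are hypothetical numbers picked for illustration, and only a single receiving neuron is considered.

```python
import numpy as np

# Hypothetical pre-activations of four units in the previous layer.
z_prev = np.array([-1.5, -0.2, 0.3, 2.0])

x_sigmoid = 1.0 / (1.0 + np.exp(-z_prev))  # activations in (0, 1)
x_tanh = np.tanh(z_prev)                   # activations in (-1, 1)

delta = -0.3  # hypothetical error signal at the receiving neuron

# dL/dw_j = delta * x_j for each incoming weight w_j.
grad_sigmoid = delta * x_sigmoid
grad_tanh = delta * x_tanh

print("sigmoid inputs, gradient signs:", np.sign(grad_sigmoid))
print("tanh inputs,    gradient signs:", np.sign(grad_tanh))
```

With sigmoid inputs every gradient carries the sign of `delta`. With tanh inputs the signs follow the signs of the individual activations, so they are mixed whenever the activations are.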
The Derivative
tanh'(z) = 1 − tanh²(z)
Derivation using the quotient rule on tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ):
Step 1: Let u = eᶻ − e⁻ᶻ, v = eᶻ + e⁻ᶻ. Then du/dz = eᶻ + e⁻ᶻ = v, dv/dz = eᶻ − e⁻ᶻ = u.
Step 2: Quotient rule: dtanh/dz = (v·v − u·u)/v² = (v² − u²)/v²
Step 3: Expand using the difference of squares: v² − u² = (v − u)(v + u) = (2e⁻ᶻ)(2eᶻ) = 4
Step 4: So dtanh/dz = 4/v² = 4/(eᶻ+e⁻ᶻ)² = 1 − (eᶻ−e⁻ᶻ)²/(eᶻ+e⁻ᶻ)² = 1 − tanh²(z) ✓
The maximum is at z = 0: tanh'(0) = 1 − 0² = 1.0 — four times larger than sigmoid's max of 0.25.
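One way to sanity-check the derivation is to compare the closed form against a central finite difference. The sketch below assumes NumPy; the step size `h` is an arbitrary small value chosen for this example.

```python
import numpy as np

z_vals = np.array([-5.0, -2.0, 0.0, 0.5, 2.0, 5.0])
h = 1e-5  # finite-difference step (arbitrary small value)

analytic = 1.0 - np.tanh(z_vals) ** 2                      # 1 - tanh^2(z)
exp_form = 4.0 / (np.exp(z_vals) + np.exp(-z_vals)) ** 2   # 4 / (e^z + e^-z)^2
numeric = (np.tanh(z_vals + h) - np.tanh(z_vals - h)) / (2.0 * h)

for z, a, e, n in zip(z_vals, analytic, exp_form, numeric):
    print(f"z={z:>4.1f}  1-tanh^2={a:.6f}  4/v^2={e:.6f}  finite diff={n:.6f}")
```

All three columns should agree to the printed precision, which supports both Step 4 of the derivation and the final form 1 − tanh²(z).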
tanh'(z) for the Six Anchor Values
| z | tanh(z) | tanh²(z) | tanh'(z) = 1−tanh² |
|---|---|---|---|
| −5 | −0.9999 | 0.9998 | 0.0002 |
| −2 | −0.9640 | 0.9293 | 0.0707 |
| 0 | 0.0000 | 0.0000 | 1.0000 |
| 0.5 | 0.4621 | 0.2136 | 0.7864 |
| 2 | 0.9640 | 0.9293 | 0.0707 |
| 5 | 0.9999 | 0.9998 | 0.0002 |
The maximum gradient is 1.0, not 0.25. Gradient passing through a tanh neuron at z=0 is unchanged. But saturation at |z| > 2 still brings the derivative to near zero — tanh doesn't eliminate the vanishing gradient, it reduces it.
Saturation Comparison
| Activation | Range | Max gradient | Saturates at | Zero-centered |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0.25 | \|z\| > 4 | ✗ |
| Tanh | (−1, 1) | 1.0 | \|z\| > 2 | ✓ |
| ReLU | [0, ∞) | 1.0 (for z>0) | z < 0 only | ✗ |
Tanh and ReLU have the same max gradient (1.0), but they saturate in different ways. Tanh saturates symmetrically for large |z|. ReLU has zero gradient for every z < 0 (a neuron stuck there for all inputs is a dead neuron), but never saturates for z > 0. This is why ReLU dominates feedforward networks, while tanh remains standard in recurrent networks, where a bounded, saturating output is actually useful.
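To see how much of the peak gradient survives away from zero, the three derivatives can be evaluated side by side. This is a rough sketch, assuming NumPy; the function names are illustrative, and setting the ReLU derivative to 0 at exactly z = 0 is a convention chosen here, not something the post specifies.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    # Derivative of max(0, z); the value at exactly z = 0 is taken as 0 here.
    return (z > 0).astype(float)

z = np.array([-4.0, -2.0, 2.0, 4.0])
print("z                       :", z)
print("sigmoid' / peak (0.25)  :", np.round(sigmoid_grad(z) / 0.25, 4))
print("tanh'    / peak (1.0)   :", np.round(tanh_grad(z), 4))
print("relu'                   :", relu_grad(z))
```

Because tanh(z) = 2σ(2z) − 1, the chain rule gives tanh'(z) = 4σ'(2z), so sigmoid at z retains the same fraction of its peak gradient as tanh does at z/2.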
When to Use Tanh
Use tanh when:
- Hidden layers in RNNs — recurrent networks benefit from zero-centered activations because the same weights process inputs at many timesteps; zig-zag gradients compound badly over time
- When you need symmetric outputs from a hidden layer (positive and negative activations mean the next layer can learn both additive and subtractive combinations naturally)
Do not use tanh when:
- Building deep feedforward networks — ReLU or ELU will train faster and avoid saturation more effectively
- Output layer for classification — use sigmoid (binary) or softmax (multiclass)
Code
    import numpy as np

    def tanh_act(z):
        """Tanh activation."""
        return np.tanh(z)

    def tanh_grad(z):
        """Derivative of tanh: 1 - tanh^2(z)."""
        return 1 - np.tanh(z) ** 2

    z_vals = np.array([-5.0, -2.0, 0.0, 0.5, 2.0, 5.0])

    # Header labels are formatted outside an f-string because backslash escapes
    # inside f-string expressions are a syntax error before Python 3.12.
    print("{:>5} | {:>9} | {:>9} | {:>7} | {:>7}".format(
        "z", "tanh(z)", "tanh'(z)", "σ(z)", "σ'(z)"))
    print("-" * 52)
    for z in z_vals:
        t = tanh_act(z)
        tg = tanh_grad(z)
        s = 1 / (1 + np.exp(-z))   # sigmoid, for comparison
        sg = s * (1 - s)           # sigmoid derivative
        print(f"{z:>5.1f} | {t:>9.4f} | {tg:>9.4f} | {s:>7.4f} | {sg:>7.4f}")

Output:

        z |   tanh(z) |  tanh'(z) |    σ(z) |   σ'(z)
    ----------------------------------------------------
     -5.0 |   -0.9999 |    0.0002 |  0.0067 |  0.0066
     -2.0 |   -0.9640 |    0.0707 |  0.1192 |  0.1050
      0.0 |    0.0000 |    1.0000 |  0.5000 |  0.2500
      0.5 |    0.4621 |    0.7864 |  0.6225 |  0.2350
      2.0 |    0.9640 |    0.0707 |  0.8808 |  0.1050
      5.0 |    0.9999 |    0.0002 |  0.9933 |  0.0066

At z = 0.5, tanh's gradient is 0.7864 vs sigmoid's 0.2350, about 3.3× larger. Over a 5-layer network, that difference compounds across four such activations: 0.7864⁴ ≈ 0.382 vs 0.2350⁴ ≈ 0.003. In this regime the factor that tanh passes backward is roughly 125× larger than sigmoid's.
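The compounding can be reproduced with unrounded derivatives. This sketch assumes every layer sits at z = 0.5 and ignores the weight terms that also multiply into a real backpropagated gradient, so it isolates the activation derivatives only. Small differences from hand-rounded figures are expected.

```python
import numpy as np

z = 0.5
tanh_g = 1.0 - np.tanh(z) ** 2          # about 0.7864
s = 1.0 / (1.0 + np.exp(-z))
sigmoid_g = s * (1.0 - s)               # about 0.2350

# Product of n identical derivative factors (weights deliberately left out).
for n in range(1, 5):
    t_prod = tanh_g ** n
    s_prod = sigmoid_g ** n
    print(f"n={n}: tanh {t_prod:.4f}  sigmoid {s_prod:.4f}  ratio {t_prod / s_prod:.1f}")
```

The ratio grows geometrically with depth, which is the sense in which the gap between the two activations compounds.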
Honest Limitations
Still saturates for |z| > 2. The derivative at z = ±5 is 0.0002, which is even smaller than sigmoid's 0.0066 at the same point. Tanh does not eliminate the vanishing gradient problem, it reduces it. In networks deeper than about 6 layers, even tanh will suffer from vanishing gradients without additional techniques like residual connections or batch normalization.
Slower to compute than ReLU. Tanh requires evaluating exponentials for every neuron, while ReLU is a single comparison. At the scale of ResNet-50 (about 25 million parameters), where an activation is evaluated at every unit on every forward pass, this difference adds up.
Gradient variance. Even though tanh is zero-centered, the magnitude of gradients can still vary widely across neurons depending on how close z is to zero. This is not fully addressed until batch normalization normalizes the distribution of pre-activations within each mini-batch.
Related Concepts
Where this builds from: Sigmoid introduced the S-curve and the saturation problem. Tanh is the zero-centered version — same shape, different range. The zig-zag gradient problem is the specific weakness sigmoid has that tanh fixes.
Where this leads: ReLU eliminates saturation entirely for positive activations, trading the zero-centering property for speed and resistance to vanishing gradients in deep networks. Tanh appears inside every LSTM cell as the activation for candidate memory — the cell update uses tanh to produce a zero-centered candidate that the gates then scale.
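As a rough illustration of the LSTM candidate memory c̃ = tanh(Wₓx + Wₕh + b), here is a minimal NumPy sketch. The layer sizes, the random weights, and the zero initial hidden state are all hypothetical placeholders, and the gates that scale the candidate are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n_in, n_hidden = 3, 4            # hypothetical input and hidden sizes

# Hypothetical parameters; a trained LSTM would have learned these.
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=n_in)        # current input
h_prev = np.zeros(n_hidden)      # previous hidden state

# Candidate memory: bounded in (-1, 1) and able to take either sign.
c_tilde = np.tanh(W_x @ x + W_h @ h_prev + b)
print("candidate memory:", np.round(c_tilde, 4))
```

Each entry of `c_tilde` stays strictly between −1 and 1, so the candidate can both add to and subtract from the cell state once the gates scale it.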
Test Your Understanding
- Prove algebraically that tanh(z) = 2σ(2z) − 1, where σ is the standard sigmoid. Use the definition σ(z) = 1/(1+e⁻ᶻ) and simplify step by step.
- A 4-layer network uses tanh hidden layers with z values at every neuron hovering around ±2 (due to large initial weights). Compute the approximate gradient at layer 1 using tanh'(2) = 0.0707 and w = 0.5. Compare this to the same network using ReLU for the same weights and a typical active neuron (z=1).
- You replace sigmoid hidden layers with tanh in an existing network and observe that training is faster but the network still stalls at 15 epochs. What else could be causing the stall, and what techniques from the previous posts would you try next?
- In an LSTM cell, the candidate memory is computed as c̃ = tanh(Wₓx + Wₕh + b). Why is tanh specifically chosen here rather than ReLU? What would break if you used ReLU for the candidate memory?
- The zig-zag gradient problem occurs when all gradients have the same sign. If a tanh layer happens to produce mostly positive activations (e.g., inputs are all positive), does the zig-zag problem reappear? Explain using the formula ∂L/∂wⱼ = δ · xⱼ.