Sigmoid Activation Function
A neuron computes a weighted sum z = w·x + b. That number can be anything: −1000, 0, 47.3. But for a binary classification output — "will this loan default?" — you need a number between 0 and 1 that you can interpret as a probability. Sigmoid is the function that does that translation.
The anchor throughout this post is a loan default classifier. The neuron receives a pre-activation z computed from an applicant's features, and sigmoid converts z into a probability of default.
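As a rough sketch of that pipeline, the snippet below computes z = w·x + b for one applicant and passes it through sigmoid. The feature names, weights, and bias are invented for illustration (picked so that z lands on 0.5, one of the values tabulated later in this post); they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, already-scaled applicant features (illustrative only):
# [debt_to_income, credit_utilization, late_payment_rate]
x = np.array([0.5, 0.9, 0.2])

# Hypothetical weights and bias, not learned from any data.
w = np.array([2.0, 1.0, 3.0])
b = -2.0

z = np.dot(w, x) + b       # pre-activation: weighted sum plus bias
p_default = sigmoid(z)     # read as P(default | x)

print(f"z = {z:.2f}, P(default) = {p_default:.4f}")
# prints: z = 0.50, P(default) = 0.6225
```

Changing any feature or weight moves z along the real line, and sigmoid squashes the result back into a probability.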
The Formula
σ(z) = 1 / (1 + e⁻ᶻ)
To see why the output always lies in (0, 1):
- When z → +∞: e⁻ᶻ → 0, so σ(z) → 1/(1+0) = 1
- When z → −∞: e⁻ᶻ → ∞, so σ(z) → 1/(1+∞) = 0
- When z = 0: σ(0) = 1/(1+1) = 0.5
The function is strictly increasing: larger z always means larger σ(z). It never actually reaches 0 or 1, only approaches them asymptotically, so every output lies strictly inside (0, 1) and can be read as a probability.
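One practical caveat is that the limits above describe real numbers, not floating point. The sketch below assumes NumPy and float64; `stable_sigmoid` is a name chosen here, not a library function. It evaluates the same formula in a rearranged form so that `np.exp` only ever sees a non-positive argument, which avoids the overflow warning the direct form can raise for inputs like z = −1000.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated without ever calling exp on a large positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                 # always in [0, 1], cannot overflow
    # z >= 0: 1 / (1 + e^-z)      z < 0: e^z / (1 + e^z)  (same function)
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

for z in [-1000.0, -5.0, 0.0, 47.3]:
    print(z, float(stable_sigmoid(z)))

# In float64 the two extreme inputs round to exactly 0.0 and 1.0,
# even though the mathematical function never reaches either bound.
```

If a downstream step takes log(σ) or log(1 − σ), that rounding matters, which is one reason losses are often computed from z directly rather than from the rounded probability.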
Computing σ(z) for Six Representative Values
| z | e⁻ᶻ | 1 + e⁻ᶻ | σ(z) = 1/(1+e⁻ᶻ) |
|---|---|---|---|
| −5 | e⁵ = 148.41 | 149.41 | 0.0067 |
| −2 | e² = 7.389 | 8.389 | 0.1192 |
| 0 | e⁰ = 1.000 | 2.000 | 0.5000 |
| 0.5 | e⁻⁰·⁵ = 0.607 | 1.607 | 0.6225 |
| 2 | e⁻² = 0.135 | 1.135 | 0.8808 |
| 5 | e⁻⁵ = 0.0067 | 1.0067 | 0.9933 |
Applicant with z = −5: 0.67% probability of default. Applicant with z = 5: 99.33% probability of default. The decision threshold is at z = 0 (50%). Anything above zero → predict default; below zero → predict no default.
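Because sigmoid is strictly increasing, thresholding the probability at 0.5 is the same as thresholding z at 0. The sketch below checks that on the anchor values (z = 0 itself is left out to sidestep the tie). As an aside that goes slightly beyond this post, it also uses the inverse function, the logit, to translate a different probability cutoff into a cutoff on z; the 0.8 value is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of sigmoid: the z at which sigmoid(z) equals p."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, -2.0, 0.5, 2.0, 5.0])

predict_by_probability = sigmoid(z) > 0.5   # threshold on the probability
predict_by_score = z > 0.0                  # threshold directly on z
print(np.array_equal(predict_by_probability, predict_by_score))  # True

# A stricter, purely illustrative cutoff: flag only if P(default) > 0.8.
z_cutoff = logit(0.8)                       # ln(4), roughly 1.386
print(f"P(default) > 0.8 is equivalent to z > {z_cutoff:.3f}")
```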
The Derivative
To train the network, backpropagation needs ∂σ/∂z. The derivation applies the quotient rule to σ(z) = 1 / (1 + e⁻ᶻ):
Step 1: Let u = 1 and v = 1 + e⁻ᶻ, so σ = u/v, with du/dz = 0 and dv/dz = −e⁻ᶻ.
Step 2: Quotient rule: dσ/dz = (v·du/dz − u·dv/dz) / v² = (0 − 1·(−e⁻ᶻ)) / (1 + e⁻ᶻ)² = e⁻ᶻ / (1 + e⁻ᶻ)²
Step 3: Rewrite: e⁻ᶻ / (1 + e⁻ᶻ)² = [1/(1 + e⁻ᶻ)] × [e⁻ᶻ/(1 + e⁻ᶻ)] = σ(z) × [1 − 1/(1 + e⁻ᶻ)]
Step 4: Simplify: σ(z) × [1 − σ(z)]
Therefore: σ'(z) = σ(z) · (1 − σ(z))
This is the elegant result. You already computed σ(z) during the forward pass — reuse it. The derivative costs one multiplication and one subtraction.
σ'(z) for the Six Anchor Values
| z | σ(z) | 1 − σ(z) | σ'(z) = σ(z)(1−σ(z)) |
|---|---|---|---|
| −5 | 0.0067 | 0.9933 | 0.0066 |
| −2 | 0.1192 | 0.8808 | 0.1050 |
| 0 | 0.5000 | 0.5000 | 0.2500 |
| 0.5 | 0.6225 | 0.3775 | 0.2350 |
| 2 | 0.8808 | 0.1192 | 0.1050 |
| 5 | 0.9933 | 0.0067 | 0.0066 |
Maximum derivative is 0.25 at z = 0. For z = ±5, the derivative is 0.0066 — the gradient through this neuron shrinks to 0.66% of its incoming value. This is the vanishing gradient problem in a single number.
Bipolar Sigmoid (Version 2)
Some formulations use a "bipolar sigmoid" that outputs in (−1, 1) instead of (0, 1). The most common version is:
σ₂(z) = 2σ(z) − 1 = (1 − e⁻ᶻ) / (1 + e⁻ᶻ)
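This is exactly tanh(z/2), a horizontally stretched tanh, not tanh(z) itself: the table entry σ₂(2) = 0.7616 equals tanh(1). Equivalently, tanh(z) = 2σ(2z) − 1, so tanh and sigmoid differ only by linear rescalings of the input and the output. The bipolar sigmoid is sometimes called "v2" to distinguish it from the standard sigmoid. Its output range (−1, 1) is zero-centered, which eliminates the same-sign gradient problem described below.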
| z | σ(z) | σ₂(z) = 2σ(z)−1 |
|---|---|---|
| −5 | 0.0067 | −0.9866 |
| −2 | 0.1192 | −0.7616 |
| 0 | 0.5000 | 0.0000 |
| 0.5 | 0.6225 | 0.2449 |
| 2 | 0.8808 | 0.7616 |
| 5 | 0.9933 | 0.9866 |
The key difference: σ₂(0) = 0, not 0.5. When z = 0, the bipolar sigmoid outputs exactly zero, and its outputs are spread symmetrically around zero. Standard sigmoid outputs 0.5 at z = 0 and is positive everywhere; that all-positive output is what causes the zig-zag gradient problem.
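The relationship between σ₂ and tanh is easy to check numerically. The sketch below, which assumes NumPy's `np.tanh` as the reference implementation, compares 2σ(z) − 1 against both tanh(z/2) and tanh(z) on a grid, and also checks the reverse identity tanh(z) = 2σ(2z) − 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bipolar_sigmoid(z):
    return 2.0 * sigmoid(z) - 1.0

z = np.linspace(-6.0, 6.0, 241)

same_as_tanh_half = np.allclose(bipolar_sigmoid(z), np.tanh(z / 2.0))
same_as_tanh_full = np.allclose(bipolar_sigmoid(z), np.tanh(z))
tanh_from_sigmoid = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

print(same_as_tanh_half, same_as_tanh_full, tanh_from_sigmoid)
# prints: True False True
```

So the bipolar sigmoid has the same shape as tanh with the input axis stretched by a factor of two, which is also why its slope at the origin is 0.5 rather than tanh's 1.0.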
Properties Summary
| Property | Standard Sigmoid σ(z) | Bipolar Sigmoid σ₂(z) |
|---|---|---|
| Range | (0, 1) | (−1, 1) |
| Output at z=0 | 0.5 | 0.0 |
| Zero-centered | ✗ | ✓ |
| Saturates at | z ≪ 0 and z ≫ 0 | z ≪ 0 and z ≫ 0 |
| Max derivative | 0.25 | 0.50 (at z=0) |
| Vanishing gradient risk | High | Lower (but still exists) |
| Best use | Binary output layer, LSTM gates | Rarely — use tanh directly |
When to Use Sigmoid
Use sigmoid when:
- You have a binary classification output neuron — sigmoid output is directly interpretable as P(y=1|x)
- Inside LSTM and GRU gate functions — the gates require values in [0, 1] to control how much information passes through
- You need probability-scale outputs rather than raw logits (whether those probabilities are well calibrated still depends on training and should be checked)
Do not use sigmoid when:
- You are building hidden layers in a deep network — use ReLU to avoid vanishing gradients
- You have a multiclass output — use softmax instead (cross-reference the softmax post)
- You need fast inference at scale — exp() is expensive compared to max(0, z)
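For the gate use case in the list above, here is a minimal sketch of the one operation that matters: an element-wise multiplication by a sigmoid output. It is not a full LSTM. The pre-activation values and the cell state are hypothetical numbers chosen for illustration, standing in for what Wₓx + Wₕh + b and the previous cell state would be in a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate pre-activations for a 4-unit cell (illustrative values).
gate_pre_activation = np.array([-4.0, -1.0, 1.0, 4.0])

# Hypothetical previous cell state.
previous_cell_state = np.array([2.0, -3.0, 0.5, 1.5])

forget_gate = sigmoid(gate_pre_activation)       # each entry strictly between 0 and 1
kept_state = forget_gate * previous_cell_state   # element-wise "how much to keep"

for f, c, k in zip(forget_gate, previous_cell_state, kept_state):
    print(f"gate {f:.3f} x state {c:+.2f} -> kept {k:+.3f}")
```

Each gate value acts as a dial: near 0 it erases the corresponding state entry, near 1 it passes the entry through almost unchanged, and it never flips the sign.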
Code
```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to a file
import matplotlib.pyplot as plt


def sigmoid(z):
    """Standard sigmoid: 1 / (1 + e^-z)."""
    return 1 / (1 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid, computed from its own output."""
    s = sigmoid(z)
    return s * (1 - s)


# Table for the six anchor values
z_vals = np.array([-5.0, -2.0, 0.0, 0.5, 2.0, 5.0])
# str.format keeps the apostrophe in σ'(z) out of an f-string expression,
# where the escaped quote is a syntax error before Python 3.12.
print("{:>5} | {:>7} | {:>7} | {:>8}".format("z", "σ(z)", "σ'(z)", "σ₂(z)"))
print("-" * 40)
for z in z_vals:
    s = sigmoid(z)
    g = sigmoid_grad(z)
    s2 = 2 * s - 1  # bipolar sigmoid
    print(f"{z:>5.1f} | {s:>7.4f} | {g:>7.4f} | {s2:>8.4f}")

# Plot
z = np.linspace(-6, 6, 300)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: standard vs bipolar sigmoid
axes[0].plot(z, sigmoid(z), color="#3b82f6", linewidth=2.5, label="σ(z)")
axes[0].plot(z, 2*sigmoid(z)-1, color="#dc2626", linewidth=2, linestyle="--", label="σ₂(z) = 2σ−1")
axes[0].axhline(0.5, color="#94a3b8", linestyle=":", linewidth=1)
axes[0].axhline(0, color="#94a3b8", linestyle=":", linewidth=1)
axes[0].set_title("Sigmoid vs Bipolar Sigmoid")
axes[0].legend()

# Right panel: derivative with its maximum marked
axes[1].plot(z, sigmoid_grad(z), color="#3b82f6", linewidth=2.5)
axes[1].axhline(0.25, color="#f59e0b", linestyle="--", linewidth=1.5, label="max = 0.25")
axes[1].set_title("Sigmoid Derivative σ'(z)")
axes[1].legend()

plt.tight_layout()
plt.savefig("sigmoid_plots.png", dpi=150)
print("\nPlot saved to sigmoid_plots.png")
```

Output:

```text
    z |    σ(z) |   σ'(z) |    σ₂(z)
----------------------------------------
 -5.0 |  0.0067 |  0.0066 |  -0.9866
 -2.0 |  0.1192 |  0.1050 |  -0.7616
  0.0 |  0.5000 |  0.2500 |   0.0000
  0.5 |  0.6225 |  0.2350 |   0.2449
  2.0 |  0.8808 |  0.1050 |   0.7616
  5.0 |  0.9933 |  0.0066 |   0.9866

Plot saved to sigmoid_plots.png
```

Honest Limitations
Not zero-centered. Sigmoid always outputs a positive number (between 0 and 1). Consider a neuron in the next layer that receives those outputs xⱼ as its inputs. During backpropagation, the gradient with respect to its weight wⱼ is ∂L/∂wⱼ = δ · xⱼ, where δ is that neuron's error signal. Because every xⱼ is positive, every ∂L/∂wⱼ has the same sign as δ. All weights feeding that neuron therefore move in the same direction in a given step: either all increase or all decrease together. The optimizer cannot increase some of them while decreasing others in the same step, so it zig-zags toward the optimum instead of moving directly. Tanh avoids this by being zero-centered.
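The same-sign effect can be seen with four numbers. In the sketch below, one receiving neuron gets its inputs xⱼ either from sigmoid or from tanh units; the pre-activations and the error signal δ = −0.3 are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical pre-activations of four units in the previous layer.
pre_activations = np.array([-2.0, -0.5, 0.5, 2.0])

x_sigmoid = 1.0 / (1.0 + np.exp(-pre_activations))   # all strictly positive
x_tanh = np.tanh(pre_activations)                    # mixed signs

delta = -0.3   # hypothetical error signal of the receiving neuron

# dL/dw_j = delta * x_j for each incoming weight w_j
grad_with_sigmoid_inputs = delta * x_sigmoid
grad_with_tanh_inputs = delta * x_tanh

print(np.sign(grad_with_sigmoid_inputs))   # every entry is -1
print(np.sign(grad_with_tanh_inputs))      # mixed: +1, +1, -1, -1
```

With sigmoid inputs, flipping the sign of δ flips all four gradient components at once; with tanh inputs, individual weights are free to move in opposite directions within a single step.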
Computationally expensive. Computing e⁻ᶻ requires a floating-point exponentiation at every neuron. ReLU is max(0, z) — a single comparison. In a network with millions of neurons, this difference adds up.
Vanishing gradients in deep networks. With a maximum derivative of 0.25, each sigmoid activation scales the gradient passing through it by at most 0.25. After four layers, the activations alone contribute a factor of at most 0.25⁴ ≈ 0.0039. The full picture is in the previous post on the vanishing gradient problem.
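To see how quickly that bound shrinks, the sketch below multiplies the activation derivative across depth for two simplified cases: every neuron sitting exactly at z = 0 (the best case, 0.25 per layer) and every neuron sitting at z = 2. It deliberately ignores the weights, which contribute their own factor in a real network, so treat it as an illustration of the activation's share only.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

MAX_SIGMOID_GRAD = 0.25                      # σ'(0), the largest possible value

depths = np.arange(1, 9)
best_case = MAX_SIGMOID_GRAD ** depths       # every neuron at z = 0
at_z_equals_2 = sigmoid_grad(2.0) ** depths  # every neuron at z = 2 (about 0.105 per layer)

for d, best, sat in zip(depths, best_case, at_z_equals_2):
    print(f"{d} layers: at most {best:.2e}   (at z = 2: {sat:.2e})")
```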
Related Concepts
Where this builds from: The vanishing gradient problem explains why sigmoid's max derivative of 0.25 is a serious constraint for deep networks. Binary cross-entropy loss — the loss function typically paired with sigmoid output — is what makes the combined (BCE + sigmoid) gradient simplify to (ŷ − y), the clean expression used in backpropagation.
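The (ŷ − y) simplification can be checked numerically without doing the algebra. The sketch below differentiates BCE with respect to z by central finite differences and compares the result with σ(z) − y. The logit z = 0.5, the label y = 1, and the step size are illustrative choices, and `bce_from_logit` is a helper name introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy of the prediction sigmoid(z) against label y."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z, y = 0.5, 1.0          # illustrative logit and label
h = 1e-6                 # finite-difference step

numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2.0 * h)
analytic = sigmoid(z) - y          # the claimed closed form, y_hat - y

print(f"finite difference: {numeric:.6f}   y_hat - y: {analytic:.6f}")
```

The σ'(z) factor from the activation cancels against the 1/(σ(1 − σ)) factor from the loss, which is why the combined gradient does not vanish when the output neuron saturates on a wrong prediction.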
Where this leads: Tanh is the direct improvement — it is zero-centered and has a larger max derivative of 1.0, eliminating the zig-zag problem while keeping the S-shape. Sigmoid also appears inside every LSTM gate, where the (0,1) range is not a limitation but a requirement — the gate needs to express "how much to keep" on a [0,1] scale.
Test Your Understanding
- The sigmoid derivative σ'(z) = σ(z)(1−σ(z)) achieves its maximum of 0.25 at z = 0. Without computing, determine at what value of z the derivative equals 0.1050. Is there more than one such z? Why?
- A binary classifier outputs σ(z) = 0.62 for a loan applicant. The true label is y = 0 (no default). Compute the binary cross-entropy loss BCE = −[y·log(σ) + (1−y)·log(1−σ)] and the gradient ∂BCE/∂z = σ(z) − y. What is the sign of the gradient, and what does it mean for the weight update?
- You train a 3-layer network where every layer uses sigmoid (including hidden layers). All weights are initialized to 2.0. Predict, without running code, whether gradients will vanish or explode in this network, and justify your answer using the derivative formula.
- The zig-zag gradient problem occurs because sigmoid is not zero-centered. Explain in geometric terms what "zig-zagging toward the optimum" means: draw or describe the path gradient descent takes on a 2D weight surface when both weights must always update in the same direction.
- An LSTM forget gate computes f = σ(Wₓx + Wₕh + b). The gate value f must be in [0, 1] to act as a "how much to forget" multiplier. Why would replacing σ with ReLU in the LSTM gate break the architecture, even though ReLU is better for deep networks in general?