Input Gate & Candidate Memory
The forget gate decides what survives from the cell state's past. It never adds anything new. Once "He" has erased the gender dimension, the cell state needs fresh information written into that space — the input gate and candidate memory are the two components that handle writing new information in, and they split that job the same way the forget gate splits "how much" from "what."
Anchor: same setup as the forget gate post — hₜ₋₁ = [0.3, -0.1, 0.2, 0.5], xₜ = [0.8, 0.1, -0.3, 0.6], Cₜ₋₁ = [0.6, -0.4, 0.8, 0.2], concatenated input 8-dimensional. This post picks up where the forget gate left off and completes the cell state update.
Two Components Working Together
Candidate memory: C̃ₜ = tanh(WC·[hₜ₋₁, xₜ] + bC) — proposes what new information might be worth remembering, in range (-1, +1).
Input gate: iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — decides how much of that proposal to actually accept, in range (0, 1).
Why Two Components?
Splitting "what" from "how much" mirrors the forget gate's own split of "keep vs. erase" from "which dimension." C̃ₜ might propose "add 0.9 to this dimension" — but the input gate could respond "only accept 40% of that." Together, iₜ ⊙ C̃ₜ means "accept this proportion of this proposal," combining a magnitude/direction proposal with a separately-learned acceptance strength.
Using the anchor: C̃ₜ = [-0.1749, -0.0586, 0.0380, 0.0988] and iₜ = [0.4295, 0.4924, 0.5149, 0.5165].
| dim | C̃ₜ (proposal) | iₜ (accepted %) | new_info = iₜ⊙C̃ₜ | interpretation |
|---|---|---|---|---|
| 0 | -0.1749 | 0.4295 (43.0%) | -0.0751 | largest proposal in magnitude (a negative push), gate accepts less than half |
| 1 | -0.0586 | 0.4924 (49.2%) | -0.0289 | small negative nudge, roughly half accepted |
| 2 | 0.0380 | 0.5149 (51.5%) | 0.0196 | small positive nudge, just over half accepted |
| 3 | 0.0988 | 0.5165 (51.7%) | 0.0510 | largest positive proposal, also just over half accepted |
Dimension 0's candidate proposed -0.1749, but the input gate only accepted 42.95% of it, landing at -0.0751 — a proposal that gets partially, not fully, written in.
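The table's new_info column can be reproduced from the quoted numbers alone. This is a minimal sketch that assumes the four-decimal values of C̃ₜ and iₜ given in this post rather than recomputing them from the weights; the variable names are illustrative.

```python
import numpy as np

# Rounded values quoted in the post (not recomputed from the weights).
C_tilde = np.array([-0.1749, -0.0586, 0.0380, 0.0988])  # candidate proposal
i_t = np.array([0.4295, 0.4924, 0.5149, 0.5165])        # fraction accepted

# Element-wise product: each dimension's proposal scaled by its own gate value.
new_info = i_t * C_tilde

for d, (c, i, n) in enumerate(zip(C_tilde, i_t, new_info)):
    print(f"dim {d}: proposal {c:+.4f} x accepted {i:.2%} = {n:+.4f}")
```

Because the inputs are already rounded, the products agree with the table only to about four decimal places.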
Cell State Update (Full)
After the forget gate: Cₜ_old_retained = fₜ ⊙ Cₜ₋₁ = [0.2862, -0.1961, 0.4421, 0.1029]
After the input gate: Cₜ = Cₜ_old_retained + iₜ⊙C̃ₜ = [0.2111, -0.2250, 0.4617, 0.1539]
This is an addition, not a multiplication chain, and that is the main reason gradients survive across many timesteps. Differentiating with the gate values held fixed: ∂Cₜ/∂Cₜ₋₁ = fₜ. Not a product of many terms accumulated across the network's depth, just the forget gate's value at this single step.
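The derivative can be checked numerically. The sketch below treats the gates as constants for this step, which is an assumption: in a full LSTM the gates also depend on hₜ₋₁, so this covers only the direct cell-state path. The forget gate values are recovered from the post's rounded numbers as C_retained / Cₜ₋₁, and `cell_update` is a hypothetical helper name.

```python
import numpy as np

# Gate values held fixed for this step. f_t is recovered from the post's
# numbers as C_retained / C_prev, so it carries their rounding error.
C_prev = np.array([0.6, -0.4, 0.8, 0.2])
f_t = np.array([0.2862, -0.1961, 0.4421, 0.1029]) / C_prev
new_info = np.array([-0.0751, -0.0289, 0.0196, 0.0510])  # i_t * C_tilde

def cell_update(C_old):
    """Additive cell state update with the gates treated as constants."""
    return f_t * C_old + new_info

# Central finite differences, perturbing one input dimension at a time.
eps = 1e-6
jac = np.zeros((4, 4))
for j in range(4):
    step = np.zeros(4)
    step[j] = eps
    jac[:, j] = (cell_update(C_prev + step) - cell_update(C_prev - step)) / (2 * eps)

print(np.round(jac, 4))
print("matches diag(f_t):", np.allclose(jac, np.diag(f_t), atol=1e-6))
```

The Jacobian comes out diagonal with fₜ on the diagonal: each cell dimension's gradient is scaled only by its own forget gate value, and the added iₜ⊙C̃ₜ term contributes nothing to this derivative.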
The Gradient Highway
The gradient of the loss with respect to Cₜ₋₁: ∂L/∂Cₜ₋₁ = ∂L/∂Cₜ × ∂Cₜ/∂Cₜ₋₁ = ∂L/∂Cₜ ⊙ fₜ
One multiplication by fₜ per step, not a product accumulated through tanh derivatives and weight matrices the way a vanilla RNN's hidden state gradient is. As long as fₜ stays close to 1, the gradient signal passing through the cell state survives across many timesteps largely intact. This is what post 01 called the "conveyor belt": the input and forget gates together build it, but the addition in the cell state update is what makes it structurally different from the vanilla RNN's multiplicative chain.
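How much gradient survives depends strongly on how close fₜ is to 1. The sketch below assumes a constant forget gate value at every step and a hypothetical sequence length of 100; both are chosen purely for illustration, since a trained gate varies per step and per dimension.

```python
import numpy as np

steps = 100  # hypothetical sequence length, chosen for illustration

scalings = {}
for f in [0.5, 0.95, 0.99]:
    # Constant forget gate value at every step; the gradient along the
    # cell-state path is multiplied by f once per step.
    forget_values = np.full(steps, f)
    scalings[f] = float(np.prod(forget_values))
    print(f"f_t = {f:<4} -> gradient scaled by {scalings[f]:.3e} after {steps} steps")
```

A gate that sits at 0.5 is far from closed, yet it still shrinks the gradient to almost nothing over a long sequence, while a gate at 0.99 keeps a usable fraction of it.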
Code
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Anchor values
h_prev = np.array([0.3, -0.1, 0.2, 0.5])   # previous hidden state
x = np.array([0.8, 0.1, -0.3, 0.6])        # current input
C_prev = np.array([0.6, -0.4, 0.8, 0.2])   # previous cell state
inp = np.concatenate([h_prev, x])          # 8-dimensional [h_prev, x]

# Small random weights and zero biases for each component
np.random.seed(7)
Wf = np.random.randn(4, 8) * 0.1; bf = np.zeros(4)   # forget gate
Wi = np.random.randn(4, 8) * 0.1; bi = np.zeros(4)   # input gate
WC = np.random.randn(4, 8) * 0.1; bC = np.zeros(4)   # candidate memory

f_t = sigmoid(Wf @ inp + bf)          # how much of the old cell state to keep
i_t = sigmoid(Wi @ inp + bi)          # how much of the proposal to accept
C_tilde = np.tanh(WC @ inp + bC)      # the proposal itself

C_retained = f_t * C_prev             # retained old information
new_info = i_t * C_tilde              # accepted new information
C_t = C_retained + new_info           # additive cell state update

print("Input gate i_t: ", np.round(i_t, 4))
print("Candidate C_tilde:", np.round(C_tilde, 4))
print("New info (i⊙C̃): ", np.round(new_info, 4))
print("C_retained: ", np.round(C_retained, 4))
print("C_t (new): ", np.round(C_t, 4))

Output:
Input gate i_t: [0.4295 0.4924 0.5149 0.5165]
Candidate C_tilde: [-0.1749 -0.0586 0.038 0.0988]
New info (i⊙C̃): [-0.0751 -0.0289 0.0196 0.051 ]
C_retained: [ 0.2862 -0.1961 0.4421 0.1029]
C_t (new): [ 0.2111 -0.225 0.4617 0.1539]

Hyperparameter Sensitivity: Weight Initialization Scale
Both iₜ and C̃ₜ pass a linear combination through a saturating nonlinearity (sigmoid, tanh) — the scale of the weight initialization directly controls how far into that saturation region the gate starts out.
import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))
h_prev = np.array([0.3, -0.1, 0.2, 0.5])
x = np.array([0.8, 0.1, -0.3, 0.6])
inp = np.concatenate([h_prev, x])
for scale in [0.1, 1.0, 5.0]:
    np.random.seed(7)
    Wf = np.random.randn(4, 8) * 0.1  # keeps the random stream aligned with the anchor
    Wi = np.random.randn(4, 8) * scale
    bi = np.zeros(4)
    i_t = sigmoid(Wi @ inp + bi)
    print(f"scale={scale:<4} i_t = {np.round(i_t, 4)}")

Output:
scale=0.1 i_t = [0.4295 0.4924 0.5149 0.5165]
scale=1.0 i_t = [0.0553 0.4241 0.6445 0.6595]
scale=5.0 i_t = [0. 0.178 0.9514 0.9646]

At scale=0.1 (the anchor's setting), iₜ sits close to 0.5 in every dimension: the gate has no strong opinion yet, which is the intended state for freshly initialized weights. At scale=5.0, three of the four dimensions are pushed toward the extremes (0.0, 0.95 and 0.96). Those gates are saturated before any training has happened, sigmoid's gradient there is close to zero, and backpropagation can barely adjust Wi. Initializing gate weights too large is a common cause of an LSTM whose input gate never learns to move: it starts stuck nearly fully open or fully closed and stays there.
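The saturation argument can be made concrete with sigmoid's derivative, which can be written in terms of its own output as σ'(z) = σ(z)(1 − σ(z)). The sketch below plugs in the rounded iₜ values printed above; `sigmoid_grad_from_output` is a hypothetical helper name, and the first scale=5.0 entry is only 0 after rounding, so its true gradient is tiny rather than exactly zero.

```python
import numpy as np

# Input gate activations as printed above for two initialization scales.
i_small = np.array([0.4295, 0.4924, 0.5149, 0.5165])  # scale=0.1
i_large = np.array([0.0, 0.178, 0.9514, 0.9646])      # scale=5.0 (rounded)

def sigmoid_grad_from_output(s):
    """Derivative of sigmoid expressed through its output: s * (1 - s)."""
    return s * (1.0 - s)

grad_small = sigmoid_grad_from_output(i_small)
grad_large = sigmoid_grad_from_output(i_large)

print("scale=0.1 local gradient:", np.round(grad_small, 4))
print("scale=5.0 local gradient:", np.round(grad_large, 4))
```

Near 0.5 the local gradient sits close to its maximum of 0.25. At the saturated gates it drops by a large factor, which is the factor every update to Wi has to pass through.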
Related Concepts
Where this builds from: The forget gate (post 03) produces fₜ⊙Cₜ₋₁, the "retained old" half of this post's addition — the input gate and candidate memory produce the other half. Tanh's (-1,1) range and sigmoid's (0,1) range were both established in post 02's overview.
Where this leads: Cₜ, completed here, feeds directly into the output gate (post 05), which decides how much of this updated cell state to expose as hₜ.
Honest Limitations
The gradient highway is conditional on the forget gate. Even though the cell state update is additive, ∂Cₜ/∂Cₜ₋₁ = fₜ still multiplies the gradient by fₜ at every step, so a forget gate that is genuinely near 0 at some timestep (correctly erasing information) also cuts off gradient flow backward through that dimension at that timestep. This is usually the right behavior (the information really was irrelevant), but it means the highway is not an unconditional guarantee: it depends on what the forget gate has learned to do.
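A single erase step is enough to gate the whole path. The sketch below uses a hypothetical forget gate trace for one cell dimension over 50 steps; the values 0.99 and 0.01 and the position of the erase step are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical forget gate trace for one cell dimension over 50 steps:
# almost fully open everywhere.
f_trace = np.full(50, 0.99)
open_only = float(np.prod(f_trace))

# Same trace, but with a single "erase" step (index chosen arbitrarily).
f_trace[20] = 0.01
with_erase = float(np.prod(f_trace))

print(f"all steps at 0.99:    {open_only:.4f}")
print(f"one step at 0.01:     {with_erase:.6f}")
print(f"ratio (erase / open): {with_erase / open_only:.4f}")
```

The gradient reaching steps before the erase point is roughly a hundred times smaller, regardless of how open the gate was at every other step.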
The forget gate and input gate are computed independently, so both can be close to 1 at the same time: "keep everything old" and "add everything new" are not mutually exclusive. A cell state that keeps accumulating both without ever trimming can grow large in magnitude over long sequences. That pushes tanh(Cₜ), applied when the output gate produces hₜ, into saturation, where it stops discriminating between values.
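The growth is easy to see in a toy run. This is a minimal sketch for a single cell dimension, assuming constant hypothetical values for the forget gate, input gate, and candidate at every step; a real model's values change with the input.

```python
import numpy as np

# Hypothetical constant values for one cell dimension: keep almost
# everything old and accept almost everything new, at every step.
f_gate, i_gate, c_tilde = 0.99, 0.99, 0.9

C = 0.0
history = []
for _ in range(200):
    C = f_gate * C + i_gate * c_tilde  # same additive update as in the post
    history.append(C)

for t in [1, 5, 20, 200]:
    C_t = history[t - 1]
    print(f"t={t:>3}  C_t={C_t:8.3f}  tanh(C_t)={np.tanh(C_t):.6f}")
```

With a forget gate strictly below 1 the cell state stays bounded (it approaches i_gate * c_tilde / (1 - f_gate)), but that bound lies far inside tanh's flat region, so very different cell values map to nearly the same output.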
Test Your Understanding
- Why does the candidate memory use tanh while the input gate uses sigmoid, given that they're combined by simple element-wise multiplication right after?
- In the anchor, dimension 0 had C̃ₜ = -0.1749 and iₜ = 0.4295, giving new_info = -0.0751. If iₜ[0] were instead 0.95 (near-full acceptance), what would new_info[0] become, and what would the final Cₜ[0] be (using C_retained[0] = 0.2862)?
- The gradient highway formula ∂Cₜ/∂Cₜ₋₁ = fₜ shows the cell-state gradient depends only on the forget gate, not the input gate. Why doesn't the input gate's value appear in this derivative, even though iₜ⊙C̃ₜ is added to Cₜ at every step?
- Suppose a sequence has 50 timesteps where the forget gate consistently equals 0.9. Using the gradient highway formula repeatedly, estimate the cumulative gradient scaling factor from timestep 50 back to timestep 1. Compare this to the vanilla RNN's 0.7⁵⁰ from post 01: which shrinks less?
- A trained LSTM shows input gate values near 1 and forget gate values near 1 simultaneously across almost every timestep of a long sequence. Given the honest limitation above about cell state growth, what training or architectural change would you investigate to check whether this is actually a problem for that model?