Forget Gate

Jul 3, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The cell state carries information forward across timesteps, but not everything that was relevant a moment ago stays relevant. A language model tracking "she has been studying" needs to hold onto the subject's gender to generate the right pronoun later — until the subject changes. The forget gate is the mechanism that decides, dimension by dimension, what part of the cell state is still useful and what should be cleared out to make room.

Anchor: a language model predicting the next word after "She has been studying. He..." The cell state going into this step still encodes "she/her" in one of its dimensions. The moment "He" arrives, that dimension needs to be erased.

  • hₜ₋₁ = [0.3, -0.1, 0.2, 0.5] (hidden state right before "He")
  • xₜ = [0.8, 0.1, -0.3, 0.6] (embedding for "He")
  • h_dim = 4, x_dim = 4 → concatenated input is 8-dimensional
  • Cₜ₋₁ = [0.6, -0.4, 0.8, 0.2] — dimension 0 carries the gender signal

Formula

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)

Wf is a 4×8 weight matrix and bf is a 4-dimensional bias. Together they map the 8-dimensional concatenated input down to a 4-dimensional vector, one value per cell-state dimension, and sigmoid squashes each of those 4 values into (0, 1).
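A minimal shape check for the formula, using the anchor's hₜ₋₁ and xₜ. The weights here are random placeholders (an assumption for illustration, not the hand-tuned values used later in the post); only the dimensions matter.

```python
import numpy as np

h_prev = np.array([0.3, -0.1, 0.2, 0.5])   # h_{t-1}, 4 values
x_t = np.array([0.8, 0.1, -0.3, 0.6])      # x_t, 4 values

concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t], 8 values

# Placeholder parameters (random, NOT the post's weights)
rng = np.random.default_rng(0)
W_f = rng.normal(size=(4, 8))              # one row per cell-state dimension
b_f = np.zeros(4)

f_t = 1.0 / (1.0 + np.exp(-(W_f @ concat + b_f)))

print(concat.shape, W_f.shape, f_t.shape)
```

Whatever the weights are, the result has one gate value per cell-state dimension, each strictly between 0 and 1.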


What It Does

Each entry of fₜ is a per-dimension "how much to keep" value:

  • fₜ = 1 → keep this dimension of the cell state completely
  • fₜ = 0 → erase this dimension completely
  • fₜ = 0.3 → keep 30%, forget 70%

The retained cell state is Cₜ_retained = fₜ ⊙ Cₜ₋₁. This is an element-wise scaling, not a replacement. The forget gate never decides what new information comes in (that's the input gate's job, next post); it only controls what survives from before.
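The three cases in the list above can be sketched on a toy vector. The cell-state values below are hypothetical and chosen to be equal, so the effect of each gate value is easy to read off.

```python
import numpy as np

C_prev = np.array([0.5, 0.5, 0.5])   # hypothetical cell state, same value in every dimension
f_t = np.array([1.0, 0.0, 0.3])      # keep everything, erase everything, keep 30%

C_retained = f_t * C_prev            # element-wise product: each dimension scaled on its own
print(C_retained)
```

Each dimension is scaled independently: the first is untouched, the second is zeroed, and the third shrinks to 30% of its previous value.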


Numerical Computation

Using weights hand-tuned to illustrate a clear topic-switch: dimension 0 (gender) is wired to respond strongly to the "He" embedding, while dimensions 1–3 are wired to stay mostly retained.

zf = Wf·[hₜ₋₁, xₜ] + bf = [-4.60, 2.21, 2.47, 2.63]

fₜ = σ(zf) = [0.0100, 0.9011, 0.9220, 0.9328]

Cₜ_retained = fₜ ⊙ Cₜ₋₁:

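| dim        | fₜ     | Cₜ₋₁ | Cₜ_retained | % retained |
|------------|--------|------|-------------|------------|
| 0 (gender) | 0.0100 | 0.6  | 0.0060      | 1.0%       |
| 1          | 0.9011 | -0.4 | -0.3605     | 90.1%      |
| 2          | 0.9220 | 0.8  | 0.7376      | 92.2%      |
| 3          | 0.9328 | 0.2  | 0.1866      | 93.3%      |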

fₜ[0] = 0.01 means the gender dimension retains only 1% of its previous value — the "she" signal is effectively erased the instant "He" appears. Dimensions 1–3, encoding whatever else was relevant (verb tense, topic), stay almost fully intact because nothing in this input suggests they should change.
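To see where zf[0] = -4.60 comes from, the sketch below splits row 0 of Wf (as listed in the Code section) into the part that multiplies hₜ₋₁ and the part that multiplies xₜ. The names `w_h`, `w_x`, and `b0` are introduced here for illustration only.

```python
import numpy as np

h_prev = np.array([0.3, -0.1, 0.2, 0.5])
x_t = np.array([0.8, 0.1, -0.3, 0.6])

# Row 0 of Wf, split into its h-part and x-part (illustrative names)
w_h = np.array([-2.0, -2.0, -2.0, -2.0])
w_x = np.array([-3.0, -1.0, -1.0, -1.0])
b0 = 0.0

from_h = w_h @ h_prev        # -2 * (0.3 - 0.1 + 0.2 + 0.5) = -1.8
from_x = w_x @ x_t           # -3*0.8 - (0.1 - 0.3 + 0.6)   = -2.8
z0 = from_h + from_x + b0    # -4.6
f0 = 1.0 / (1.0 + np.exp(-z0))

print(round(from_h, 2), round(from_x, 2), round(z0, 2), round(f0, 4))
```

With these weights the "He" embedding contributes the larger share of the negative pre-activation (-2.8 versus -1.8 from the hidden state), which is what drives the gate to about 0.01.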

[Figure: Forget Gate, Selective Erasure of Cell State. Cₜ₋₁ (4 dims) is multiplied element-wise by fₜ to give Cₜ_retained; "He" triggers near-total erasure of the gender dimension only.]

Why Sigmoid is the Right Choice

Sigmoid's output range (0, 1) directly matches "how much to keep" semantics — there's no need to rescale or reinterpret the output. It's also differentiable everywhere, which a hard 0/1 switch would not be — gradients can flow through the gate during training, letting the network learn when to forget instead of following a fixed rule.
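The differentiability point can be checked numerically. This sketch uses the standard identity σ'(z) = σ(z)(1 − σ(z)) and evaluates it at two pre-activations from the worked example plus z = 0; the helper name `sigmoid_grad` is introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of sigmoid with respect to z

# dim-0 and dim-1 pre-activations from the example, plus z = 0
z = np.array([-4.60, 0.0, 2.21])
print(np.round(sigmoid_grad(z), 4))
```

The gradient peaks at 0.25 when z = 0 and shrinks as the gate saturates, but it never reaches exactly zero. A hard 0/1 step, by contrast, has zero gradient everywhere except at the jump, where it is undefined.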

In practice the gate tends to become confident (values near 0 or near 1) for clear cases — like the gender switch above — while staying intermediate (0.3–0.7) when the training signal for whether to forget is ambiguous.


What the Forget Gate Learns

There's no hand-coded rule for "erase gender on pronoun switch". The network discovers this behavior from training data, because forgetting the stale gender signal reduces prediction error on the sentences that follow. In trained LSTMs, forget gates tend to stay near 1 within a single short sentence (nothing needs erasing yet) and to drop toward 0 at points that correlate with topic shifts, sentence boundaries, or contradicting information, which is often where a human reader would also mentally "reset."


Code

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h_prev = np.array([0.3, -0.1, 0.2, 0.5])   # hidden state right before "He"
x = np.array([0.8, 0.1, -0.3, 0.6])        # embedding for "He"
C_prev = np.array([0.6, -0.4, 0.8, 0.2])   # dim 0 carries the gender signal

# Weights hand-tuned to isolate a clear gender-erasure signal on dim 0
Wf = np.array([
    [-2.0, -2.0, -2.0, -2.0, -3.0, -1.0, -1.0, -1.0],
    [ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1],
    [ 0.5, -0.5,  0.3, -0.2,  0.4, -0.1,  0.2,  0.1],
    [ 0.3,  0.3,  0.3,  0.3,  0.3,  0.3,  0.3,  0.3],
])
bf = np.array([0.0, 2.0, 2.0, 2.0])

inp = np.concatenate([h_prev, x])   # [h_prev, x], 8 values
z_f = Wf @ inp + bf
f_t = sigmoid(z_f)
C_retained = f_t * C_prev           # element-wise scaling of the old cell state

print("Forget gate f_t:", np.round(f_t, 4))
print("C_prev:          ", C_prev)
print("C retained:       ", np.round(C_retained, 4))
print("% retained per dim:", np.round(f_t*100, 1))
```

```text
Forget gate f_t: [0.01   0.9011 0.922  0.9328]
C_prev:           [ 0.6 -0.4  0.8  0.2]
C retained:        [ 0.006  -0.3605  0.7376  0.1866]
% retained per dim: [ 1.  90.1 92.2 93.3]
```

Hyperparameter Sensitivity: Forget Bias Offset

A common initialization trick adds a constant offset to every entry of bf so the forget gate starts out biased toward "keep everything" early in training (it's easier to learn to forget selectively later than to recover information erased too early). Applying that same offset here shows both the benefit and the failure mode.

```python
bf_base = np.array([0.0, 2.0, 2.0, 2.0])  # bf from the code above

# Reuses Wf, inp, and sigmoid from the previous block
for offset in [-2, 0, 2, 4]:
    bf = bf_base + offset
    z_f = Wf @ inp + bf
    f_t = sigmoid(z_f)
    print(f"offset={offset:+d}  f_t={np.round(f_t, 4)}")
```

```text
offset=-2  f_t=[0.0014 0.5523 0.6154 0.6525]
offset=+0  f_t=[0.01   0.9011 0.922  0.9328]
offset=+2  f_t=[0.0691 0.9854 0.9887 0.9903]
offset=+4  f_t=[0.3543 0.998  0.9985 0.9987]
```

At offset -2, dimensions 1–3 drop from roughly 90–93% retained to 55–65% retained: information that should have survived gets partially erased along with the gender signal. At offset +2 the gender dimension still erases cleanly (fₜ[0] = 0.069) while the other dimensions saturate closer to 1, which is the intended effect of the trick. Push the offset too far, to +4, and fₜ[0] climbs to 0.35: the bias term now cancels most of the weighted input, so even a strong erasure signal leaves about a third of the stale gender value in place. Too negative an offset erases indiscriminately; too positive an offset makes the gate unable to forget even when the input clearly calls for it.
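One way to make "too negative" and "too positive" concrete is to scan a finer grid of offsets and keep those where dimension 0 is still erased and dimensions 1–3 are still kept. The two thresholds below (0.10 and 0.95) are arbitrary choices for this sketch, not values from the post, and the result applies only to this hand-tuned example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# zf from the worked example (offset 0); adding an offset to bf shifts every entry equally
z_base = np.array([-4.60, 2.21, 2.47, 2.63])

# Illustrative thresholds, chosen for this sketch only
ERASE_BELOW = 0.10    # dim 0 counts as "erased" below this
KEEP_ABOVE = 0.95     # dims 1-3 count as "kept" above this

offsets = np.linspace(-2.0, 4.0, 61)   # step of 0.1
ok = []
for off in offsets:
    f_t = sigmoid(z_base + off)
    if f_t[0] < ERASE_BELOW and f_t[1:].min() > KEEP_ABOVE:
        ok.append(off)

print(f"usable offsets: {min(ok):.1f} to {max(ok):.1f}")
# usable offsets: 0.8 to 2.4
```

Under these assumed thresholds the usable window is narrow, and the offset of +2 from the table sits inside it while -2, 0, and +4 do not.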


Where this builds from: The full LSTM equation set (post 02) placed the forget gate as the first of four operations acting on the cell state; this post isolates it. Sigmoid's (0,1) range and its role as a differentiable valve were introduced in the activations section of this series.

Where this leads: The forget gate only decides what's erased — it never adds anything back. The input gate and candidate memory (next post) work in tandem with the forget gate's output to complete the cell state update: Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ.
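A rough sketch of how the retained state feeds the full update. The forget-gate values are the ones computed in this post; `i_t` and `C_tilde` are made-up placeholders, since the input gate and candidate memory are not computed here.

```python
import numpy as np

f_t = np.array([0.0100, 0.9011, 0.9220, 0.9328])   # forget gate from this post
C_prev = np.array([0.6, -0.4, 0.8, 0.2])

# Hypothetical values: the input gate and candidate memory are NOT computed in this post
i_t = np.array([0.9, 0.1, 0.1, 0.1])
C_tilde = np.array([-0.7, 0.2, 0.0, 0.1])

retained = f_t * C_prev          # what survives from before
written = i_t * C_tilde          # what gets added
C_t = retained + written         # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

print(np.round(C_t, 4))
```

With these placeholder values, dimension 0 ends up negative: the old gender signal is almost entirely gone, and the new value comes from the written term alone.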


Honest Limitations

A forget gate can learn to never forget (fₜ ≈ 1 for all dimensions, always) if the training data never actually requires erasing old information — for instance, on sequences that are always short or never contain topic shifts. In that regime the forget gate contributes nothing beyond a near-identity pass-through, and the model gets no benefit from having a forget mechanism at all; checking gate value statistics during training reveals this.
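Checking gate statistics might look like the sketch below. The activations are simulated to mimic a gate that never forgets; in a real run they would be recorded from the model. The array layout (timesteps × cell dimensions) and the 0.95 cutoff are assumptions made for this example.

```python
import numpy as np

# Hypothetical log of forget-gate activations, shape (timesteps, cell_dims).
# Simulated here; a real check would record these from the model.
rng = np.random.default_rng(0)
gates = rng.uniform(0.97, 1.0, size=(200, 4))   # mimics a gate that never forgets

mean_per_dim = gates.mean(axis=0)
min_per_dim = gates.min(axis=0)
frac_saturated = (gates > 0.95).mean()          # share of activations above the assumed cutoff

print("mean per dim:", np.round(mean_per_dim, 3))
print("min per dim: ", np.round(min_per_dim, 3))
print("fraction > 0.95:", frac_saturated)
```

If the minimum per dimension never dips meaningfully below 1 across a whole evaluation set, the gate is acting as a pass-through in the sense described above.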

Learning to forget at the right moment requires enough training examples of the actual context switch — a model that has rarely seen pronoun/subject changes in training data may fail to drop the forget gate at exactly the point a human reader would expect, producing subtly wrong long-range predictions that are hard to diagnose without inspecting gate activations directly.

The cell state has a fixed dimension, set once as a hyperparameter. If the sequence's true long-term memory requirements exceed that capacity, the network has no choice but to erase older information to make room for new, regardless of how well the forget gate has learned to forget selectively. Scaling cell-state size helps only up to the point where compute and overfitting risk make it impractical.


Test Your Understanding

  1. Why does the forget gate use element-wise multiplication with Cₜ₋₁ instead of, say, subtracting a learned vector from it?

  2. In the anchor computation, dimension 0 had zf = -4.60 giving fₜ[0] = 0.01. What zf value would be needed to get fₜ[0] = 0.5 (50% retained), and what does a gate value of exactly 0.5 imply the network has "decided" about that dimension?

  3. Suppose Cₜ₋₁ = [0.6, -0.4, 0.8, 0.2] and the forget gate instead produced fₜ = [1.0, 1.0, 1.0, 1.0] for every timestep across a very long sequence. What does this imply about the cell state's growth over time, and what problem might that eventually cause even though gradients aren't vanishing?

  4. A model shows forget gate values consistently in the 0.4–0.6 range across an entire evaluation set, never approaching 0 or 1. Is this necessarily a sign the model is undertrained, or could it indicate something else about the task?

  5. Given that the forget gate and input gate (next post) both operate on the same concatenated input [hₜ₋₁, xₜ] but serve opposite purposes (erase vs. add), could a single gate handle both jobs by outputting values in [-1, 1] instead of two separate [0,1] gates? What would break about the cell state update if this were tried?
