The cell state now holds everything the LSTM has decided is worth remembering: long-term signal accumulated through the forget and input gates. Not all of that stored knowledge is relevant to what needs to be output right now. The output gate is the final filter. It decides which parts of the cell state get exposed as this timestep's hidden state, the value that the next layer and the prediction head see and that feeds the gate computations at the next timestep.
Anchor: continuing from the previous two posts, with Cₜ = [0.2111, -0.2250, 0.4617, 0.1539] now computed. This post finishes the timestep by computing oₜ and hₜ.
Formula
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)
hₜ = oₜ ⊙ tanh(Cₜ)
What It Does
Cₜ contains everything the LSTM currently "knows" — but its raw values can be any magnitude, since it's built by repeated addition. tanh(Cₜ) first squashes those values into (-1, 1), giving a bounded, centered representation. The output gate oₜ then selects, dimension by dimension, how much of that squashed cell state gets exposed as hₜ — the same "valve" semantics as the forget and input gates, just applied at the exit point instead of the entry.
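The two-stage readout described here can be sketched in a few lines. The values below are made up for illustration (C_demo and gate_mixed are assumptions, not the anchor's numbers), and the gate settings are idealized: a real sigmoid never outputs exactly 0 or 1.

```python
import numpy as np

# Hypothetical cell state with some large-magnitude entries (not the anchor's values).
C_demo = np.array([3.0, -0.5, 0.1, -4.0])

# Stage 1: tanh bounds every entry to (-1, 1) while keeping its sign.
squashed = np.tanh(C_demo)

# Stage 2: the gate scales each dimension independently.
gate_closed = np.zeros(4)                      # idealized: expose nothing
gate_mixed = np.array([1.0, 0.0, 0.5, 0.25])   # idealized per-dimension settings
h_closed = gate_closed * squashed
h_mixed = gate_mixed * squashed

print("tanh(C):   ", np.round(squashed, 4))
print("h (closed):", np.round(h_closed, 4))
print("h (mixed): ", np.round(h_mixed, 4))
```

With gate_mixed, the first dimension passes through at full strength, the second is hidden entirely, and the last two are attenuated, which is the per-dimension valve behavior in miniature.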
Intuition
Cₜ might simultaneously encode tense, grammatical person, and topic — all useful at different points in a sentence. When predicting the next word after "She ___" (needing to produce "is," not "are" or "am"), only the "3rd-person-singular-present" information matters for that specific prediction. The output gate can learn to expose that slot strongly while suppressing the topic-related dimensions that aren't relevant to picking the correct verb form — even though both are sitting in Cₜ.
Numerical Computation
oₜ = σ(Wo·inp + bo) = [0.4663, 0.5538, 0.5330, 0.5062]
tanh(Cₜ) = [0.2080, -0.2213, 0.4315, 0.1527]
hₜ = oₜ ⊙ tanh(Cₜ) = [0.0970, -0.1225, 0.2300, 0.0773]
The entry at index 2 (zero-based) has the largest magnitude after tanh (0.4315). Because all four output gate values sit close to 0.5, the gate barely changes the ranking, and that entry also ends up as the largest component of the hidden state (0.2300). It is the dimension most strongly expressed in this timestep's output.
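As a quick sanity check, the arithmetic can be redone from the rounded numbers printed in this section. This is only a sketch: because the inputs are already rounded to four decimals, the comparison uses a small tolerance (the 2e-4 value is an arbitrary choice).

```python
import numpy as np

# Rounded values copied from the numerical computation in this section.
o_t = np.array([0.4663, 0.5538, 0.5330, 0.5062])
C_t = np.array([0.2111, -0.2250, 0.4617, 0.1539])
h_reported = np.array([0.0970, -0.1225, 0.2300, 0.0773])

# Readout: squash the cell state, then scale each entry by its gate value.
h_t = o_t * np.tanh(C_t)

# Inputs are rounded to 4 decimals, so compare with a small tolerance.
print("recomputed h_t:", np.round(h_t, 4))
print("matches reported values:", np.allclose(h_t, h_reported, atol=2e-4))
print("most strongly expressed index:", int(np.argmax(np.abs(h_t))))
```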
The Full LSTM Step Summary
| Gate/State | Formula | Purpose | Output range |
|---|---|---|---|
| Forget | fₜ = σ(Wf·[hₜ₋₁,xₜ]+bf) | How much old cell state to keep | (0,1) |
| Input | iₜ = σ(Wi·[hₜ₋₁,xₜ]+bi) | How much new info | (0,1) |
| Candidate | C̃ₜ = tanh(WC·[hₜ₋₁,xₜ]+bC) | What new info | (−1,1) |
| Cell state | Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ | Long-term memory | Any |
| Output | oₜ = σ(Wo·[hₜ₋₁,xₜ]+bo) | What to expose | (0,1) |
| Hidden | hₜ = oₜ⊙tanh(Cₜ) | Short-term output | (−1,1) |
This is the complete LSTM cell: the forget, input, and output gates plus the candidate layer, all computed in one pass.
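To make the "Any" versus (−1,1) ranges in the table concrete, here is a small scalar sketch. The gate and candidate values are hand-picked constants, which is an assumption for illustration only: a trained LSTM recomputes them from hₜ₋₁ and xₜ at every step, and a real sigmoid gate only approaches 1.

```python
import numpy as np

# Hand-picked constants for illustration only; a real LSTM recomputes
# these from [h_prev, x] at every timestep.
f_t = 1.0        # forget gate idealized as fully open: keep all old cell state
i_t = 1.0        # input gate idealized as fully open: admit all of the candidate
C_tilde = 0.9    # candidate value, inside tanh's (-1, 1) range
o_t = 1.0        # output gate idealized as fully open

C = 0.0
history = []
for step in range(1, 11):
    C = f_t * C + i_t * C_tilde   # additive update: grows by 0.9 per step
    h = o_t * np.tanh(C)          # readout: magnitude never exceeds 1
    history.append((step, C, h))

for step, C_val, h_val in history:
    print(f"step {step:2d}  C_t = {C_val:5.2f}  h_t = {h_val:.6f}")
```

The cell state climbs steadily past 1 because each step adds to it, while the hidden state flattens out just below 1 because it always passes through tanh and a gate value of at most 1.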
Code
import numpy as np
def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1 / (1 + np.exp(-z))
# continuing from previous posts, same h_prev, x, C_prev, inp, seeds
np.random.seed(7)
Wf = np.random.randn(4,8)*0.1; bf = np.zeros(4)
Wi = np.random.randn(4,8)*0.1; bi = np.zeros(4)
WC = np.random.randn(4,8)*0.1; bC = np.zeros(4)
Wo = np.random.randn(4,8)*0.1; bo = np.zeros(4)
h_prev = np.array([0.3,-0.1,0.2,0.5])
x = np.array([0.8,0.1,-0.3,0.6])
C_prev = np.array([0.6,-0.4,0.8,0.2])
inp = np.concatenate([h_prev, x])
f_t = sigmoid(Wf@inp+bf)
i_t = sigmoid(Wi@inp+bi)
C_tilde = np.tanh(WC@inp+bC)
C_t = f_t*C_prev + i_t*C_tilde
o_t = sigmoid(Wo@inp+bo)
h_t = o_t * np.tanh(C_t)
print("Output gate o_t:", np.round(o_t, 4))
print("tanh(C_t): ", np.round(np.tanh(C_t), 4))
print("h_t: ", np.round(h_t, 4))
print("\nFull LSTM step summary:")
print(f" f_t: {np.round(f_t,4)}")
print(f" i_t: {np.round(i_t,4)}")
print(f" C̃_t: {np.round(C_tilde,4)}")
print(f" C_t: {np.round(C_t,4)}")
print(f" o_t: {np.round(o_t,4)}")
print(f" h_t: {np.round(h_t,4)}")

Output gate o_t: [0.4663 0.5538 0.533 0.5062]
tanh(C_t): [ 0.208 -0.2213 0.4315 0.1527]
h_t: [ 0.097 -0.1225 0.23 0.0773]
Full LSTM step summary:
f_t: [0.477 0.4903 0.5527 0.5143]
i_t: [0.4295 0.4924 0.5149 0.5165]
C̃_t: [-0.1749 -0.0586 0.038 0.0988]
C_t: [0.2111 -0.225 0.4617 0.1539]
o_t: [0.4663 0.5538 0.533 0.5062]
h_t: [0.097 -0.1225 0.23 0.0773]
Related Concepts
Where this builds from: The forget gate (post 03) and input gate (post 04) together produced Cₜ — the output gate reads from that result but never modifies it further; it only controls the readout.
Where this leads: With the forget, input, and output gates now covered individually, the LSTM training process (post 06) shows how backpropagation through time updates all these weight matrices together. LSTM variants (post 07) survey structural changes to this same cell.
Honest Limitations
The output gate only controls what gets exposed from Cₜ — it has no ability to modify what's actually stored there. If the cell state has accumulated noisy or stale information in a dimension the output gate happens to suppress this timestep, that information is still sitting in Cₜ, available (correctly or incorrectly) to leak out at a later timestep when the output gate's decision changes.
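A tiny two-timestep sketch of this read-only behavior, using hypothetical numbers throughout (the stored value 2.0, the gate values 0.01 and 0.99, and the idealized f = 1, i = 0 carry-over are all assumptions for illustration):

```python
import numpy as np

# Hypothetical values: one memory dimension carried across two timesteps.
C_t = np.array([2.0])            # stored content (assumed for illustration)

# Timestep t: output gate nearly closed, so almost nothing is exposed.
o_closed = np.array([0.01])
h_t = o_closed * np.tanh(C_t)

# The readout did not touch the cell state. Assume the next step's forget
# gate keeps it (f = 1) and the input gate adds nothing (i = 0).
C_next = 1.0 * C_t + 0.0 * np.array([0.5])

# Timestep t+1: the gate opens and the same stored content appears in h.
o_open = np.array([0.99])
h_next = o_open * np.tanh(C_next)

print("h at t   (gate nearly closed):", np.round(h_t, 4))
print("C carried to t+1:             ", C_next)
print("h at t+1 (gate nearly open):  ", np.round(h_next, 4))
```

Whatever was stored survives the suppressed step unchanged, so whether it is signal or stale noise, it reappears as soon as the gate opens.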
hₜ = oₜ⊙tanh(Cₜ) squashes cell-state values into (-1, 1) via tanh, so very large cell-state magnitudes get compressed and lose relative distinction. A cell-state value of 5 and one of 50 both map to tanh outputs extremely close to 1 (tanh(5) ≈ 0.9999). Once Cₜ grows large in some dimension, the hidden state can no longer distinguish "somewhat confident" from "very confident" along that dimension.
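The saturation effect is easy to see numerically. In this sketch the values 5 and 50 come from the text, 0.5 is added for contrast, and the gate value o = 0.5 is a hypothetical choice:

```python
import numpy as np

# Cell-state magnitudes to compare (5 and 50 are from the text; 0.5 is for contrast).
C_values = np.array([0.5, 5.0, 50.0])
squashed = np.tanh(C_values)

for c, t in zip(C_values, squashed):
    print(f"C = {c:5.1f}  tanh(C) = {t:.10f}  gap to 1 = {1.0 - t:.2e}")

# Any output gate value scales both saturated entries almost identically.
o = 0.5  # hypothetical gate value
print("h for C = 5: ", o * squashed[1])
print("h for C = 50:", o * squashed[2])
```

The step from 0.5 to 5 changes the tanh output substantially, while the tenfold step from 5 to 50 changes it by less than one part in a thousand, so no gate setting can recover that difference in hₜ.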
Test Your Understanding
- Why is tanh applied to Cₜ before the output gate multiplies it, rather than applying the output gate directly to the raw Cₜ values?
- Using the anchor's Cₜ = [0.2111, -0.2250, 0.4617, 0.1539], if oₜ were instead [1.0, 1.0, 1.0, 1.0] (fully open on every dimension), what would hₜ equal? How does this compare to the actual hₜ computed with the learned oₜ?
- The full summary table shows Cₜ has "any" output range while hₜ is bounded to (-1,1). Walk through why the addition-based cell state update permits unbounded growth over many timesteps while the hidden state never does.
- A trained LSTM shows output gate values consistently near 0 for one specific dimension across all timesteps in a sequence, while the corresponding dimension in Cₜ has a large, clearly meaningful magnitude. What does this suggest about how that dimension of memory is being used by the network?
- Suppose two dimensions of Cₜ end up numerically identical after tanh (both saturate to values very close to 1) despite representing very different underlying signals accumulated over many timesteps. What does the output gate's downstream behavior look like for these two dimensions, and what does this reveal about a limitation of tanh saturation on the output path specifically (as opposed to the vanishing gradient issue from post 01)?