Encoder-Decoder (Seq2Seq) — In-depth Intuition

Jul 3, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Every architecture covered so far — classification, tagging, even bidirectional context — assumes there's a natural one-to-one or fixed relationship between input and output. Translation breaks that assumption outright: "I love deep learning" (4 tokens) becomes "J'aime l'apprentissage profond" (3 tokens) in French. There's no position-by-position mapping possible. Seq2Seq exists specifically to decouple input length from output length.

Anchor: machine translation.

Input: [I, love, deep, learning] (4 tokens)
Output: [J'aime, l'apprentissage, profond] (3 tokens)


The Problem: Variable-Length Input and Output

A standard RNN produces one output per input timestep. That works for tagging (post 01's NER example: one tag per word) but not for translation, where the output length is not tied to the input length. Summarization and speech recognition have the same shape: input length and output length vary independently. Seq2Seq solves this with two separate components instead of one shared recurrence.

[Figure: Encoder-Decoder, with the context vector as an information bottleneck. The encoder reads "I love deep learning" into the context vector c (4-dim, fixed); the decoder starts from SOS and emits "J'aime l'apprentissage profond". Every bit of encoder information must pass through this one vector.]

The Encoder

The encoder processes the input sequence token by token, producing a hidden state at every step, and keeps only the final hidden state as a fixed-size summary:

Encoder LSTM processes [I, love, deep, learning] → h₁, h₂, h₃, h₄

c = h₄ — the context vector. Everything the encoder learned about the entire input sentence has to be compressed into this single fixed-size vector, regardless of whether the input sentence was 4 tokens or 400. If h_dim = 256, c is always a 256-dimensional vector.
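The fixed-size property is easy to check by running one recurrence over a short and a long input and comparing shapes. The sketch below is a minimal illustration with random embeddings and untrained weights. The `encode` helper, the separate recurrent matrix `W_h`, the 3-dim embeddings, and the 0.1 weight scale are all assumptions made for the example, and a plain tanh update stands in for the LSTM cell.

```python
import numpy as np

def encode(tokens, W_x, W_h):
    """Run a plain tanh RNN over `tokens` and return only the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

rng = np.random.default_rng(0)
emb_dim, h_dim = 3, 256
W_x = rng.normal(size=(h_dim, emb_dim)) * 0.1   # input-to-hidden weights (untrained)
W_h = rng.normal(size=(h_dim, h_dim)) * 0.1     # hidden-to-hidden weights (untrained)

short_input = rng.normal(size=(4, emb_dim))     # stand-in for a 4-token sentence
long_input = rng.normal(size=(400, emb_dim))    # stand-in for a 400-token input

c_short = encode(short_input, W_x, W_h)
c_long = encode(long_input, W_x, W_h)
print(c_short.shape, c_long.shape)              # (256,) (256,)
```

Both contexts have 256 entries, so the long input gets a hundred times fewer values per token than the short one.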

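Using the anchor with h_dim=4 (small for illustration), each step applies a simplified recurrence, hᵢ = tanh(W·xᵢ + hᵢ₋₁). This plain tanh update stands in for the full LSTM cell and has no separate recurrent weight matrix; it matches the code further down, which is where the numbers come from: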

| Step | Input token | Formula | Result (hᵢ) |
|------|-------------|---------|-------------|
| 1 | I | tanh(W·x₁ + h₀), h₀ = [0, 0, 0, 0] | [-0.0731, 0.0064, 0.0692, -0.1040] |
| 2 | love | tanh(W·x₂ + h₁) | [0.0489, -0.0628, -0.0702, -0.1063] |
| 3 | deep | tanh(W·x₃ + h₂) | [0.0212, -0.2393, -0.3720, -0.0697] |
| 4 | learning | tanh(W·x₄ + h₃) | [0.2006, -0.2332, -0.4004, -0.0187] |

c = h₄ = [0.2006, -0.2332, -0.4004, -0.0187]


The Decoder

The decoder takes c as its initial hidden state and generates output tokens one at a time, autoregressively:

hᵢ_dec = LSTM(yᵢ₋₁, hᵢ₋₁_dec), ŷᵢ = softmax(W·hᵢ_dec)

Each output token is generated using the previous output token as input. The decoder has to produce its own sequence step by step, whereas the encoder has the complete input available before it starts reading.
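The feedback loop can be sketched as a small greedy decoder that embeds its own previous prediction and stops when it emits an end-of-sequence token. Everything here is illustrative: `greedy_decode`, the embedding table `E`, the token ids chosen for SOS and EOS, and the random untrained weights are assumptions, and a plain tanh cell stands in for the LSTM.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(context, E, W_x, W_h, W_out, sos_id, eos_id, max_len=10):
    """Generate token ids one at a time, feeding each prediction back in."""
    h = context.copy()                 # decoder state starts from the context vector
    prev = sos_id
    output = []
    for _ in range(max_len):
        h = np.tanh(W_x @ E[prev] + W_h @ h)   # simplified cell in place of an LSTM
        probs = softmax(W_out @ h)
        prev = int(np.argmax(probs))           # greedy choice becomes the next input
        if prev == eos_id:                     # output length is decided by the decoder
            break
        output.append(prev)
    return output

rng = np.random.default_rng(0)
vocab_size, emb_dim, h_dim = 6, 3, 4
E = rng.normal(size=(vocab_size, emb_dim))          # token embedding table
W_x = rng.normal(size=(h_dim, emb_dim)) * 0.1
W_h = rng.normal(size=(h_dim, h_dim)) * 0.1
W_out = rng.normal(size=(vocab_size, h_dim)) * 0.1
context = rng.normal(size=h_dim) * 0.1              # stand-in for the encoder's c

tokens = greedy_decode(context, E, W_x, W_h, W_out, sos_id=0, eos_id=1)
print(tokens)
```

The loop never looks at the input length. The output ends when the decoder predicts EOS or hits `max_len`, which is how the two lengths are decoupled in practice.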


Training vs Inference

| Mode | Decoder input at step i | Why |
|------|-------------------------|-----|
| Training | Ground truth yᵢ₋₁ (teacher forcing) | Faster convergence, avoids error accumulation |
| Inference | Predicted ŷᵢ₋₁ (greedy or beam) | No ground truth available |

During training, the decoder is fed the correct previous token regardless of what it would have predicted — this is teacher forcing, and it prevents early mistakes from cascading into a training signal that's dominated by compounding errors. At inference, there is no ground truth to feed — the decoder must use its own (possibly wrong) previous prediction as the next input, which is exactly the exposure-bias gap flagged in post 06's limitations.
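The difference between the two modes comes down to which token is fed into the next step. The sketch below makes that visible with a toy stand-in for the decoder. `decoder_inputs` and `toy_model` are hypothetical helpers, and the lookup table is invented so that the model gets the first token wrong.

```python
def decoder_inputs(target, predict_next, teacher_forcing, sos="<sos>"):
    """Return the token fed to the decoder at each step.

    `predict_next` is a stand-in for one decoder step: it maps the previous
    input token to the token the model would predict next.
    """
    inputs, prev = [], sos
    for gold in target:
        inputs.append(prev)
        predicted = predict_next(prev)
        # Teacher forcing feeds the gold token; free-running feeds the prediction.
        prev = gold if teacher_forcing else predicted
    return inputs

def toy_model(prev):
    # Invented behaviour: wrong after <sos>, correct after a correct input.
    table = {"<sos>": "Je", "J'aime": "l'apprentissage", "l'apprentissage": "profond"}
    return table.get(prev, "<unk>")

target = ["J'aime", "l'apprentissage", "profond"]
forced = decoder_inputs(target, toy_model, teacher_forcing=True)
free = decoder_inputs(target, toy_model, teacher_forcing=False)
print(forced)  # ['<sos>', "J'aime", "l'apprentissage"]
print(free)    # ['<sos>', 'Je', '<unk>']
```

With teacher forcing the first mistake is corrected before the next step. Without it, the wrong token becomes the next input and the decoder lands in a state it never saw during training.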


Beam Search (Brief)

Greedy decoding always picks the single highest-probability token at each step — fast, but locally optimal decisions can lead to a globally worse full sequence.

Beam search keeps the top-k partial sequences at each step (not just the single best), expanding each candidate and re-pruning to the top-k again, only committing to a final sequence at the end. With beam=2 on the anchor translation, two candidate partial translations would be tracked simultaneously at every decoding step — for instance, one starting with "J'aime" and another with a close second-choice token — and the sequence with the highest overall probability across all 3 steps is chosen at the end, rather than whichever looked best one step at a time.
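The expand-and-prune loop described above can be sketched in a few lines. This is a minimal version that scores sequences by summed log-probability and ignores end-of-sequence handling and length normalization. The `beam_search` function and the `toy_next_probs` table are assumptions, with abstract tokens and made-up probabilities rather than output from a real model.

```python
import math

def beam_search(next_probs, beam_width, steps):
    """Keep the `beam_width` best partial sequences by total log-probability."""
    beams = [((), 0.0)]                          # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in next_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]          # prune back to the top-k
    return beams[0]

def toy_next_probs(prefix):
    # Made-up distribution: "a" looks best at step 1 but leads to weaker continuations.
    table = {
        (): {"a": 0.55, "b": 0.45},
        ("a",): {"x": 0.6, "y": 0.4},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table[prefix]

greedy_seq, greedy_logp = beam_search(toy_next_probs, beam_width=1, steps=2)
beam_seq, beam_logp = beam_search(toy_next_probs, beam_width=2, steps=2)
print(greedy_seq, round(math.exp(greedy_logp), 3))  # ('a', 'x') 0.33
print(beam_seq, round(math.exp(beam_logp), 3))      # ('b', 'x') 0.405
```

Setting `beam_width=1` reduces the same function to greedy decoding, which makes the comparison direct: the only thing that changes is how many candidates survive each pruning step.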


Architecture Variants

Bidirectional encoder (common): the encoder itself can be bidirectional (post 01) — since the entire input is available upfront, there's no streaming constraint against it. Context becomes c = [h→_T; h←_1], combining the final forward state with the final backward state.
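As a rough sketch of the bidirectional context, the snippet below runs one pass left to right and one right to left with separate weights, then concatenates the two final states. The `run_rnn` helper, the plain tanh cell, and the random untrained weights are assumptions for illustration.

```python
import numpy as np

def run_rnn(tokens, W_x, W_h):
    """Return the final hidden state of a plain tanh RNN over `tokens`."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

rng = np.random.default_rng(0)
emb_dim, h_dim = 3, 4
tokens = rng.normal(size=(4, emb_dim))           # stand-in embeddings for 4 tokens

# Each direction has its own parameters.
Wx_f = rng.normal(size=(h_dim, emb_dim)) * 0.1
Wh_f = rng.normal(size=(h_dim, h_dim)) * 0.1
Wx_b = rng.normal(size=(h_dim, emb_dim)) * 0.1
Wh_b = rng.normal(size=(h_dim, h_dim)) * 0.1

h_fwd = run_rnn(tokens, Wx_f, Wh_f)              # forward pass ends at token T
h_bwd = run_rnn(tokens[::-1], Wx_b, Wh_b)        # backward pass ends at token 1
c = np.concatenate([h_fwd, h_bwd])
print(c.shape)                                   # (8,)
```

The context is now 2 × h_dim wide, so the decoder either needs a hidden size of 2 × h_dim or a learned projection back down to h_dim before c can serve as its initial state.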

Stacked layers: 2–4 layer encoders and decoders are typical, following the same depth-adds-abstraction reasoning from post 07.

GRU instead of LSTM: often near-equivalent performance with fewer parameters, following post 08's GRU tradeoffs.
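The "fewer parameters" point for the GRU can be made concrete with a per-layer count. The sketch assumes the textbook formulation: four weight blocks for an LSTM and three for a GRU, each with input weights, recurrent weights, and one bias vector. The helper names are illustrative, and framework implementations may count biases differently.

```python
def lstm_params(input_dim, h_dim):
    """Four blocks (input, forget, output gates plus candidate), one bias each."""
    return 4 * (h_dim * input_dim + h_dim * h_dim + h_dim)

def gru_params(input_dim, h_dim):
    """Three blocks (update and reset gates plus candidate), one bias each."""
    return 3 * (h_dim * input_dim + h_dim * h_dim + h_dim)

input_dim, h_dim = 256, 256
print(lstm_params(input_dim, h_dim))  # 525312
print(gru_params(input_dim, h_dim))   # 393984
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.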


Code

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(42)
h_dim, vocab_size = 4, 6  # small for illustration

# Encoder: 4 input tokens → context vector
encoder_input = np.random.randn(4, 3)  # 4 tokens, 3-dim embeddings
W_enc = np.random.randn(h_dim, 3) * 0.1

h = np.zeros(h_dim)
for x in encoder_input:
    h = np.tanh(W_enc @ x + h)  # simplified (no separate Wh for brevity)
context = h
print("Context vector c:", np.round(context, 4))

# Decoder: 3 output tokens
W_dec = np.random.randn(h_dim, h_dim) * 0.1
W_out = np.random.randn(vocab_size, h_dim) * 0.1
h_dec = context.copy()  # decoder state starts from the context vector

sos = np.zeros(h_dim)  # start-of-sequence input, represented as a zero vector
h_dec = np.tanh(W_dec @ sos + h_dec)
for step in range(3):
    logits = W_out @ h_dec
    probs = softmax(logits)
    pred = np.argmax(probs)
    print(f"Step {step+1}: token={pred}, probs={np.round(probs,3)}")
    # Simplification: the predicted token is not embedded and fed back in;
    # only the previous hidden state and the context drive the next step.
    h_dec = np.tanh(W_dec @ h_dec + context)  # feed context at each step
```

```text
Context vector c: [ 0.2006 -0.2332 -0.4004 -0.0187]
Step 1: token=5, probs=[0.169 0.167 0.173 0.156 0.163 0.173]
Step 2: token=2, probs=[0.169 0.168 0.172 0.157 0.164 0.171]
Step 3: token=2, probs=[0.169 0.168 0.171 0.157 0.164 0.171]
```

With small random weights and no training, the probabilities are nearly uniform across the vocabulary — this code demonstrates the mechanics (context vector construction, autoregressive decoding steps), not a trained model's actual translation quality.


Hyperparameter Sensitivity: Context Vector Size (h_dim)

h_dim controls how much of the input sentence the context vector can retain. Too small and the encoder is forced to throw away information before the decoder ever sees it; too large and training cost grows with no benefit on a short anchor sentence like this one:

```python
# Reuses `np` and `encoder_input` from the code block above.
for h_dim in [1, 2, 4, 8, 16, 64]:
    np.random.seed(42)  # same weight draw per h_dim for a fair comparison
    W_enc = np.random.randn(h_dim, 3) * 0.1
    h = np.zeros(h_dim)
    for x in encoder_input:
        h = np.tanh(W_enc @ x + h)
    print(f"h_dim={h_dim:>2}: ||c||={np.linalg.norm(h):.4f}, c[:2]={np.round(h[:2], 4)}")
```

```text
h_dim= 1: ||c||=0.1688, c[:2]=[0.1688]
h_dim= 2: ||c||=0.5564, c[:2]=[0.1688 0.5302]
h_dim= 4: ||c||=0.8137, c[:2]=[0.1688 0.5302]
h_dim= 8: ||c||=0.9578, c[:2]=[0.1688 0.5302]
h_dim=16: ||c||=1.1391, c[:2]=[0.1688 0.5302]
h_dim=64: ||c||=2.3519, c[:2]=[0.1688 0.5302]
```

At h_dim=1, the entire 4-token sentence collapses to a single scalar, so the decoder has exactly one number to condition on and many different input sentences have to share nearby values. As h_dim grows, the first two coordinates stay put (the seed is reset each time, so the first rows of W_enc are identical) and new coordinates are appended. The norm climbs mainly because more components are being summed, not because these untrained weights have learned anything. What the experiment does show is capacity: more coordinates give the context more room to encode distinguishable structure, which is why real translation systems use h_dim in the hundreds (256–1024), not single digits. Past a certain point, most of a short sentence's information already fits in a moderate h_dim; the added dimensions mostly carry redundant or noisy directions, so scaling further buys little without a proportionally longer or more complex input to justify it.


Where this builds from: The encoder and decoder are each ordinary LSTM/GRU stacks (post 02, post 08) — the seq2seq framework's novelty is entirely in how they're connected through the context vector, not in the recurrent cells themselves. A bidirectional encoder (post 01) is a direct, common enhancement.

Where this leads: The context vector bottleneck described here is the central weakness explored in the next post — one fixed-size vector cannot represent arbitrarily long input sequences without losing information, and that specific problem motivates the attention mechanism, covered in this site's Transformers series.


Honest Limitations

The fixed-size context vector cannot scale to long input sequences without losing information. Compressing a 4-token sentence into a 256-dim vector loses relatively little, but compressing a 100-token paragraph into the same 256-dim vector forces the encoder to discard much of the detail. Early seq2seq translation models without attention were reported to lose BLEU as sentences got longer, with the drop becoming noticeable once input length exceeds roughly 20 tokens.

The decoder uses the exact same context vector c at every single output step. It cannot selectively focus on the part of the input most relevant to the token it is currently generating (e.g., attending more to "deep" when generating "profond"). This is the specific problem attention mechanisms solve, discussed in the next post.


Test Your Understanding

  1. Why must the encoder's final hidden state (not, say, the average of all hidden states) serve as the context vector in the basic seq2seq architecture — what property of the recurrence makes h_T special?

  2. Given h_dim=256, compute how many total values the context vector must represent regardless of whether the input sentence is 5 tokens or 50 tokens. What does this imply about the "information per input token" the context vector can preserve as sentences get longer?

  3. In the anchor's code output, both steps 2 and 3 predict the same token (token=2). Given that the decoder feeds back h_dec (not the ground-truth translation) at each step during untrained inference, explain mechanically why the decoder might get "stuck" producing the same or similar predictions repeatedly.

  4. A team trains a seq2seq translation model exclusively with teacher forcing and never evaluates using greedy or beam decoding during development. At deployment, translation quality is noticeably worse than validation loss suggested. What mismatch between training and inference conditions, described in this post, likely explains the gap?

  5. Beam search with beam=2 tracks two candidate partial sequences instead of one. Construct a short example (using the anchor's translation task) where greedy decoding's single top-1 choice at an early step could lock in a suboptimal full sentence, while beam search's broader tracking would recover the better final translation.
