Problems with Encoder-Decoder
The seq2seq architecture from the previous post works — and was a genuine breakthrough for machine translation when it appeared. But it has a specific, well-documented failure mode: performance degrades sharply as input sequences get longer, and the degradation isn't a training or data issue — it's structural, baked into the design of squeezing an entire sentence through one fixed-size vector. Understanding exactly why sets up why attention, the next major idea in sequence modeling, looks the way it does.
Problem 1: The Information Bottleneck
Every bit of information about the input sequence has to pass through c — a single fixed-size vector, regardless of input length. For "I love deep learning" (4 tokens) compressed into a 256-dim vector, that's a generous amount of representational room per token. For a 200-token paragraph compressed into the same 256-dim vector, it's a catastrophic squeeze — roughly 50× less representational capacity per token, with no mechanism to allocate more space when the input happens to be longer.
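The per-token arithmetic behind that 50× figure can be sketched in a few lines. This is only a back-of-envelope illustration: `dims_per_token` is a hypothetical helper, and splitting dimensions evenly across tokens is a simplifying assumption, not a description of how an RNN actually allocates capacity.

```python
def dims_per_token(context_dim: int, num_tokens: int) -> float:
    """Crude capacity proxy: context dimensions available per input token."""
    return context_dim / num_tokens

CONTEXT_DIM = 256  # fixed size of c, as in the example above

short = dims_per_token(CONTEXT_DIM, 4)    # "I love deep learning"
long_ = dims_per_token(CONTEXT_DIM, 200)  # a 200-token paragraph

print(f"4 tokens:   {short:.2f} dims per token")
print(f"200 tokens: {long_:.2f} dims per token")
print(f"ratio:      {short / long_:.0f}x")
```

The ratio depends only on the two input lengths (200 / 4), so a larger fixed `CONTEXT_DIM` leaves the relative squeeze unchanged.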
Problem 2: Long-Range Memory in the Encoder
The context vector c = h_T is the encoder's final hidden state. Like any recurrent state, it is built by updating on each new token in turn, so older information gets progressively diluted by everything that came after it. Take a long sentence beginning "She was born in France, lived in Germany, and currently works in...", where "France" sits at position 5 but the context vector isn't formed until position 50: by then "France" has been through 45 additional recurrent updates. Even without full gradient vanishing (LSTM's gates help there), the forward representation of "France" specifically has been mixed with, and increasingly dominated by, everything that followed it.
A simplified linear simulation makes the decay concrete — tracking how much a first-token signal's influence survives after T recurrent steps, using a weight matrix scaled to a realistic contraction (spectral radius 0.9):
    import numpy as np

    def first_token_retention(T, h_dim=4, seed=42):
        np.random.seed(seed)
        A = np.random.randn(h_dim, h_dim)
        eigvals = np.linalg.eigvals(A)
        Wh = A * (0.9 / np.max(np.abs(eigvals)))  # spectral radius 0.9
        # Accumulate Wh^T: the linear map from the first state to the final one.
        M = np.eye(h_dim)
        for _ in range(T):
            M = Wh @ M
        # Spectral norm: the largest factor by which any first-token signal survives.
        return np.linalg.norm(M, ord=2)

    for T in [5, 10, 20, 50]:
        r = first_token_retention(T)
        print(f"T={T:2d}: first-token influence retained in c = {r:.4f}")

Output:

    T= 5: first-token influence retained in c = 1.0974
    T=10: first-token influence retained in c = 0.4753
    T=20: first-token influence retained in c = 0.1433
    T=50: first-token influence retained in c = 0.0061

The value above 1 at T=5 is a transient: the random matrix is non-normal, so the norm of its powers can briefly exceed 1 before the 0.9 spectral radius takes over. By T=50, the first token's influence on the final context vector has shrunk to roughly 0.6% of its original magnitude. An early token like "France" is functionally gone from c by the time the decoder needs to translate the sentence's ending, even though it may be the single most important word for a downstream fact like nationality.
Problem 3: Fixed Context at Every Decoder Step
The decoder receives the exact same c at every single generation step — whether it's producing the first word (which likely needs to identify the subject) or the fifth word (which might need a completely different part of the input, like a location or object). A single fixed vector has no mechanism to be selective across decoding steps; it presents identical information regardless of what the decoder is currently trying to generate.
The Compounding Error Problem
Training uses teacher forcing (post 02): the decoder is always fed the ground-truth previous token, regardless of what it would have actually predicted. At inference, there's no ground truth — the decoder feeds back its own prediction, which may already be wrong. Once an early token in the generated sequence is wrong, every subsequent step conditions on that wrong token, and errors compound rather than self-correct. This mismatch between training conditions (always correct history) and inference conditions (possibly wrong history) is called exposure bias — the model has literally never seen its own mistakes during training, so it has no learned behavior for recovering from them.
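The difference between the two regimes can be seen in a deliberately tiny toy. The sketch below is not a trained model: `MODEL` is a hand-written, hypothetical lookup table standing in for a decoder, with exactly one wrong entry, and the sentence is made up for illustration.

```python
# Toy next-token "decoder": a lookup from previous token to predicted token.
# The table is hand-written for illustration; one entry ("she" -> "lived")
# is deliberately wrong, and nothing steers the model back after it.
MODEL = {
    "<s>": "she",
    "she": "lived",    # the single mistake; ground truth is "lives"
    "lives": "in",
    "in": "paris",
    "paris": "now",
    "lived": "there",  # history never seen in training leads further astray
    "there": "once",
    "once": "upon",
}
TARGET = ["she", "lives", "in", "paris", "now"]

def decode(target, teacher_forcing):
    """Predict len(target) tokens, feeding back ground truth or own output."""
    prev, output = "<s>", []
    for gold in target:
        pred = MODEL.get(prev, "<unk>")
        output.append(pred)
        prev = gold if teacher_forcing else pred
    return output

def count_errors(output, target):
    """Number of positions where the prediction differs from the target."""
    return sum(p != g for p, g in zip(output, target))

forced = decode(TARGET, teacher_forcing=True)
free = decode(TARGET, teacher_forcing=False)
print("teacher forcing:", forced, "errors:", count_errors(forced, TARGET))
print("free running:   ", free, "errors:", count_errors(free, TARGET))
```

With teacher forcing the one bad entry costs one token, because the next step is fed the correct history anyway. In free-running mode the same bad entry derails every later position.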
Quantitative Evidence
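Cho et al. (2014) documented BLEU scores for RNN encoder-decoder models degrading as source sentences grow longer; Sutskever et al. (2014) reported that reversing the source sentence largely avoided this for their deep LSTM. Bahdanau et al. (2015) showed that adding attention recovers most of the lost performance on long sentences. The table below illustrates the shape of that pattern: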
| Sentence length | Seq2Seq BLEU | Seq2Seq+Attention BLEU |
|---|---|---|
| 1–10 | 42.5 | 43.2 |
| 11–20 | 38.1 | 41.7 |
| 21–30 | 26.4 | 38.9 |
| 31+ | 12.8 | 33.1 |
Without attention, BLEU drops from 42.5 to 12.8 — a 70% relative collapse — as sentences grow from short to long. With attention, the same range only drops from 43.2 to 33.1, a much smaller 23% relative decline. The gap between the two columns widens specifically as sentence length increases — exactly matching the bottleneck-severity argument above.
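The percentages and gaps quoted above follow directly from the table. A small sketch of the arithmetic, using only the figures already in the table (`relative_drop` is a hypothetical helper name):

```python
# BLEU figures copied from the table above: (bucket, seq2seq, seq2seq+attention)
ROWS = [
    ("1-10", 42.5, 43.2),
    ("11-20", 38.1, 41.7),
    ("21-30", 26.4, 38.9),
    ("31+", 12.8, 33.1),
]

def relative_drop(first: float, last: float) -> float:
    """Fractional decline from the shortest bucket to the longest."""
    return (first - last) / first

plain_drop = relative_drop(ROWS[0][1], ROWS[-1][1])
attn_drop = relative_drop(ROWS[0][2], ROWS[-1][2])
gaps = [round(attn - plain, 1) for _, plain, attn in ROWS]

print(f"seq2seq relative drop:   {plain_drop:.1%}")
print(f"attention relative drop: {attn_drop:.1%}")
print("attention minus seq2seq per bucket:", gaps)
```

The per-bucket gaps grow monotonically with sentence length, which is the widening the paragraph describes.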
The Solution: Attention (Preview)
Instead of compressing everything into one fixed context vector c used identically at every step, attention computes a different weighted combination of all encoder hidden states at each decoding step:
αᵢₜ = softmax(score(hᵢ_enc, hₜ_dec)) — alignment weights, one per encoder position, computed fresh at every decoder step t
cₜ = Σᵢ αᵢₜ · hᵢ_enc — a dynamic, step-specific context vector
This directly answers all three structural problems above: no single fixed-size bottleneck (every encoder hidden state stays available, not just the final one), no long-range dilution (early tokens' hidden states are directly accessible at any later decoding step, not filtered through T recurrent updates), and genuine per-step selectivity (αᵢₜ changes at every decoding step, letting the decoder focus on whichever input tokens are relevant right now). Full mechanics are covered in this site's Transformers series.
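The two formulas above can be sketched in plain Python. This assumes a simple dot-product score, which is only one possible choice since the post does not fix a score function, and the encoder and decoder states are made-up 2-dimensional vectors chosen for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(enc_states, dec_state):
    """Return (weights, context) for one decoder step.

    Assumption: score(h_enc, h_dec) is a plain dot product.
    """
    weights = softmax([dot(h, dec_state) for h in enc_states])
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context

# Three made-up encoder states and two made-up decoder states.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for t, dec in enumerate([[4.0, 0.0], [0.0, 4.0]], start=1):
    alpha, c_t = attention_context(enc, dec)
    print(f"step {t}: weights={[round(a, 3) for a in alpha]} "
          f"context={[round(x, 3) for x in c_t]}")
```

The encoder states are identical at both steps; only the decoder state changes, and that alone shifts the weights and the resulting context vector.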
Hyperparameter Sensitivity: Does a Bigger Context Vector Fix This?
The obvious fix to try first is making c bigger — if 256 dimensions isn't enough room, use 1024. The retention simulation above isolates whether that actually addresses the decay, by re-running it at different values of h_dim while holding the spectral radius (0.9) and number of steps (T=50) fixed:
    for h_dim in [4, 16, 64, 256]:
        r = first_token_retention(50, h_dim=h_dim)
        print(f"h_dim={h_dim:3d}: first-token influence retained in c = {r:.4f}")

Output:

    h_dim=  4: first-token influence retained in c = 0.0061
    h_dim= 16: first-token influence retained in c = 0.0068
    h_dim= 64: first-token influence retained in c = 0.0059
    h_dim=256: first-token influence retained in c = 0.0063

The retained fraction stays pinned near 0.6% regardless of dimension. The decay is driven by the recurrent weight matrix's spectral radius (0.9) compounding over T=50 steps, which is a property of how many times the state gets overwritten, not of how much room the vector has. Widening c gives each token more space to be represented at any single step, which eases Problem 1, but it does nothing to stop the overwrites that follow from drowning out the first token, which is Problem 2. This is why the "larger hidden dimensions delay the problem but don't eliminate it" limitation below holds: even with h_dim in the thousands, the extra capacity buys a temporary reprieve for moderately long sequences, but the recurrent-overwrite dynamic is untouched, so sufficiently long sequences still overwhelm it.
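There is a simple reason for the dimension-independence: the spectral norm of Wh^T can never fall below the spectral radius raised to the power T, and in the long run it decays at exactly that rate. That expression contains no h_dim at all. A minimal sketch, assuming the same 0.9 spectral radius as the simulation (`asymptotic_retention` is a hypothetical helper; at finite T the actual norm of a non-normal matrix sits somewhat above this floor):

```python
def asymptotic_retention(spectral_radius: float, steps: int) -> float:
    """rho**T: floor and long-run decay rate of ||Wh^T||, independent of h_dim."""
    return spectral_radius ** steps

for T in [5, 10, 20, 50]:
    print(f"T={T:2d}: 0.9**T = {asymptotic_retention(0.9, T):.4f}")
```

At T=50 this gives about 0.0052, just under the simulated values, and the only way to raise it is to change the spectral radius or shorten the sequence.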
Related Concepts
Where this builds from: This post is a direct diagnosis of the encoder-decoder architecture from the previous post — every problem described here traces back to a specific structural choice made there (single context vector, single fixed use of it, teacher-forcing training).
Where this leads: Attention, previewed above, is the mechanism that fixes all three structural problems by replacing the fixed context vector with a dynamically weighted one — the Transformers series on this site covers its full derivation and eventual generalization into self-attention and the Transformer architecture.
Honest Limitations
The information bottleneck causes progressively worse information loss as a direct function of input length — there is no threshold fix within the standard encoder-decoder architecture itself; larger hidden dimensions delay the problem but don't eliminate it, since any fixed-size vector eventually saturates relative to a long enough input.
Exposure bias grows worse specifically for longer output sequences — an early wrong token in a 3-word translation has limited room to compound, but the same wrong token early in a 50-word document summary can derail the entire remaining output, and the fixed context vector provides no mechanism to help the decoder recover once it has drifted from the correct sequence.
Attention solves the bottleneck and selectivity problems but adds memory and compute cost proportional to source length × target length, which is O(T²) when both grow together. Computing alignment weights between every decoder step and every encoder position became its own bottleneck at very long sequence lengths; this is the problem that motivated later efficient-attention variants used in modern Transformer architectures.
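The growth in attention cost is easy to tabulate. The sketch below only counts alignment weights (one per decoder step and encoder position) and ignores the cost of the score function itself; `alignment_weights` is a hypothetical helper.

```python
def alignment_weights(src_len: int, tgt_len: int) -> int:
    """Number of attention weights: one per (decoder step, encoder position)."""
    return src_len * tgt_len

for n in [10, 100, 1000]:
    print(f"T={n:4d}: {alignment_weights(n, n):>9,} weights")
```

A 10× longer sequence on both sides means 100× more weights to compute and store.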
Test Your Understanding
- Why does the information bottleneck problem get worse with input length even though the context vector's dimensionality (e.g., 256) never changes?
- Using the first-token retention simulation's numbers (1.0974 at T=5, 0.0061 at T=50), estimate roughly at what sequence length T the retained influence would drop below 0.10, assuming the same geometric decay rate implied by these values.
- The BLEU comparison table shows attention's advantage widening as sentences get longer (0.7-point gap at 1–10 tokens vs 20.3-point gap at 31+ tokens). Explain why attention's benefit is nearly negligible on short sentences but dramatic on long ones, connecting back to Problem 1.
- A team building a document summarization system (500+ token inputs, 50+ token outputs) is deciding between a plain seq2seq LSTM and a seq2seq+attention model. Using the three problems and the BLEU evidence in this post, make the case for which one is likely to fail and why, specifically identifying which of the three problems is most severe at this scale.
- Exposure bias is described as a training/inference mismatch, not a capacity problem. Suppose a team tries to fix it by training with a larger hidden dimension (more capacity) rather than addressing the training procedure. Explain why increasing hidden dimension would not actually fix exposure bias, using the definition of exposure bias given in this post.