NLL Loss and Perplexity

Jul 3, 2026•6 min read•By Mohammed Vasim

deep-learning · neural-networks · machine-learning · representation-learning

A language model assigns a probability to every possible next token given the context seen so far. Its job during training is to assign high probability to the actual next token in the training data. Negative log-likelihood (NLL) measures how well it does this: for each position in the sequence, take the negative log of the probability assigned to the true token, and sum across all positions. Lower NLL means higher probability assigned to the right tokens.

Perplexity is NLL's interpretable cousin. After computing average NLL per token, exponentiate it: PPL = exp(avg NLL). Intuitively, perplexity is the effective number of equally likely choices the model feels it has at each step. A perplexity of 5 means the model is as uncertain as if it had to randomly choose among 5 equally likely tokens at every position.

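Anchor: a language model predicting the next tokens in "The cat sat". Vocabulary: {The, cat, sat, dog, ran}, V=5, plus an end-of-sequence marker <EOS> that is predicted at the final position but not counted in V.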

text

P(cat | The)           = 0.4
P(sat | The, cat)      = 0.3
P(<EOS> | The, cat, sat) = 0.25

Negative Log-Likelihood

A sentence probability factorizes by the chain rule:

P(sentence) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · ... = Π P(wₜ|w₁,...,wₜ₋₁)

Taking the log turns the product into a sum:

log P = Σₜ log P(wₜ|w₁,...,wₜ₋₁)

Maximizing log P is equivalent to minimizing the negative log-likelihood:

NLL = −Σₜ log P(wₜ|context)

Computing on the anchor (log is the natural logarithm throughout, so NLL is measured in nats):

−log(0.4) = −(−0.9163) = 0.9163
−log(0.3) = −(−1.2040) = 1.2040
−log(0.25) = −(−1.3863) = 1.3863

Total NLL = 0.9163 + 1.2040 + 1.3863 = 3.5066
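
The sum-of-logs form is also what is computed in practice, because a raw product of many probabilities underflows in floating point. The following is a minimal sketch of that point, not part of the original walkthrough: the 2000-token sequence with P=0.3 at every position is a hypothetical example chosen only to trigger the underflow.

```python
import math

# Anchor probabilities of the true tokens: cat, sat, <EOS>
probs = [0.4, 0.3, 0.25]

# Chain rule: the sentence probability is the product of the conditionals
sentence_prob = math.prod(probs)

# Summing per-token negative logs matches -log of the product
nll_from_sum = -sum(math.log(p) for p in probs)
nll_from_product = -math.log(sentence_prob)
print(f"NLL via sum     = {nll_from_sum:.4f}")
print(f"NLL via product = {nll_from_product:.4f}")

# Hypothetical long sequence: 2000 tokens, each with probability 0.3
long_probs = [0.3] * 2000
long_product = math.prod(long_probs)               # underflows to 0.0 in float64
long_nll = -sum(math.log(p) for p in long_probs)   # stays finite
print(f"Product over 2000 tokens = {long_product}")
print(f"NLL over 2000 tokens     = {long_nll:.1f}")
```

On the short anchor sentence both routes agree; on the long sequence only the log-space sum survives, since -log(0.0) is undefined.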

NLL Is Cross-Entropy

Standard cross-entropy for classification: CE = −Σ_k y_k · log(p_k)

When the label is one-hot (y_k=1 for the true class, 0 otherwise), all terms drop except the true class: CE = −log(p_{true class})

In a language model, each token prediction is exactly this: the label is the next actual token (one-hot), and the loss is −log of the probability assigned to that token. So:

NLL per position = CE per position

At position 1 (predicting "cat" after "The"):

One-hot: [0, 1, 0, 0, 0] for {The, cat, sat, dog, ran}
CE = −(0·log P(The) + 1·log P(cat) + 0·...) = −log(0.4) = 0.9163

This is identical to the NLL term at that position.
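
The equivalence can be checked numerically. In this sketch the full predicted distribution is an assumption: only P(cat)=0.4 comes from the anchor, and the other four probabilities are hypothetical values chosen so that the distribution sums to 1.

```python
import math

vocab = ["The", "cat", "sat", "dog", "ran"]
# Hypothetical predicted distribution after "The"; only P(cat)=0.4 is from the anchor
predicted = [0.1, 0.4, 0.2, 0.2, 0.1]
one_hot = [1.0 if word == "cat" else 0.0 for word in vocab]

# Full cross-entropy: a sum over every vocabulary entry
cross_entropy = -sum(y * math.log(p) for y, p in zip(one_hot, predicted))

# NLL: look up only the probability of the true token
nll = -math.log(predicted[vocab.index("cat")])

print(f"CE  = {cross_entropy:.4f}")  # 0.9163
print(f"NLL = {nll:.4f}")            # 0.9163
```

The probabilities assigned to the wrong tokens never enter the loss directly; they matter only because they must share the remaining 0.6 of probability mass.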

Trace Table

Pos	Token	P(token|context)	log P	−log P (NLL)
1	cat	0.4000	−0.9163	0.9163
2	sat	0.3000	−1.2040	1.2040
3	<EOS>	0.2500	−1.3863	1.3863
Total				3.5066
Mean				1.1689

Average NLL per token = 3.5066 / 3 = 1.1689 nats

Perplexity

PPL = exp(NLL per token) = exp(−(1/T) Σ log P(wₜ|context))

Computing on anchor: PPL = exp(1.1689) = 3.218

The model has an effective branching factor of about 3.2 at each step — as uncertain as randomly choosing among ~3 equally likely tokens.
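
One way to see why the "equally likely choices" reading holds: exponentiating the average NLL is the same as taking the reciprocal of the geometric mean of the per-token probabilities. A short sketch on the anchor probabilities (variable names are illustrative):

```python
import math

probs = [0.4, 0.3, 0.25]   # anchor probabilities of the true tokens
T = len(probs)

# Route 1: exponentiate the average NLL
ppl_from_nll = math.exp(-sum(math.log(p) for p in probs) / T)

# Route 2: reciprocal of the geometric mean probability
geometric_mean = math.prod(probs) ** (1 / T)
ppl_from_geomean = 1 / geometric_mean

print(f"PPL from average NLL    = {ppl_from_nll:.4f}")
print(f"PPL from geometric mean = {ppl_from_geomean:.4f}")
print(f"Geometric mean P        = {geometric_mean:.4f}")
```

A model that assigned the geometric-mean probability to the true token at every step would score the same perplexity, which is exactly the uniform-choice picture.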

Three Reference Cases

Perfect model: P(true_token) = 1.0 at every position.

NLL per token = −log(1.0) = 0
PPL = exp(0) = 1

Random model over V=5: P = 0.2 for every token at every step.

NLL per token = −log(0.2) = 1.6094
PPL = exp(1.6094) = 5.0 = V (the vocabulary size)

Anchor model: PPL = 3.218, better than random (3.2 < 5), worse than perfect (3.2 > 1).

Why LLMs Report Perplexity

NLL is in nats (or bits if log base 2 is used). Comparing "2.3 nats per token" vs "2.1 nats per token" requires knowing what scale of improvement is meaningful. Perplexity transforms this into an effective branching factor that is interpretable without domain knowledge.
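
The choice of log base changes the NLL number but not the perplexity, as long as the same base is used to exponentiate. A small sketch on the anchor probabilities:

```python
import math

probs = [0.4, 0.3, 0.25]   # anchor probabilities of the true tokens
T = len(probs)

avg_nll_nats = -sum(math.log(p) for p in probs) / T    # natural log
avg_nll_bits = -sum(math.log2(p) for p in probs) / T   # log base 2

# Undo each log with its own base
ppl_from_nats = math.exp(avg_nll_nats)
ppl_from_bits = 2 ** avg_nll_bits

print(f"Average NLL: {avg_nll_nats:.4f} nats = {avg_nll_bits:.4f} bits")
print(f"PPL from nats: {ppl_from_nats:.4f}")
print(f"PPL from bits: {ppl_from_bits:.4f}")
```

The two averages differ by the constant factor ln 2, so a loss reported in bits can be converted to nats by multiplying by ln 2 before comparing.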

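Approximate published perplexities for real models (measured on different corpora and with different tokenizers, so they illustrate scale rather than a ranking):

GPT-2 small (117M parameters): PPL ≈ 29 on WikiText-2, ≈ 37.5 on WikiText-103 (zero-shot)
GPT-3 (175B parameters): PPL ≈ 20 on Penn Treebank (zero-shot)
LLaMA-2 70B: PPL ≈ 3.3 (on some domain-specific benchmarks)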

Improvements are multiplicative: going from PPL=30 to PPL=15 halves the effective branching factor. Going from PPL=4 to PPL=2 also halves it, and both correspond to the same drop of ln 2 ≈ 0.69 nats per token, but that second improvement is far harder to achieve.
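
Because PPL = exp(avg NLL), a perplexity ratio maps to a fixed difference in average NLL. The helper below is a hypothetical illustration (the name `nats_saved` is not from any library) showing that any halving of perplexity saves the same number of nats per token.

```python
import math

def nats_saved(ppl_before, ppl_after):
    """Drop in average NLL (nats per token) implied by a perplexity change."""
    return math.log(ppl_before / ppl_after)

for before, after in [(30, 15), (4, 2)]:
    saved = nats_saved(before, after)
    print(f"PPL {before} -> {after}: {saved:.4f} nats per token saved")
# Both lines report 0.6931, i.e. ln 2
```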

Code

python

import numpy as np

# Anchor: probabilities assigned by LM to true next tokens
true_token_probs = [0.4, 0.3, 0.25]  # P(cat|The), P(sat|The,cat), P(<EOS>|The,cat,sat)
tokens = ["cat", "sat", "<EOS>"]

log_probs = [np.log(p) for p in true_token_probs]
nll_per_token = [-lp for lp in log_probs]
total_nll = sum(nll_per_token)
avg_nll = total_nll / len(tokens)
perplexity = np.exp(avg_nll)

print(f"{'Pos':>3} | {'Token':>6} | {'P(token)':>9} | {'log P':>8} | {'NLL':>8}")
for i, (tok, p, lp, nll) in enumerate(zip(tokens, true_token_probs, log_probs, nll_per_token)):
    print(f"{i+1:>3} | {tok:>6} | {p:>9.4f} | {lp:>8.4f} | {nll:>8.4f}")
print(f"\nTotal NLL:   {total_nll:.4f}")
print(f"Avg NLL:     {avg_nll:.4f}")
print(f"Perplexity:  {perplexity:.4f}")

# Three reference cases
print("\nReference cases:")
print(f"  Perfect (P=1.0):      PPL = {np.exp(-np.log(1.0)):.1f}")
print(f"  Random (P=0.2, V=5):  PPL = {np.exp(-np.log(0.2)):.1f}")
print(f"  Anchor model:         PPL = {perplexity:.4f}")

text

Pos |  Token |  P(token) |    log P |      NLL
  1 |    cat |    0.4000 |  -0.9163 |   0.9163
  2 |    sat |    0.3000 |  -1.2040 |   1.2040
  3 |  <EOS> |    0.2500 |  -1.3863 |   1.3863

Total NLL:   3.5066
Avg NLL:     1.1689
Perplexity:  3.2183

Reference cases:
  Perfect (P=1.0):      PPL = 1.0
  Random (P=0.2, V=5):  PPL = 5.0
  Anchor model:         PPL = 3.2183

Related Concepts

NLL is mechanically identical to cross-entropy loss (03-classification-losses.md) applied token-by-token. Softmax (03-activations/07-softmax.md) converts the model's logits into the probability distribution P over the vocabulary before NLL is computed. NLL is the quantity minimized during training, and perplexity is the standard way to report it at evaluation time, but neither directly measures generation quality: models are also evaluated with BLEU (n-gram overlap with references), ROUGE, and human evaluation, because low perplexity on next-token prediction doesn't guarantee good generation.
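
The softmax-then-NLL pipeline can be sketched in a few lines. The logits here are hypothetical values, not outputs of any real model, and `log_softmax` is a hand-written helper rather than a library call; it subtracts the maximum logit before exponentiating, which is a common way to keep the computation numerically stable.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: shift by the max logit before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

vocab = ["The", "cat", "sat", "dog", "ran"]
logits = [0.5, 2.0, 1.0, 1.0, 0.5]   # hypothetical model outputs after "The"

log_probs = log_softmax(logits)
nll = -log_probs[vocab.index("cat")]  # loss for the true next token "cat"

print(f"P(cat | The) = {math.exp(-nll):.4f}")
print(f"NLL          = {nll:.4f}")
```

Working with log-probabilities directly avoids computing a probability and then taking its log, which loses precision when the probability is very small.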

Honest Limitations

Perplexity is not comparable across models with different vocabularies or tokenizers. GPT-2 with a roughly 50,000-token vocabulary and a model with a BERT-style 30,000-token vocabulary cannot be meaningfully compared on raw per-token perplexity, because they split the same text into different numbers of tokens: a larger vocabulary covers the text with fewer tokens, so each token carries more information and is nominally harder to predict. A model with a 100,000-token vocabulary could have the same PPL as a 50k-token model while being strictly better at English. Comparing across tokenizers requires normalizing the total NLL by a unit that does not depend on the tokenizer, such as words, characters, or bytes.
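
A toy illustration of the tokenizer effect, with every number hypothetical: two models assign the same total NLL to the same text, but their tokenizers split it into different numbers of tokens.

```python
import math

total_nll_nats = 600.0   # hypothetical total NLL both models assign to the text
num_chars = 1000         # hypothetical length of the text in characters
tokens_a = 250           # hypothetical token count under tokenizer A (smaller vocabulary)
tokens_b = 200           # hypothetical token count under tokenizer B (larger vocabulary)

# Per-token perplexity depends on how many tokens the text was split into
ppl_a = math.exp(total_nll_nats / tokens_a)
ppl_b = math.exp(total_nll_nats / tokens_b)

# Bits per character normalizes by a unit both tokenizers share
bits_per_char = total_nll_nats / math.log(2) / num_chars

print(f"Per-token PPL, tokenizer A: {ppl_a:.2f}")
print(f"Per-token PPL, tokenizer B: {ppl_b:.2f}")
print(f"Bits per character (both):  {bits_per_char:.4f}")
```

Both models are equally good at predicting the text, yet the one with fewer tokens reports the higher per-token perplexity. The per-character figure is identical for both.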

Perplexity averages over all token positions equally. A single token with very low probability (like a rare proper noun with P=0.001) contributes NLL≈6.9 to the sum, about the same as 7 tokens with P=0.37. For language modeling of specialized domains (medical, legal, code), a handful of rare tokens can dominate the perplexity score while the model performs well on common language patterns.
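
The arithmetic behind that comparison, plus a sketch of how one rare token shifts a whole sequence. The 100-token sequence (99 tokens at P=0.5 and one at P=0.001) is a hypothetical example.

```python
import math

nll_rare = -math.log(0.001)       # one rare token
nll_seven = 7 * -math.log(0.37)   # seven ordinary tokens
print(f"One token at P=0.001:   {nll_rare:.4f} nats")
print(f"Seven tokens at P=0.37: {nll_seven:.4f} nats")

# Hypothetical sequence: 99 tokens at P=0.5 and one at P=0.001
probs = [0.5] * 99 + [0.001]
nlls = [-math.log(p) for p in probs]

ppl_with_rare = math.exp(sum(nlls) / len(nlls))
ppl_without_rare = math.exp(sum(nlls[:99]) / 99)
rare_share = nlls[-1] / sum(nlls)   # fraction of total NLL from the rare token

print(f"PPL without the rare token: {ppl_without_rare:.4f}")
print(f"PPL with the rare token:    {ppl_with_rare:.4f}")
print(f"Rare token share of NLL:    {rare_share:.1%}")
```

In this setup, 1% of the positions accounts for roughly 9% of the total loss, which is the sense in which rare tokens dominate.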

Low perplexity on a held-out set does not guarantee good generation quality. A model can achieve low perplexity by assigning moderate probability to many plausible continuations (safe predictions), while a model that generates coherent long-form text might assign higher probability to one specific good continuation and lower probability to many others. PPL and generation quality are correlated but not equivalent — evaluating generation requires BLEU, ROUGE, or human evaluation.

Test Your Understanding

A language model assigns P(cat|The)=0.4, P(mat|The)=0.1. The true next word is "cat". Compute the NLL contribution at this position. Now compute for P(cat|The)=0.8. How much does the loss change?
Compute perplexity for a 3-token sequence where the model assigns P=0.5, P=0.5, P=0.5 to the true tokens. What is the effective branching factor? What does this imply about the model's confidence?
A model trained on English Wikipedia achieves PPL=15 on WikiText. Evaluated on Python code, it achieves PPL=150. What does this difference tell you about the model's generalization? Is this expected?
Two models have the same perplexity of 20 on the test set. Model A achieved this with tokens averaging P=0.05 uniformly. Model B achieved this with most tokens at P=0.99 but a few extremely rare tokens at P=0.000001. Which model is better at "typical" text? Which would behave better in generation?
Show algebraically that for a uniform model over V vocabulary items (P(token)=1/V for all tokens at all positions), PPL = V exactly. What happens to PPL as V → ∞ for a fixed "goodness" of predictions?
