Contrastive Loss
Standard image classification trains a model to pick one label from a fixed set of classes. But many problems don't have a fixed class list. Face verification ("is this the same person as that photo?") can involve billions of distinct identities, and new identities are added constantly. No classifier can enumerate them all.
The solution is to train a model to produce embeddings: vectors in a space where semantic similarity corresponds to geometric proximity. Same-person face photos should have nearby embeddings. Different-person photos should have distant embeddings. Given two new photos, you compare their embeddings rather than predicting class labels.
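As a rough illustration of "compare embeddings rather than predict labels", the sketch below thresholds the Euclidean distance between two embedding vectors. The function names, the made-up 2D vectors, and the cutoff of 0.5 are assumptions for illustration only; a real system would tune the threshold on validation data.

```python
import numpy as np

def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two embedding vectors."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.linalg.norm(a - b))

def is_same_identity(emb_a, emb_b, threshold=0.5):
    """Decide 'same person' when the embeddings are closer than a (hypothetical) threshold."""
    return embedding_distance(emb_a, emb_b) < threshold

# Made-up 2D embeddings: two photos of one person, one photo of another
alice_1 = [0.8, 0.2]
alice_2 = [0.75, 0.25]
bob_1 = [0.1, 0.9]

print(is_same_identity(alice_1, alice_2))  # True  (distance is about 0.07)
print(is_same_identity(alice_1, bob_1))    # False (distance is about 0.99)
```

No class list appears anywhere in this decision: a new identity only needs an embedding, not a retrained output layer.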
Contrastive loss (Hadsell et al., 2006) is the foundational loss for this setup. It takes pairs of samples and a binary label: 1 = same class (positive pair), 0 = different class (negative pair). Similar pairs are pulled together. Dissimilar pairs are pushed apart — but only if they're closer than a margin m. If they're already far apart, no gradient flows.
Anchor: 4 face images in 2D embedding space.
A1 = [0.8, 0.2] # Alice, photo 1
A2 = [0.75, 0.25] # Alice, photo 2
B1 = [0.1, 0.9] # Bob, photo 1
B2 = [0.15, 0.85] # Bob, photo 2
[Figure: Embedding Space]
The Formula
For a pair (x₁, x₂) with label y (1=similar, 0=dissimilar) and Euclidean distance D = ‖f(x₁) − f(x₂)‖₂:
L = y · D² + (1 − y) · max(m − D, 0)²
Positive pair (y=1): L = D² — pull the pair together (loss grows with distance)
Negative pair (y=0): L = max(m − D, 0)² — push apart only if D < m; if already far enough, loss = 0
The margin m defines "far enough." Negative pairs further than m contribute zero gradient — the model only spends capacity separating pairs that are actually ambiguous.
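The "zero gradient beyond the margin" behaviour follows from differentiating the formula with respect to D: the positive term gives 2D, and the negative term gives −2·max(m − D, 0). The sketch below writes that derivative out and compares it with a finite-difference estimate. The function names are illustrative, and the gradient here is with respect to the distance D only, not the embeddings themselves.

```python
def contrastive_loss_from_distance(D, y, margin=1.0):
    """L = y * D^2 + (1 - y) * max(m - D, 0)^2, as a function of the distance D."""
    return y * D**2 + (1 - y) * max(margin - D, 0.0) ** 2

def contrastive_grad_wrt_distance(D, y, margin=1.0):
    """Analytic dL/dD: 2D for positives, -2(m - D) for negatives inside the margin, else 0."""
    return 2.0 * y * D - 2.0 * (1 - y) * max(margin - D, 0.0)

def finite_difference(D, y, margin=1.0, eps=1e-6):
    """Central-difference estimate of dL/dD, used only as a sanity check."""
    up = contrastive_loss_from_distance(D + eps, y, margin)
    down = contrastive_loss_from_distance(D - eps, y, margin)
    return (up - down) / (2 * eps)

# A close positive, a negative inside the margin, and a negative beyond it
for D, y in [(0.0707, 1), (0.8485, 0), (1.2, 0)]:
    analytic = contrastive_grad_wrt_distance(D, y)
    numeric = finite_difference(D, y)
    print(f"D={D:.4f} y={y}: analytic={analytic:+.4f} numeric={numeric:+.4f}")
```

A positive gradient means that decreasing D lowers the loss (pull together); a negative gradient means that increasing D lowers the loss (push apart). The negative pair at D = 1.2 gets exactly zero.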
Computation on Anchor
Positive pair (A1, A2) — same person:
D = ‖[0.8,0.2] − [0.75,0.25]‖ = √((0.8−0.75)² + (0.2−0.25)²) = √(0.05² + 0.05²) = √(0.0025 + 0.0025) = √0.005 ≈ 0.0707
Loss = 1 × D² = 1 × 0.005 = 0.005 — very small because these photos are already very close.
Negative pair (A1, B1) — different people:
D = ‖[0.8,0.2] − [0.1,0.9]‖ = √((0.7)² + (0.7)²) = √(0.49 + 0.49) = √0.98 ≈ 0.9899
With margin m = 1.0: m − D = 1.0 − 0.9899 = 0.0101, so the pair is just barely inside the margin.
Loss = (1−0) × max(0.0101, 0)² = 0.0101² ≈ 0.0001 — tiny loss because the pair is almost at the margin.
What if D > m? If D = 1.2 and m = 1.0: max(1.0 − 1.2, 0) = max(−0.2, 0) = 0 → loss = 0. Already far enough, no gradient.
Trace Table
| Pair | Label y | D | Loss (m = 1.0) |
|---|---|---|---|
| (A1, A2) | 1 (same) | √(0.05² + 0.05²) = √0.005 ≈ 0.0707 | D² = 0.0050 |
| (A1, B1) | 0 (diff) | √(0.7² + 0.7²) = √0.98 ≈ 0.9899 | max(1 − D, 0)² ≈ 0.01005² ≈ 0.0001 |
| (A2, B2) | 0 (diff) | √(0.6² + 0.6²) = √0.72 ≈ 0.8485 | max(1 − D, 0)² ≈ 0.15147² ≈ 0.0229 |
| (B1, B2) | 1 (same) | √(0.05² + 0.05²) = √0.005 ≈ 0.0707 | D² = 0.0050 |
Total loss = 0.0050 + 0.0001 + 0.0229 + 0.0050 = 0.0330
The A2-B2 pair contributes the most: the two points are about 0.85 apart, inside the margin of 1.0, so a loss of 0.0229 is still pushing them further apart.
Margin Sensitivity
- m=0.5: Loss is zero for D≥0.5. Negative pairs beyond 0.5 receive no gradient — may be too loose.
- m=1.0: Loss is zero for D≥1.0. Provides a moderate separation boundary.
- m=2.0: Loss continues until D=2.0 — forces all negatives to be at least 2.0 apart. Can waste capacity pushing already-well-separated pairs further.
Code
```python
import numpy as np

def euclidean(e1, e2):
    """Euclidean (L2) distance between two embedding vectors."""
    return np.sqrt(np.sum((e1 - e2) ** 2))

def contrastive_loss(e1, e2, y, margin=1.0):
    """Return (loss, distance) for one pair; y=1 is a positive pair, y=0 a negative pair."""
    D = euclidean(e1, e2)
    pos_loss = y * D**2                         # pull positives together
    neg_loss = (1 - y) * max(margin - D, 0)**2  # push negatives apart only inside the margin
    return pos_loss + neg_loss, D

A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
B1 = np.array([0.1, 0.9])
B2 = np.array([0.15, 0.85])

pairs = [("A1,A2", A1, A2, 1), ("A1,B1", A1, B1, 0),
         ("A2,B2", A2, B2, 0), ("B1,B2", B1, B2, 1)]

print(f"{'Pair':>6} | {'y':>2} | {'D':>8} | {'loss':>8}")
total = 0
for name, e1, e2, y in pairs:
    loss, D = contrastive_loss(e1, e2, y)
    print(f"{name:>6} | {y:>2} | {D:>8.4f} | {loss:>8.4f}")
    total += loss
print(f"\nTotal contrastive loss: {total:.4f}")

# Margin sensitivity
print("\nMargin sensitivity (A1-B1 negative pair, D≈0.99):")
for m in [0.5, 1.0, 1.5, 2.0]:
    loss, D = contrastive_loss(A1, B1, y=0, margin=m)
    print(f"  m={m}: loss = {loss:.4f}")
```

```
  Pair |  y |        D |     loss
 A1,A2 |  1 |   0.0707 |   0.0050
 A1,B1 |  0 |   0.9899 |   0.0001
 A2,B2 |  0 |   0.8485 |   0.0229
 B1,B2 |  1 |   0.0707 |   0.0050

Total contrastive loss: 0.0330

Margin sensitivity (A1-B1 negative pair, D≈0.99):
  m=0.5: loss = 0.0000
  m=1.0: loss = 0.0001
  m=1.5: loss = 0.2602
  m=2.0: loss = 1.0202
```

At m=0.5, the A1-B1 pair (D≈0.99) is already beyond the margin, so its loss is zero. At m=2.0, the pair would have to be pushed all the way out to D=2.0, and it contributes a loss of about 1.02.
Related Concepts
Contrastive loss is the foundation for metric learning. Triplet loss (12-triplet-loss.md) extends it by using three samples at once (anchor, positive, and negative), which provides a richer gradient signal. In modern contrastive learning (SimCLR, CLIP), training uses large batches in which every other sample in the batch, apart from a sample's own positive, serves as a negative. This avoids manual pair construction and scales naturally. When embeddings are L2-normalized, squared Euclidean distance is a fixed function of cosine similarity (D² = 2 − 2·cos θ), so distance-based and similarity-based objectives rank pairs identically. Binary cross-entropy (03-classification-losses.md) applied to pairwise similarity scores is a closely related alternative, but it is not the same function as the margin-based loss above.
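For embeddings with unit norm, squared Euclidean distance and cosine similarity carry the same information, because D² = 2 − 2·cos θ. Here is a quick numerical check using two of the anchor vectors after normalization; `l2_normalize` is an illustrative helper, not something from the original code.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = l2_normalize([0.8, 0.2])   # A1, projected onto the unit circle
b = l2_normalize([0.1, 0.9])   # B1, projected onto the unit circle

squared_distance = float(np.sum((a - b) ** 2))
cosine_similarity = float(np.dot(a, b))

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(theta)
print(np.isclose(squared_distance, 2.0 - 2.0 * cosine_similarity))  # True
```

This is why a margin expressed in distance can be translated into a threshold on cosine similarity once embeddings are normalized.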
Honest Limitations
Pair sampling is expensive. With n training samples there are O(n²) possible pairs. In practice, not all pairs provide useful training signal — easy negatives (pairs that are already far apart) contribute zero gradient. Hard negative mining — finding negatives that are within the margin and thus useful for training — requires computing distances between all pairs, which is O(n²) at each epoch. For large datasets this becomes a computational bottleneck.
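A common way to limit this cost is to mine only within a mini-batch rather than across the whole dataset. The sketch below builds the full pairwise distance matrix for a small batch and keeps the negative pairs that still fall inside the margin, which are the only negatives with non-zero gradient. The function names and the string labels are assumptions for illustration. The distance matrix itself is still quadratic in the batch size, so this only helps when the batch is much smaller than the dataset.

```python
import numpy as np

def pairwise_distances(embeddings):
    """All-pairs Euclidean distances for a (batch, dim) array, via broadcasting."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def active_negative_pairs(embeddings, labels, margin=1.0):
    """Return index pairs (i, j), i < j, with different labels and distance below the margin."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    D = pairwise_distances(embeddings)
    different = labels[:, None] != labels[None, :]
    upper = np.triu(np.ones(D.shape, dtype=bool), k=1)  # count each unordered pair once
    rows, cols = np.where(different & upper & (D < margin))
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

batch = [[0.8, 0.2], [0.75, 0.25], [0.1, 0.9], [0.15, 0.85]]  # A1, A2, B1, B2
labels = ["alice", "alice", "bob", "bob"]

print(active_negative_pairs(batch, labels, margin=1.0))   # all four Alice-Bob pairs
print(active_negative_pairs(batch, labels, margin=0.95))  # A1-B1 (D≈0.99) drops out
```

With the anchor data, every Alice-Bob pair is inside a margin of 1.0; tightening the margin to 0.95 removes A1-B1 from the active set.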
The margin m is a hyperparameter with no principled default. It depends on the scale of the embedding space. If embeddings are L2-normalized to the unit sphere, distances lie in [0, 2] and typical margins are 0.5 to 1.0. Without normalization, the right margin is entirely task-dependent. If m is too small, negatives clear the margin early and stop receiving gradient before they are well separated. If m is too large, the loss keeps pushing pairs that are already well separated, and on the unit sphere a margin above 2 can never be satisfied.
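To see the scale dependence concretely, the sketch below rescales the A2-B2 negative pair while holding the margin at 1.0. The scale factors are arbitrary and only stand in for a model whose embeddings happen to have a larger or smaller norm.

```python
import numpy as np

def negative_pair_loss(e1, e2, margin):
    """Return (loss, distance) for a negative pair: max(m - D, 0)^2."""
    D = float(np.linalg.norm(np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float)))
    return max(margin - D, 0.0) ** 2, D

A2 = np.array([0.75, 0.25])
B2 = np.array([0.15, 0.85])

# Same geometry, different overall scale, same margin
for scale in [0.5, 1.0, 2.0]:
    loss, D = negative_pair_loss(scale * A2, scale * B2, margin=1.0)
    print(f"scale={scale}: D={D:.4f}, loss={loss:.4f}")
```

Doubling the embedding scale moves the pair outside the margin and the loss vanishes, while halving it makes the same pair look far too close. The margin only has meaning relative to the scale of the embeddings.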
Contrastive loss uses only pairwise information. If A1 and C1 are somewhat similar (different people but similar-looking), contrastive loss doesn't capture the three-way relationship between A1, A2 (positive), and C1 (hard negative). Triplet loss handles this directly by making the model learn that D(A,P) < D(A,N) — a relative constraint that contrastive loss's pairwise formulation cannot express.
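For comparison, here is a minimal sketch of a triplet loss in its common hinge form, max(D(A,P) − D(A,N) + margin, 0). The exact variant (plain versus squared distances, margin value) differs between papers and libraries, so treat this as one reasonable form rather than the definitive one. `C1` is a hypothetical look-alike embedding that is not part of the anchor data.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the relative constraint D(A,P) + margin <= D(A,N)."""
    d_ap = float(np.linalg.norm(anchor - positive))
    d_an = float(np.linalg.norm(anchor - negative))
    return max(d_ap - d_an + margin, 0.0)

A1 = np.array([0.8, 0.2])    # Alice, photo 1 (anchor)
A2 = np.array([0.75, 0.25])  # Alice, photo 2 (positive)
B1 = np.array([0.1, 0.9])    # Bob (easy negative, far away)
C1 = np.array([0.7, 0.35])   # hypothetical look-alike (hard negative)

print(f"easy negative B1: {triplet_loss(A1, A2, B1):.4f}")
print(f"hard negative C1: {triplet_loss(A1, A2, C1):.4f}")
```

The easy negative already satisfies the constraint and contributes nothing, while the look-alike violates it and produces a positive loss. The loss depends on how far the negative is relative to the positive, not on an absolute distance.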
Test Your Understanding
- Compute the Euclidean distance between A1=[0.8, 0.2] and B2=[0.15, 0.85]. Is this pair positive or negative? Compute the contrastive loss at m=1.0.
- A negative pair has D=1.5 and m=1.0. What is the loss and gradient for this pair? Now increase m to 2.0. What changes?
- You train with m=0.3 on a face recognition task. After training, you find that many negative pairs (different people) are clustered at D≈0.35, which is just barely above the margin. What went wrong and how would you fix it?
- With n=10,000 training images, how many possible pairs (positive + negative) are there? If 0.5% of pairs are positive (same person), how many positive and negative pairs are there? Why is this imbalance a problem?
- In SimCLR, the contrastive loss is applied over a batch of n images with 2n augmented views. Every pair (i,j) that's not an augmented view of the same image is a negative. For n=256, how many negative pairs does each sample have? Why is this beneficial compared to mining negatives from the full dataset?