
Contrastive Loss

Jul 3, 2026 · 7 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Standard image classification trains a model to output a fixed set of class labels. But many problems don't have a fixed class list. Face verification — "is this the same person as that photo?" — can involve billions of distinct identities, and new identities are added constantly. No classifier can enumerate them all.

The solution is to train a model to produce embeddings: vectors in a space where semantic similarity corresponds to geometric proximity. Same-person face photos should have nearby embeddings. Different-person photos should have distant embeddings. Given two new photos, you compare their embeddings rather than predicting class labels.
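
A minimal sketch of that comparison step, using hand-picked 2D vectors in place of a trained model's output. The `same_identity` helper and the 0.5 threshold are illustrative assumptions, not values from a real system:

```python
import numpy as np

def same_identity(emb_a, emb_b, threshold=0.5):
    """Return True if two embeddings are closer than `threshold` (hypothetical cutoff)."""
    distance = np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b))
    return bool(distance < threshold)

# Hand-picked embeddings standing in for a trained model's output
alice_1 = [0.8, 0.2]
alice_2 = [0.75, 0.25]
bob_1 = [0.1, 0.9]

print(same_identity(alice_1, alice_2))  # the two Alice vectors are close together
print(same_identity(alice_1, bob_1))    # Alice and Bob are far apart
```

In practice the threshold would be tuned on held-out pairs; nothing here depends on a fixed list of identities.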

Contrastive loss (Hadsell et al., 2006) is the foundational loss for this setup. It takes pairs of samples and a binary label: 1 = same class (positive pair), 0 = different class (negative pair). Similar pairs are pulled together. Dissimilar pairs are pushed apart — but only if they're closer than a margin m. If they're already far apart, no gradient flows.

Anchor: 4 face images in 2D embedding space.

python
A1 = [0.8, 0.2]   # Alice, photo 1
A2 = [0.75, 0.25] # Alice, photo 2
B1 = [0.1, 0.9]   # Bob, photo 1
B2 = [0.15, 0.85] # Bob, photo 2

Embedding Space

[Figure: Learned embedding space. Axes: embedding dim 1 and embedding dim 2. A1 and A2 form the Alice cluster, B1 and B2 form the Bob cluster; the inter-cluster distance is D ≈ 0.99.]

The Formula

For a pair (x₁, x₂) with label y (1=similar, 0=dissimilar) and Euclidean distance D = ‖f(x₁) − f(x₂)‖₂:

L = y · D² + (1 − y) · max(m − D, 0)²

Positive pair (y=1): L = D² — pull the pair together (loss grows with distance)

Negative pair (y=0): L = max(m − D, 0)² — push apart only if D < m; if already far enough, loss = 0

The margin m defines "far enough." Negative pairs further than m contribute zero gradient — the model only spends capacity separating pairs that are actually ambiguous.
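
The zero-gradient behaviour can be checked by differentiating the formula by hand. The sketch below is one way to write the gradient with respect to the first embedding; the function name and the `eps` guard against division by zero are assumptions added for illustration:

```python
import numpy as np

def contrastive_grad(e1, e2, y, margin=1.0, eps=1e-12):
    """Gradient of y*D^2 + (1-y)*max(margin-D, 0)^2 with respect to e1."""
    diff = e1 - e2
    D = np.linalg.norm(diff)
    if y == 1:
        # d(D^2)/d(e1) = 2 * (e1 - e2): a descent step pulls e1 toward e2
        return 2.0 * diff
    if D >= margin:
        # Negative pair already outside the margin: no gradient at all
        return np.zeros_like(diff)
    # d((m - D)^2)/d(e1) = -2 * (m - D) * (e1 - e2) / D: a descent step pushes e1 away
    return -2.0 * (margin - D) * diff / (D + eps)

origin = np.array([0.0, 0.0])
far = contrastive_grad(np.array([1.2, 0.0]), origin, y=0)   # D = 1.2, outside m = 1.0
near = contrastive_grad(np.array([0.5, 0.0]), origin, y=0)  # D = 0.5, inside m = 1.0
print(far)
print(near)
```

The `far` gradient is the zero vector, while `near` has a negative x-component, so subtracting it during gradient descent moves the first embedding away from the second.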


Computation on Anchor

Positive pair (A1, A2) — same person:

D = ‖[0.8,0.2] − [0.75,0.25]‖ = √((0.8−0.75)² + (0.2−0.25)²) = √(0.05² + 0.05²) = √(0.0025 + 0.0025) = √0.005 ≈ 0.0707

Loss = 1 × D² = 1 × 0.005 = 0.005 — very small because these photos are already very close.

Negative pair (A1, B1) — different people:

D = ‖[0.8,0.2] − [0.1,0.9]‖ = √((0.7)² + (0.7)²) = √(0.49 + 0.49) = √0.98 ≈ 0.9899

With margin m = 1.0: m − D = 1.0 − 0.9899 = 0.0101, so the pair is just barely inside the margin.

Loss = (1−0) × max(0.0101, 0)² = 0.0101² ≈ 0.0001 — tiny loss because the pair is almost at the margin.

What if D > m? If D = 1.2 and m = 1.0: max(1.0 − 1.2, 0) = max(−0.2, 0) = 0 → loss = 0. Already far enough, no gradient.


Trace Table

| Pair | Label y | D | Loss |
|---|---|---|---|
| (A1, A2) | 1 (same) | 0.0707 | 1 × 0.0707² = 0.0050 |
| (A1, B1) | 0 (diff) | 0.9899 | max(1 − 0.9899, 0)² = 0.0101² = 0.0001 |
| (A2, B2) | 0 (diff) | ‖[0.75,0.25] − [0.15,0.85]‖ = √(0.6² + 0.6²) = √0.72 ≈ 0.8485 | max(1 − 0.84853, 0)² = 0.15147² ≈ 0.0229 |
| (B1, B2) | 1 (same) | ‖[0.1,0.9] − [0.15,0.85]‖ = √(0.05² + 0.05²) = √0.005 ≈ 0.0707 | 1 × 0.0707² = 0.0050 |

Total loss = 0.0050 + 0.0001 + 0.0229 + 0.0050 = 0.0330

The A2-B2 pair contributes the most: the two points are 0.85 apart, inside the margin of 1.0, so a loss of 0.0229 still pushes them further apart.


Margin Sensitivity

[Figure: Negative-pair loss max(m − D, 0)² versus distance D, plotted for m = 0.5, m = 1.0, and m = 2.0. The loss is 0 whenever D ≥ m.]
  • m=0.5: Loss is zero for D≥0.5. Negative pairs beyond 0.5 receive no gradient — may be too loose.
  • m=1.0: Loss is zero for D≥1.0. Provides a moderate separation boundary.
  • m=2.0: Loss continues until D=2.0 — forces all negatives to be at least 2.0 apart. Can waste capacity pushing already-well-separated pairs further.
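
The same curves can be tabulated directly. This sketch evaluates the negative-pair term on a handful of distances for each margin; the grid of distances is an arbitrary choice for illustration:

```python
import numpy as np

def negative_pair_loss(D, margin):
    """Negative-pair term max(margin - D, 0)^2, vectorized over distances D."""
    return np.maximum(margin - D, 0.0) ** 2

distances = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # arbitrary sample distances
for m in (0.5, 1.0, 2.0):
    losses = negative_pair_loss(distances, m)
    print(f"m={m}: " + ", ".join(f"{v:.2f}" for v in losses))
```

Each row drops to zero at D = m and stays there, which is the flat region of the plot.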

Code

python
import numpy as np

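def euclidean(e1, e2):
    # L2 (Euclidean) distance between two embedding vectors
    return np.sqrt(np.sum((e1 - e2)**2))

def contrastive_loss(e1, e2, y, margin=1.0):
    # y = 1 for a positive (same-class) pair, 0 for a negative pair
    D = euclidean(e1, e2)
    pos_loss = y * D**2                          # pull positives together
    neg_loss = (1 - y) * max(margin - D, 0)**2   # push negatives out to the margin
    return pos_loss + neg_loss, D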

A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
B1 = np.array([0.1, 0.9])
B2 = np.array([0.15, 0.85])

pairs = [("A1,A2", A1, A2, 1), ("A1,B1", A1, B1, 0),
         ("A2,B2", A2, B2, 0), ("B1,B2", B1, B2, 1)]

print(f"{'Pair':>6} | {'y':>2} | {'D':>8} | {'loss':>8}")
total = 0
for name, e1, e2, y in pairs:
    loss, D = contrastive_loss(e1, e2, y)
    print(f"{name:>6} | {y:>2} | {D:>8.4f} | {loss:>8.4f}")
    total += loss
print(f"\nTotal contrastive loss: {total:.4f}")

# Margin sensitivity
print("\nMargin sensitivity (A1-B1 negative pair, D≈0.99):")
for m in [0.5, 1.0, 1.5, 2.0]:
    loss, D = contrastive_loss(A1, B1, y=0, margin=m)
    print(f"  m={m}: loss = {loss:.4f}")
text
  Pair |  y |        D |     loss
 A1,A2 |  1 |   0.0707 |   0.0050
 A1,B1 |  0 |   0.9899 |   0.0001
 A2,B2 |  0 |   0.8485 |   0.0229
 B1,B2 |  1 |   0.0707 |   0.0050

Total contrastive loss: 0.0330

Margin sensitivity (A1-B1 negative pair, D≈0.99):
  m=0.5: loss = 0.0000
  m=1.0: loss = 0.0001
  m=1.5: loss = 0.2602
  m=2.0: loss = 1.0202

At m=0.5: the A1-B1 pair (D≈0.99) is already beyond the margin — zero loss. At m=2.0: the pair must be pushed all the way to D=2.0, contributing loss 1.02.
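
The loop above handles one pair at a time. A vectorized variant is sketched below, assuming the pairs are stacked row-wise into two arrays; the function name `contrastive_loss_batch` is an illustrative choice:

```python
import numpy as np

def contrastive_loss_batch(E1, E2, y, margin=1.0):
    """Per-pair contrastive loss for stacked pairs; E1 and E2 have shape (n_pairs, dim)."""
    D = np.linalg.norm(E1 - E2, axis=1)  # one distance per row
    return y * D**2 + (1 - y) * np.maximum(margin - D, 0.0) ** 2

# Same four pairs as the loop: (A1,A2), (A1,B1), (A2,B2), (B1,B2)
E1 = np.array([[0.8, 0.2], [0.8, 0.2], [0.75, 0.25], [0.1, 0.9]])
E2 = np.array([[0.75, 0.25], [0.1, 0.9], [0.15, 0.85], [0.15, 0.85]])
y = np.array([1, 0, 0, 1])

per_pair = contrastive_loss_batch(E1, E2, y)
print(per_pair.round(4), per_pair.sum().round(4))
```

This computes the same per-pair losses without a Python loop, which is the shape the computation usually takes inside a training step.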


Contrastive loss is the foundation for metric learning. Triplet loss (12-triplet-loss.md) extends it by using three samples at once (anchor, positive, and negative), providing a richer gradient signal. Modern contrastive methods (SimCLR, CLIP) work on large batches in which every sample serves as a negative for every other sample. This avoids the need for manual pair construction and scales naturally. When embeddings are L2-normalized, squared Euclidean distance is an affine function of cosine similarity (D² = 2 − 2·cos θ), so the distance-based loss used here is closely related to, though not identical to, losses defined on similarity scores such as binary cross-entropy (03-classification-losses.md).
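
One concrete relationship behind the fixed-norm remark: for unit-norm embeddings, squared Euclidean distance and cosine similarity are tied by D² = 2 − 2·cos θ. A quick numerical check, using random vectors as a stand-in for real embeddings (the seed and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
a, b = rng.normal(size=(2, 8))   # two random 8-dimensional vectors
a /= np.linalg.norm(a)           # L2-normalize to unit length
b /= np.linalg.norm(b)

squared_distance = np.sum((a - b) ** 2)
cosine_similarity = a @ b
print(np.isclose(squared_distance, 2 - 2 * cosine_similarity))
```

Because the two quantities differ only by an affine map, ranking pairs by distance and ranking them by cosine similarity give the same order on normalized embeddings.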

Honest Limitations

Pair sampling is expensive. With n training samples there are O(n²) possible pairs. In practice, not all pairs provide useful training signal — easy negatives (pairs that are already far apart) contribute zero gradient. Hard negative mining — finding negatives that are within the margin and thus useful for training — requires computing distances between all pairs, which is O(n²) at each epoch. For large datasets this becomes a computational bottleneck.
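
A brute-force sketch of hard negative mining makes the cost visible: the full pairwise distance matrix has n² entries. The function name `mine_hard_negatives` and the margin of 0.9 in the demo are assumptions chosen for illustration:

```python
import numpy as np

def mine_hard_negatives(embeddings, labels, margin=1.0):
    """Return index pairs (i, j), i < j, with different labels and distance < margin."""
    # All pairwise distances: this (n, n) matrix is the O(n^2) cost
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    different = labels[:, None] != labels[None, :]
    upper = np.triu(np.ones_like(different, dtype=bool), k=1)  # count each pair once
    i, j = np.where(different & upper & (dist < margin))
    return list(zip(i.tolist(), j.tolist()))

embeddings = np.array([[0.8, 0.2], [0.75, 0.25], [0.1, 0.9], [0.15, 0.85]])
labels = np.array([0, 0, 1, 1])  # 0 = Alice, 1 = Bob
print(mine_hard_negatives(embeddings, labels, margin=0.9))
```

With a margin of 0.9 only the A2-B2 pair (indices 1 and 3) falls inside the margin; the other three negative pairs are already far enough apart to contribute nothing.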

The margin m is a hyperparameter with no principled default. It depends on the scale of the embedding space. If embeddings are L2-normalized to the unit sphere, distances are bounded by 2 and typical margins are 0.5 to 1.0. Without normalization, the right margin is completely task-dependent. A margin that is too small leaves most negatives beyond it, so they produce no gradient and end up only barely separated. A margin that is too large keeps penalizing pairs that are already well separated, and on the unit sphere a margin above 2 can never be satisfied.

Contrastive loss uses only pairwise information. If A1 and C1 are somewhat similar (different people but similar-looking), contrastive loss doesn't capture the three-way relationship between A1, A2 (positive), and C1 (hard negative). Triplet loss handles this directly by making the model learn that D(A,P) < D(A,N) — a relative constraint that contrastive loss's pairwise formulation cannot express.
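
For comparison, a sketch of one common form of the triplet loss, max(D(A,P) − D(A,N) + margin, 0). The margin of 0.2 and the look-alike embedding `C1` are hypothetical values chosen for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """One common triplet form: max(D(a,p) - D(a,n) + margin, 0)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
C1 = np.array([0.7, 0.35])   # hypothetical look-alike, not part of the running example
B1 = np.array([0.1, 0.9])

print(triplet_loss(A1, A2, C1))  # hard negative: constraint violated, positive loss
print(triplet_loss(A1, A2, B1))  # easy negative: constraint satisfied, zero loss
```

The loss depends only on the gap between the two distances, so it constrains relative ordering rather than pushing every negative past an absolute margin.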


Test Your Understanding

  1. Compute the Euclidean distance between A1=[0.8, 0.2] and B2=[0.15, 0.85]. Is this pair positive or negative? Compute the contrastive loss at m=1.0.

  2. A negative pair has D=1.5 and m=1.0. What are the loss and the gradient for this pair? Now increase m to 2.0. What changes?

  3. You train with m=0.3 on a face recognition task. After training, you find that many negative pairs (different people) are clustered at D≈0.35, which is just barely above the margin. What went wrong and how would you fix it?

  4. With n=10,000 training images, how many possible pairs (positive + negative) are there? If 0.5% of pairs are positive (same person), how many positive and negative pairs are there? Why is this imbalance a problem?

  5. In SimCLR, the contrastive loss is applied over a batch of n images with 2n augmented views. Every pair (i,j) that's not an augmented view of the same image is a negative. For n=256, how many negative pairs does each sample have? Why is this beneficial compared to mining negatives from the full dataset?
