Triplet Loss

Jul 3, 2026 · 8 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

Contrastive loss uses pairs and absolute thresholds: push similar pairs below a distance, push dissimilar pairs above a margin. But "how close is close enough?" depends on the task. In face recognition, two Alice photos might be 0.2 apart, but you need Alice to be further from Bob than from herself. The absolute distance matters less than the relative ordering.

Triplet loss enforces exactly this relative constraint. Instead of pairs, it uses triplets: an Anchor (A) sample, a Positive (P) sample of the same class as A, and a Negative (N) sample of a different class. The loss is zero as long as the anchor is closer to the positive than to the negative by at least a margin α. It doesn't care how far apart the clusters are in absolute terms — only that the ordering holds.

Running example: face embeddings in 2D space, with margin α = 0.5.

python
A1 = [0.8, 0.2]   # Anchor (Alice photo 1)
A2 = [0.75, 0.25] # Positive (Alice photo 2)
B1 = [0.1, 0.9]   # Negative (Bob photo 1)

The Triplet Constraint

[Figure: the triplet constraint. Anchor A, positive P, and negative N, with distances d(A,P) and d(A,N); N must lie farther from the anchor than d(A,P) + α.]

The constraint: d(A,P) + α < d(A,N)

This says: the distance from the anchor to its positive must be less than the distance from the anchor to any negative, by at least margin α. If this holds, the triplet contributes zero loss. If it's violated, loss is proportional to how much it's violated.


The Formula

L = max(d(A,P)² − d(A,N)² + α, 0)

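  • If d(A,P)² − d(A,N)² + α ≤ 0: d(A,N)² ≥ d(A,P)² + α, constraint satisfied → loss = 0
  • If d(A,P)² − d(A,N)² + α > 0: constraint violated → loss > 0

Note: squared distances are used because the square root is not differentiable at zero distance, so leaving it out keeps the gradient smooth. The margin α therefore separates squared distances: the constraint the formula actually enforces is d(A,P)² + α < d(A,N)².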
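One consequence of squaring is easy to miss: the same α can accept a triplet under the plain-distance constraint and reject it under the squared-distance formula. The sketch below illustrates this with made-up distances (d_ap = 0.1 and d_an = 0.65 are illustrative values, not taken from the worked cases), and both helper names are hypothetical.

```python
def hinge_plain(d_ap, d_an, alpha):
    """Hinge on plain distances: max(d(A,P) - d(A,N) + alpha, 0)."""
    return max(d_ap - d_an + alpha, 0.0)


def hinge_squared(d_ap, d_an, alpha):
    """Hinge on squared distances, as in the formula above."""
    return max(d_ap**2 - d_an**2 + alpha, 0.0)


# Illustrative distances (assumed values, not from the worked cases).
d_ap, d_an, alpha = 0.1, 0.65, 0.5

plain = hinge_plain(d_ap, d_an, alpha)      # 0.1 - 0.65 + 0.5 is negative
squared = hinge_squared(d_ap, d_an, alpha)  # 0.01 - 0.4225 + 0.5 is positive

print(f"plain-distance hinge:   {plain:.4f}")
print(f"squared-distance hinge: {squared:.4f}")
```

When choosing α, it is worth checking which of the two conventions a given implementation uses.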


Case 1: Well-Separated (Loss = 0)

Anchor A1=[0.8,0.2], Positive A2=[0.75,0.25], Negative B1=[0.1,0.9], α=0.5

d(A,P): ‖[0.8,0.2] − [0.75,0.25]‖ = √((0.05)² + (0.05)²) = √0.005 ≈ 0.0707

d(A,N): ‖[0.8,0.2] − [0.1,0.9]‖ = √((0.7)² + (0.7)²) = √0.98 ≈ 0.9899

L = max(0.0707² − 0.9899² + 0.5, 0) = max(0.005 − 0.980 + 0.5, 0) = max(−0.475, 0) = 0

The constraint is satisfied with plenty of margin (N is far from A, P is close to A).


Case 2: Constraint Violated (Loss > 0)

New anchor A_c2=[0.5,0.5], Positive P_c2=[0.48,0.52], Negative N_c2=[0.55,0.55], α=0.5

N is very close to A: d(A,N)² falls far short of d(A,P)² + α.

d(A,P): ‖[0.5,0.5] − [0.48,0.52]‖ = √((0.02)² + (0.02)²) = √0.0008 ≈ 0.0283

d(A,N): ‖[0.5,0.5] − [0.55,0.55]‖ = √((0.05)² + (0.05)²) = √0.005 ≈ 0.0707

L = max(0.0283² − 0.0707² + 0.5, 0) = max(0.0008 − 0.005 + 0.5, 0) = max(0.4958, 0) = 0.4958

The negative is only 0.0707 from the anchor, so d(A,N)² = 0.005 is far below the d(A,P)² + α = 0.0008 + 0.5 = 0.5008 that the loss requires. The large loss pushes N away from A and pulls P toward A.
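The push and pull can be made concrete. While the loss is active, L = ‖A − P‖² − ‖A − N‖² + α, so its gradients are ∂L/∂A = 2(N − P), ∂L/∂P = 2(P − A), and ∂L/∂N = 2(A − N). The sketch below applies one plain gradient-descent step directly to the three Case 2 points. The learning rate of 0.1 is an arbitrary choice for illustration, the function names are hypothetical, and in a real model the gradients would flow into network weights rather than into the embeddings themselves.

```python
import numpy as np


def sq_triplet_loss(a, p, n, alpha=0.5):
    """Squared-distance triplet loss for one triplet."""
    return max(np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha, 0.0)


def triplet_grads(a, p, n):
    """Gradients of the loss w.r.t. A, P, N (valid while the loss is positive)."""
    return 2 * (n - p), 2 * (p - a), 2 * (a - n)


a = np.array([0.5, 0.5])
p = np.array([0.48, 0.52])
n = np.array([0.55, 0.55])
lr = 0.1  # arbitrary step size, chosen only for illustration

loss_before = sq_triplet_loss(a, p, n)
g_a, g_p, g_n = triplet_grads(a, p, n)
a2, p2, n2 = a - lr * g_a, p - lr * g_p, n - lr * g_n
loss_after = sq_triplet_loss(a2, p2, n2)

print(f"loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
print(f"d(A,P): {np.linalg.norm(a - p):.4f} -> {np.linalg.norm(a2 - p2):.4f}")
print(f"d(A,N): {np.linalg.norm(a - n):.4f} -> {np.linalg.norm(a2 - n2):.4f}")
```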


Trace Table

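| Case | d(A,P) | d(A,N) | d(A,P)² − d(A,N)² + α | max(·, 0) | Loss |
|---|---|---|---|---|---|
| Well-separated (A1, A2, B1) | 0.0707 | 0.9899 | 0.005 − 0.980 + 0.5 = −0.475 | max(−0.475, 0) | 0 |
| Violated (A_c2, P_c2, N_c2) | 0.0283 | 0.0707 | 0.0008 − 0.005 + 0.5 = 0.4958 | max(0.4958, 0) | 0.4958 |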
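Both rows of the trace table can be reproduced in one vectorized call. This is a minimal NumPy sketch; `batch_triplet_loss` is a hypothetical helper name, and it assumes that row i of each array belongs to triplet i.

```python
import numpy as np


def batch_triplet_loss(anchors, positives, negatives, alpha=0.5):
    """Per-triplet squared-distance loss; row i of each array forms one triplet."""
    d_ap_sq = np.sum((anchors - positives) ** 2, axis=1)
    d_an_sq = np.sum((anchors - negatives) ** 2, axis=1)
    return np.maximum(d_ap_sq - d_an_sq + alpha, 0.0)


anchors = np.array([[0.8, 0.2], [0.5, 0.5]])        # A1, A_c2
positives = np.array([[0.75, 0.25], [0.48, 0.52]])  # A2, P_c2
negatives = np.array([[0.1, 0.9], [0.55, 0.55]])    # B1, N_c2

losses = batch_triplet_loss(anchors, positives, negatives)
print(np.round(losses, 4))
```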

Triplet vs Contrastive

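[Figure: contrastive loss vs. triplet loss. Contrastive: the positive pair at D = 0.3 has loss 0.3² = 0.09; the negative pair at D = 0.8, inside the margin m, has loss (m − 0.8)². Triplet: d(A,P) = 0.3 and d(A,N) = 0.8 give loss 0 because the relative order is satisfied.]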

Contrastive loss penalizes each pair on its own: the positive pair (D = 0.3 contributes 0.3² = 0.09) and the negative pair (D = 0.8 is still inside the margin m, so it contributes (m − 0.8)²). Triplet loss looks only at the relative ordering: with α = 0.5, d(A,P)² − d(A,N)² + α = 0.09 − 0.64 + 0.5 = −0.05 < 0, so loss = 0. The embeddings can slide around in space as long as the ordering is maintained.
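The comparison can be checked numerically. The sketch below uses the contrastive form implied by the numbers in this section (D² for a positive pair, (m − D)² for a negative pair inside the margin). The contrastive margin m = 1.0 is an assumed value, since m is left unspecified here, and α = 0.5 is carried over from the running example. Both function names are hypothetical.

```python
def contrastive_pair_loss(dist, is_positive, m=1.0):
    """Contrastive loss for one pair; m = 1.0 is an assumed margin."""
    if is_positive:
        return dist**2
    return max(m - dist, 0.0) ** 2


def triplet_loss_from_dists(d_ap, d_an, alpha=0.5):
    """Squared-distance triplet loss computed from the two distances."""
    return max(d_ap**2 - d_an**2 + alpha, 0.0)


pos_loss = contrastive_pair_loss(0.3, is_positive=True)
neg_loss = contrastive_pair_loss(0.8, is_positive=False)
trip_loss = triplet_loss_from_dists(0.3, 0.8)

print(f"contrastive: positive pair {pos_loss:.2f}, negative pair {neg_loss:.2f}")
print(f"triplet:     {trip_loss:.2f}")
```

Under these assumptions the same geometry still produces a nonzero contrastive loss on both pairs, while the triplet loss is already zero.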


Hard Negative Mining

[Figure: negative mining zones around anchor A, from nearest to farthest: hard, semi-hard, and easy negatives.]

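  • Easy negative: d(A,N) > d(A,P) + α. Constraint satisfied, loss = 0, so there is no gradient and no learning.
  • Semi-hard negative: d(A,P) < d(A,N) < d(A,P) + α. N is farther than P but still inside the margin, so the constraint is violated with a moderate loss (below α). These are often the most useful triplets for learning.
  • Hard negative: d(A,N) < d(A,P). N is actually closer to A than P is, so the loss exceeds α. The signal is strong, but gradients can be unstable.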
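The three zones translate directly into a small classifier. This is a sketch with a hypothetical function name. It expects both distances in whatever form the loss compares (squared distances for the formula used in this post), and its handling of exact ties is an arbitrary choice, since the definitions above use strict inequalities.

```python
def classify_negative(d_ap, d_an, alpha):
    """Label a negative as 'hard', 'semi-hard', or 'easy' relative to the positive."""
    if d_an < d_ap:
        return "hard"
    if d_an < d_ap + alpha:
        return "semi-hard"
    # Includes the boundary d_an == d_ap + alpha, where the hinge is exactly 0.
    return "easy"


# Squared distances from the two worked cases (alpha = 0.5).
case1 = classify_negative(0.005, 0.98, 0.5)
case2 = classify_negative(0.0008, 0.005, 0.5)
# Hypothetical values: the negative sits nearer to the anchor than the positive.
swapped = classify_negative(0.325, 0.005, 0.5)

print(case1, case2, swapped)
```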

Online triplet mining pseudocode:

text
for each mini-batch B:
    compute pairwise distances for all samples in B
    for each anchor A:
        find all valid positives P (same class, different sample)
        find all semi-hard negatives N:
            d(A,P) < d(A,N) < d(A,P) + alpha
        form triplets (A, P, N) and compute loss
    backprop on mean triplet loss
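The pseudocode above can be sketched in NumPy as follows. This is an illustrative version, not a production miner: it uses explicit loops for readability (O(B³) per batch), compares squared distances to match the loss formula in this post, and stops at the mean loss because plain NumPy has no backprop. The function names and the four-point batch (the Case 2 points plus one extra sample assigned to the negative's class) are assumptions made for the example.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X, shape (B, B)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def semi_hard_triplet_loss(X, labels, alpha=0.5):
    """Mean loss over all semi-hard triplets in the batch, plus their indices."""
    D = pairwise_sq_dists(X)
    B = len(labels)
    losses, triplets = [], []
    for a in range(B):
        for p in range(B):
            # Valid positive: same class as the anchor, but a different sample.
            if p == a or labels[p] != labels[a]:
                continue
            for n in range(B):
                if labels[n] == labels[a]:
                    continue
                # Semi-hard: farther than the positive, but inside the margin.
                if D[a, p] < D[a, n] < D[a, p] + alpha:
                    losses.append(D[a, p] - D[a, n] + alpha)
                    triplets.append((a, p, n))
    mean_loss = float(np.mean(losses)) if losses else 0.0
    return mean_loss, triplets


# Illustrative batch: two samples of class 0, two of class 1.
X = np.array([[0.5, 0.5], [0.48, 0.52], [0.55, 0.55], [0.1, 0.9]])
labels = [0, 0, 1, 1]

mean_loss, triplets = semi_hard_triplet_loss(X, labels)
print(f"semi-hard triplets: {triplets}")
print(f"mean loss: {mean_loss:.4f}")
```

Anchors 2 and 3 contribute nothing here because their negatives are closer than their positive, which makes them hard rather than semi-hard.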

Code

python
import numpy as np

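def euclidean(e1, e2):
    """Euclidean (L2) distance between two embedding vectors."""
    return np.sqrt(np.sum((e1 - e2) ** 2))


def triplet_loss(anchor, positive, negative, alpha=0.5):
    """Return (loss, d(A,P), d(A,N)) for one triplet, using squared distances."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    # Hinge on the squared-distance gap: zero once d_an² reaches d_ap² + alpha.
    loss = max(d_ap**2 - d_an**2 + alpha, 0)
    return loss, d_ap, d_an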

A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
B1 = np.array([0.1, 0.9])

# Case 1: well-separated (loss=0)
loss1, dap1, dan1 = triplet_loss(A1, A2, B1)
print("Case 1 (well-separated):")
print(f"  d(A,P) = {dap1:.4f}, d(A,N) = {dan1:.4f}")
print(f"  d(A,P)² - d(A,N)² + α = {dap1**2:.4f} - {dan1**2:.4f} + 0.5 = {dap1**2 - dan1**2 + 0.5:.4f}")
print(f"  Loss = {loss1:.4f}")

# Case 2: violated (loss>0)
A_c2 = np.array([0.5, 0.5])
P_c2 = np.array([0.48, 0.52])
N_c2 = np.array([0.55, 0.55])
loss2, dap2, dan2 = triplet_loss(A_c2, P_c2, N_c2)
print("\nCase 2 (constraint violated):")
print(f"  d(A,P) = {dap2:.4f}, d(A,N) = {dan2:.4f}")
print(f"  d(A,P)² - d(A,N)² + α = {dap2**2:.4f} - {dan2**2:.4f} + 0.5 = {dap2**2 - dan2**2 + 0.5:.4f}")
print(f"  Loss = {loss2:.4f}")

text
Case 1 (well-separated):
  d(A,P) = 0.0707, d(A,N) = 0.9899
  d(A,P)² - d(A,N)² + α = 0.0050 - 0.9800 + 0.5 = -0.4750
  Loss = 0.0000

Case 2 (constraint violated):
  d(A,P) = 0.0283, d(A,N) = 0.0707
  d(A,P)² - d(A,N)² + α = 0.0008 - 0.0050 + 0.5 = 0.4958
  Loss = 0.4958

Triplet loss builds on contrastive loss (11-contrastive-loss.md). Both learn embeddings via distance-based objectives, but triplets replace the absolute thresholds with a relative constraint, which can make the objective more data-efficient. Modern contrastive learning (SimCLR's NT-Xent loss, CLIP's InfoNCE loss) extends triplets to n-way comparisons within a batch: every sample serves as an anchor and, at the same time, as a negative for the other samples. InfoNCE (used in CLIP) can be seen as softmax cross-entropy over similarities: the positive is treated as the "correct" class, and the loss approaches zero only when its similarity is well above that of every negative. Negatives that are as similar to the anchor as the positive, or more so, are penalized most heavily.
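The softmax view can be sketched for a single anchor. This is a simplified illustration, not the SimCLR or CLIP implementation: real implementations operate on a whole batch at once, the function name is hypothetical, and the temperature of 0.1 is an assumed value. It reuses the face embeddings from the running example.

```python
import numpy as np


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: cross-entropy with the positive as the target class."""
    def unit(v):
        return v / np.linalg.norm(v)

    a = unit(anchor)
    # Row 0 is the positive; the remaining rows are negatives.
    candidates = np.stack([unit(positive)] + [unit(n) for n in negatives])
    logits = candidates @ a / temperature  # cosine similarities / temperature
    logits -= logits.max()                 # shift for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-log_probs[0])


A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
B1 = np.array([0.1, 0.9])

loss_good = info_nce(A1, A2, [B1])  # positive is the most similar candidate
loss_bad = info_nce(A1, B1, [A2])   # roles swapped: the "negative" is more similar

print(f"correct ordering: {loss_good:.4f}")
print(f"swapped ordering: {loss_bad:.4f}")
```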

Honest Limitations

The number of possible triplets is O(n³): for 10,000 training samples, the upper bound is roughly 10^12. Most of them become easy (loss = 0) as training progresses, so even a random subset spends most of its computation on triplets that produce no gradient. Online mining within mini-batches (computing O(B²) distances per batch of size B) is standard, but it means each training step only sees a tiny fraction of possible triplets.
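The n³ figure is an upper bound; the number of valid triplets depends on how samples are split across classes. The sketch below counts them for a hypothetical layout of 10,000 samples in 100 equal classes (the class layout is an assumption, since the text gives only the sample count) and compares that with the per-batch distance count.

```python
def count_valid_triplets(class_sizes):
    """Count (anchor, positive, negative) triplets for the given class sizes."""
    n = sum(class_sizes)
    # Anchor: any of the m samples in a class; positive: one of the other m - 1
    # samples of that class; negative: any of the n - m samples outside it.
    return sum(m * (m - 1) * (n - m) for m in class_sizes)


class_sizes = [100] * 100   # hypothetical: 10,000 samples, 100 per class
n = sum(class_sizes)
valid = count_valid_triplets(class_sizes)

batch_size = 128            # illustrative batch size
print(f"n^3 upper bound:       {n**3:.2e}")
print(f"valid triplets:        {valid:.2e}")
print(f"distances per batch:   {batch_size**2}")
```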

Margin α has no principled default and interacts with the scale of the embedding space. If embeddings are L2-normalized to the unit sphere, α ∈ [0.2, 0.5] is common. Without normalization, the right α depends on the variance of the embedding distribution, which changes during training. With too small an α, most triplets become "easy" (loss = 0) after a few epochs and training stalls. With too large an α, almost every triplet stays in violation, including ones that are already well ordered, and the resulting gradients can be unstable.
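Normalization is what gives α a fixed scale: for unit vectors the squared distance equals 2 − 2·cos θ, so it always lies between 0 and 4 regardless of how large the raw embeddings are. A minimal sketch with synthetic data follows; the random embeddings, their scale, and the helper name are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


rng = np.random.default_rng(0)
raw = rng.normal(scale=10.0, size=(4, 128))  # synthetic, arbitrarily scaled embeddings
unit = l2_normalize(raw)

raw_sq = np.sum((raw[0] - raw[1]) ** 2)
unit_sq = np.sum((unit[0] - unit[1]) ** 2)
cosine = float(unit[0] @ unit[1])

print(f"raw squared distance:  {raw_sq:.1f}")
print(f"unit squared distance: {unit_sq:.4f}")
print(f"2 - 2*cos:             {2 - 2 * cosine:.4f}")
```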

Triplet loss assumes that the Euclidean distance in embedding space reflects semantic similarity. If the network is poorly initialized, early embeddings are random — hard negative mining then finds random samples as negatives, and the loss signal is noisy. Curriculum strategies (starting with semi-hard mining and introducing hard mining gradually) mitigate this but add training complexity.


Test Your Understanding

  1. Given A=[0.5,0.5], P=[0.55,0.45], N=[0.6,0.4], α=0.3. Compute d(A,P) and d(A,N). Is this an easy, semi-hard, or hard negative? Compute the triplet loss.

  2. In Case 1 (A1,A2,B1), the loss is 0. Now shift B1 to [0.5,0.5] (much closer to A1). Compute the new d(A,N) and loss. What does this tell you about the role of negative sample positioning?

  3. With n=1000 training samples and 10 classes (100 samples per class), how many valid triplets exist? Early in training, when embeddings are close to random, roughly what fraction would you expect to be easy (loss=0)? Explain why that fraction grows as training progresses.

  4. NT-Xent (SimCLR loss) can be seen as triplet loss with all other samples in the batch as negatives simultaneously. For a batch of 256 images, how many negatives does each anchor have in NT-Xent vs standard triplet mining? Why is this beneficial?

  5. A face recognition model trained with triplet loss achieves 99.5% verification accuracy. The embedding space has 128 dimensions. An engineer argues they should reduce to 32 dimensions for deployment efficiency. What effect would this dimensionality reduction likely have on the triplet loss landscape and on verification accuracy?
