Triplet Loss

Jul 3, 2026 • 8 min read • By Mohammed Vasim

deep-learning, neural-networks, machine-learning, representation-learning

Contrastive loss uses pairs and absolute thresholds: push similar pairs below a distance, push dissimilar pairs above a margin. But "how close is close enough?" depends on the task. In face recognition, two Alice photos might be 0.2 apart, but you need Alice to be further from Bob than from herself. The absolute distance matters less than the relative ordering.

Triplet loss enforces exactly this relative constraint. Instead of pairs, it uses triplets: an Anchor (A) sample, a Positive (P) sample of the same class as A, and a Negative (N) sample of a different class. The loss is zero as long as the anchor is closer to the positive than to the negative by at least a margin α. It doesn't care how far apart the clusters are in absolute terms — only that the ordering holds.

Running example: face embeddings in 2D space, with margin α = 0.5.

python

A1 = [0.8, 0.2]   # Anchor (Alice photo 1)
A2 = [0.75, 0.25] # Positive (Alice photo 2)
B1 = [0.1, 0.9]   # Negative (Bob photo 1)

The Triplet Constraint

The constraint: d(A,P)² + α < d(A,N)²

This says: the squared distance from the anchor to its positive must be smaller than the squared distance from the anchor to any negative, by at least margin α. If this holds, the triplet contributes zero loss. If it is violated, the loss equals the size of the violation.

The Formula

L = max(d(A,P)² − d(A,N)² + α, 0)

If d(A,P)² − d(A,N)² + α ≤ 0: d(A,N)² ≥ d(A,P)² + α, constraint satisfied → loss = 0
If d(A,P)² − d(A,N)² + α > 0: constraint violated → loss > 0

Note: squared distances are used to keep the loss smooth (no sqrt in the loss). The gradient of an unsquared Euclidean distance is undefined when two embeddings coincide.

Case 1: Well-Separated (Loss = 0)

Anchor A1=[0.8,0.2], Positive A2=[0.75,0.25], Negative B1=[0.1,0.9], α=0.5

d(A,P): ‖[0.8,0.2] − [0.75,0.25]‖ = √((0.05)² + (0.05)²) = √0.005 ≈ 0.0707

d(A,N): ‖[0.8,0.2] − [0.1,0.9]‖ = √((0.7)² + (0.7)²) = √0.98 ≈ 0.9899

L = max(0.0707² − 0.9899² + 0.5, 0) = max(0.005 − 0.980 + 0.5, 0) = max(−0.475, 0) = 0

The constraint is satisfied with plenty of margin (N is far from A, P is close to A).

Case 2: Constraint Violated (Loss > 0)

New anchor A_c2=[0.5,0.5], Positive P_c2=[0.48,0.52], Negative N_c2=[0.55,0.55], α=0.5

N is very close to A: its squared distance falls well inside d(A,P)² + α.

d(A,P): ‖[0.5,0.5] − [0.48,0.52]‖ = √((0.02)² + (0.02)²) = √0.0008 ≈ 0.0283

d(A,N): ‖[0.5,0.5] − [0.55,0.55]‖ = √((0.05)² + (0.05)²) = √0.005 ≈ 0.0707

L = max(0.0283² − 0.0707² + 0.5, 0) = max(0.0008 − 0.005 + 0.5, 0) = max(0.4958, 0) = 0.4958

The negative is only 0.0707 from the anchor, so d(A,N)² = 0.005 falls far short of the d(A,P)² + α = 0.0008 + 0.5 = 0.5008 that the constraint requires. The large loss pushes N away from A and pulls P toward A.
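To make the direction of that push concrete, here is a minimal sketch of one plain gradient-descent step on the Case 2 triplet. The gradients are derived by hand from the squared-distance formula; the helper names (`triplet_loss_sq`, `triplet_grads`) and the step size `lr = 0.1` are illustrative assumptions, not part of the original example.

```python
import numpy as np

def triplet_loss_sq(a, p, n, alpha=0.5):
    """Squared-distance triplet loss, matching the formula in this post."""
    return max(np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha, 0.0)

def triplet_grads(a, p, n, alpha=0.5):
    """Hand-derived gradients w.r.t. anchor, positive, and negative.

    When the hinge is inactive (loss == 0) all gradients are zero.
    """
    if triplet_loss_sq(a, p, n, alpha) == 0.0:
        zero = np.zeros_like(a)
        return zero, zero, zero
    grad_a = 2.0 * (n - p)   # derivative of |A-P|^2 - |A-N|^2 w.r.t. A
    grad_p = 2.0 * (p - a)   # descending this moves P toward A
    grad_n = 2.0 * (a - n)   # descending this moves N away from A
    return grad_a, grad_p, grad_n

# Case 2 triplet from the text
a = np.array([0.5, 0.5])
p = np.array([0.48, 0.52])
n = np.array([0.55, 0.55])

lr = 0.1  # illustrative step size (assumption)
g_a, g_p, g_n = triplet_grads(a, p, n)
# Move only P and N here so the effect on each distance is easy to read.
p_new = p - lr * g_p
n_new = n - lr * g_n

d_ap_before = np.linalg.norm(a - p)
d_ap_after = np.linalg.norm(a - p_new)
d_an_before = np.linalg.norm(a - n)
d_an_after = np.linalg.norm(a - n_new)
print(f"d(A,P): {d_ap_before:.4f} -> {d_ap_after:.4f}")
print(f"d(A,N): {d_an_before:.4f} -> {d_an_after:.4f}")
```

In a real network the step is applied to the weights rather than to the embeddings directly, so the movement is shared across every sample the weights touch.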

Trace Table

Case	d(A,P)	d(A,N)	d(A,P)²−d(A,N)²	+α	max(·,0)	Loss
Well-separated (A1,A2,B1)	0.0707	0.9899	0.005−0.980=−0.975	+0.5	max(−0.475,0)	0
Violated (A_c2,P_c2,N_c2)	0.0283	0.0707	0.0008−0.005=−0.0042	+0.5	max(0.4958,0)	0.4958

Triplet vs Contrastive

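Take a positive pair at distance D=0.3 and a negative pair at distance D=0.8. Contrastive loss penalizes the positive pair (it contributes 0.3²=0.09) and also the negative pair (D=0.8 is still inside the contrastive margin). Triplet loss looks only at the relative ordering: with the α=0.5 used throughout this post, 0.3² − 0.8² + 0.5 = −0.05 < 0, so loss = 0. The embeddings can slide around in space as long as the ordering is maintained.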
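A minimal numeric sketch of that comparison. It assumes the common pairwise form of contrastive loss (D² for similar pairs, max(m − D, 0)² for dissimilar pairs) with a contrastive margin m = 1.0 and a triplet margin α = 0.5. Both margins and the function names are assumptions chosen for illustration; the contrastive-loss post may use different values.

```python
def contrastive_loss(d, same_class, margin=1.0):
    """Pairwise contrastive loss on a single distance d (assumed form)."""
    if same_class:
        return d ** 2                      # pull similar pairs together
    return max(margin - d, 0.0) ** 2       # push dissimilar pairs past the margin

def triplet_loss_from_distances(d_ap, d_an, alpha=0.5):
    """Squared-distance triplet loss, as defined earlier in this post."""
    return max(d_ap ** 2 - d_an ** 2 + alpha, 0.0)

d_ap, d_an = 0.3, 0.8  # distances quoted in the paragraph above

pos_term = contrastive_loss(d_ap, same_class=True)    # 0.3^2
neg_term = contrastive_loss(d_an, same_class=False)   # (1.0 - 0.8)^2
triplet = triplet_loss_from_distances(d_ap, d_an)

print(f"contrastive: positive term = {pos_term:.2f}, negative term = {neg_term:.2f}")
print(f"triplet:     loss = {triplet:.2f}")
```

Under these assumed margins both contrastive terms are nonzero, while the triplet loss is already zero for the same two distances.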

Hard Negative Mining

Easy negative: d(A,N)² > d(A,P)² + α. Constraint satisfied, loss = 0, no learning signal.
Semi-hard negative: d(A,P)² < d(A,N)² < d(A,P)² + α. N is farther from A than P is, but still inside the margin, so the constraint is violated with a moderate loss (between 0 and α). These are often the most useful triplets for learning.
Hard negative: d(A,N)² < d(A,P)². N is actually closer to A than P is. The loss exceeds α, but gradients can be unstable.
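The three categories can be sketched as a small classifier. This version compares squared distances so that "easy" coincides exactly with zero loss under the squared-distance formula used in this post. The function name `classify_negative` and the third, swapped triplet are hypothetical additions for illustration.

```python
import numpy as np

def classify_negative(anchor, positive, negative, alpha=0.5):
    """Label a negative as 'easy', 'semi-hard', or 'hard' for one triplet.

    Compares squared Euclidean distances so that 'easy' coincides with
    zero loss under L = max(d(A,P)^2 - d(A,N)^2 + alpha, 0).
    """
    d_ap_sq = float(np.sum((anchor - positive) ** 2))
    d_an_sq = float(np.sum((anchor - negative) ** 2))
    if d_an_sq < d_ap_sq:
        return "hard"          # negative is closer to the anchor than the positive
    if d_an_sq < d_ap_sq + alpha:
        return "semi-hard"     # correct ordering, but inside the margin
    return "easy"              # margin satisfied, loss is zero

# Triplets from Case 1 and Case 2 of this post
case1 = (np.array([0.8, 0.2]), np.array([0.75, 0.25]), np.array([0.1, 0.9]))
case2 = (np.array([0.5, 0.5]), np.array([0.48, 0.52]), np.array([0.55, 0.55]))
# Hypothetical third triplet: swap P and N of Case 2 so N sits closer than P
case3 = (case2[0], case2[2], case2[1])

for name, triplet in [("Case 1", case1), ("Case 2", case2), ("Swapped", case3)]:
    print(name, "->", classify_negative(*triplet))
```

With these inputs Case 1 comes out easy, Case 2 semi-hard, and the swapped triplet hard.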

Online triplet mining pseudocode:

text

for each mini-batch B:
    compute pairwise squared distances for all samples in B
    for each anchor A:
        find all valid positives P (same class, different sample)
        for each positive P, find all semi-hard negatives N:
            d(A,P)^2 < d(A,N)^2 < d(A,P)^2 + alpha
        form triplets (A, P, N) and compute loss
    backprop on mean triplet loss
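The pseudocode above can be sketched in NumPy as follows. This covers only the forward pass (distance matrix, semi-hard selection, mean loss); backpropagation would need an autodiff framework. The function names and the toy batch are assumptions for illustration, and the plain Python loops favor readability over speed.

```python
import numpy as np

def pairwise_sq_dists(x):
    """All-pairs squared Euclidean distances for a (B, D) batch."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def semi_hard_triplet_loss(embeddings, labels, alpha=0.5):
    """Mean squared-distance triplet loss over semi-hard triplets in a batch.

    Returns (mean_loss, number_of_triplets_used).
    """
    d = pairwise_sq_dists(embeddings)
    losses = []
    batch_size = len(labels)
    for a in range(batch_size):
        same = labels == labels[a]
        positives = np.where(same)[0]
        positives = positives[positives != a]      # same class, different sample
        negatives = np.where(~same)[0]
        for p in positives:
            # semi-hard: farther than the positive, but still inside the margin
            mask = (d[a, negatives] > d[a, p]) & (d[a, negatives] < d[a, p] + alpha)
            for n in negatives[mask]:
                losses.append(d[a, p] - d[a, n] + alpha)   # > 0 by construction
    if not losses:
        return 0.0, 0
    return float(np.mean(losses)), len(losses)

# Toy batch: two samples per identity (values are illustrative)
embeddings = np.array([[0.80, 0.20], [0.75, 0.25],    # class 0
                       [0.60, 0.40], [0.10, 0.90]])   # class 1
labels = np.array([0, 0, 1, 1])

loss, n_triplets = semi_hard_triplet_loss(embeddings, labels)
print(f"semi-hard triplets: {n_triplets}, mean loss: {loss:.4f}")
```

On this toy batch four of the eight valid triplets are semi-hard. The rest are either easy or hard and are skipped.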

Code

python

import numpy as np

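def euclidean(e1, e2):
    """Euclidean (L2) distance between two embedding vectors."""
    return np.sqrt(np.sum((e1 - e2) ** 2))

def triplet_loss(anchor, positive, negative, alpha=0.5):
    """Return (loss, d(A,P), d(A,N)) for one triplet."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    # Hinge on squared distances: zero once d(A,N)² exceeds d(A,P)² + alpha
    loss = max(d_ap**2 - d_an**2 + alpha, 0.0)
    return loss, d_ap, d_an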

A1 = np.array([0.8, 0.2])
A2 = np.array([0.75, 0.25])
B1 = np.array([0.1, 0.9])

# Case 1: well-separated (loss=0)
loss1, dap1, dan1 = triplet_loss(A1, A2, B1)
print(f"Case 1 (well-separated):")
print(f"  d(A,P) = {dap1:.4f}, d(A,N) = {dan1:.4f}")
print(f"  d(A,P)² - d(A,N)² + α = {dap1**2:.4f} - {dan1**2:.4f} + 0.5 = {dap1**2 - dan1**2 + 0.5:.4f}")
print(f"  Loss = {loss1:.4f}")

# Case 2: violated (loss>0)
A_c2 = np.array([0.5, 0.5])
P_c2 = np.array([0.48, 0.52])
N_c2 = np.array([0.55, 0.55])
loss2, dap2, dan2 = triplet_loss(A_c2, P_c2, N_c2)
print(f"\nCase 2 (constraint violated):")
print(f"  d(A,P) = {dap2:.4f}, d(A,N) = {dan2:.4f}")
print(f"  d(A,P)² - d(A,N)² + α = {dap2**2:.4f} - {dan2**2:.4f} + 0.5 = {dap2**2 - dan2**2 + 0.5:.4f}")
print(f"  Loss = {loss2:.4f}")

text

Case 1 (well-separated):
  d(A,P) = 0.0707, d(A,N) = 0.9899
  d(A,P)² - d(A,N)² + α = 0.0050 - 0.9800 + 0.5 = -0.4750
  Loss = 0.0000

Case 2 (constraint violated):
  d(A,P) = 0.0283, d(A,N) = 0.0707
  d(A,P)² - d(A,N)² + α = 0.0008 - 0.0050 + 0.5 = 0.4958
  Loss = 0.4958

Related Concepts

Triplet loss builds on contrastive loss (11-contrastive-loss.md): both learn embeddings via distance-based objectives, but triplets replace the absolute thresholds with a relative constraint, so the objective no longer depends on a task-specific notion of "close enough". Modern contrastive learning (SimCLR's NT-Xent loss, CLIP's InfoNCE loss) extends triplets to n-way comparisons within a batch: every sample serves as an anchor and, at the same time, as a negative for the other samples. InfoNCE (used in CLIP) can be seen as softmax cross-entropy over similarities: the loss is smallest when the "correct" positive has the highest similarity, and it grows with every negative whose similarity approaches or exceeds that of the positive.
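The softmax view of InfoNCE can be sketched for a single anchor. This is a simplified illustration, not the SimCLR or CLIP implementation: the temperature of 0.1, the function name `info_nce`, and the extra negative are all assumptions.

```python
import numpy as np

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE for one anchor: cross-entropy over cosine similarities.

    candidates holds one positive (at positive_index) plus the negatives.
    """
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (c @ a) / temperature          # scaled cosine similarities
    logits = logits - logits.max()          # shift for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[positive_index]

anchor = np.array([0.8, 0.2])
candidates = np.array([[0.75, 0.25],   # positive (Alice photo 2)
                       [0.10, 0.90],   # negative (Bob photo 1)
                       [0.20, 0.70]])  # hypothetical extra negative

loss_good = info_nce(anchor, candidates, positive_index=0)
# Treat index 1 as the positive instead: it is dissimilar to the anchor,
# so the loss becomes much larger.
loss_bad = info_nce(anchor, candidates, positive_index=1)
print(f"positive is most similar:  {loss_good:.4f}")
print(f"positive is least similar: {loss_bad:.4f}")
```

Unlike the hinge in triplet loss, this objective never reaches exactly zero, so every negative in the batch keeps contributing some gradient.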

Honest Limitations

The number of possible triplets is O(n³): for 10,000 training samples, that is up to roughly 10^12 triplets. Even selecting a random subset is expensive. Online mining within mini-batches (computing O(B²) distances per batch of size B) is standard, but it means each training step only sees a tiny fraction of possible triplets.
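A quick count shows the gap between those numbers. The class split (100 classes of 100 samples) and the batch size of 256 are hypothetical values picked for illustration; only the 10,000-sample figure comes from the text.

```python
from math import comb

def count_valid_triplets(class_sizes):
    """Ordered (anchor, positive, negative) triplets for given class sizes."""
    total = sum(class_sizes)
    return sum(s * (s - 1) * (total - s) for s in class_sizes)

n = 10_000
upper_bound = n ** 3                         # the O(n^3) figure: 10^12

# Hypothetical split: 100 classes of 100 samples each
valid = count_valid_triplets([100] * 100)

batch_size = 256                             # illustrative mini-batch size
pair_distances = comb(batch_size, 2)         # distinct pairs per batch

print(f"n^3 upper bound:     {upper_bound:.2e}")
print(f"valid triplets:      {valid:.2e}")
print(f"distances per batch: {pair_distances}")
```

Even with the class structure taken into account, the valid-triplet count stays in the billions, while one batch needs only tens of thousands of distances.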

Margin α has no principled default and interacts with the scale of the embedding space. If embeddings are L2-normalized to the unit sphere, α ∈ [0.2, 0.5] is common. Without normalization, the right α depends on the variance of the embedding distribution, which changes during training. With a too-small α, most triplets are "easy" (loss=0) after a few epochs and stop contributing gradient. With a too-large α, nearly every triplet stays in violation, including hard negatives that create unstable gradients.
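The scale dependence is easy to see numerically. The sketch below evaluates the Case 2 triplet with the same α = 0.5 three ways: as given, with every embedding multiplied by 10, and after L2 normalization. The scale factor and the helper names are illustrative assumptions.

```python
import numpy as np

def triplet_loss_sq(a, p, n, alpha):
    """Squared-distance triplet loss from this post."""
    return max(np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha, 0.0)

def l2_normalize(x):
    """Project a vector onto the unit sphere."""
    return x / np.linalg.norm(x)

# Case 2 triplet from this post
a = np.array([0.5, 0.5])
p = np.array([0.48, 0.52])
n = np.array([0.55, 0.55])
alpha = 0.5

raw = triplet_loss_sq(a, p, n, alpha)
# Scaling every embedding by 10 scales squared distances by 100
scaled = triplet_loss_sq(10 * a, 10 * p, 10 * n, alpha)
# On the unit sphere, squared distances are bounded by 4
unit = triplet_loss_sq(l2_normalize(a), l2_normalize(p), l2_normalize(n), alpha)

print(f"raw:        {raw:.4f}")
print(f"scaled x10: {scaled:.4f}")
print(f"normalized: {unit:.4f}")
```

The same three points and the same margin give three different losses. After normalization the negative here lands on the anchor, because both point in the same direction, which is one reason α is usually tuned together with the normalization choice.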

Triplet loss assumes that the Euclidean distance in embedding space reflects semantic similarity. If the network is poorly initialized, early embeddings are random — hard negative mining then finds random samples as negatives, and the loss signal is noisy. Curriculum strategies (starting with semi-hard mining and introducing hard mining gradually) mitigate this but add training complexity.

Test Your Understanding

Given A=[0.5,0.5], P=[0.55,0.45], N=[0.6,0.4], α=0.3. Compute d(A,P) and d(A,N). Is this an easy, semi-hard, or hard negative? Compute the triplet loss.
In Case 1 (A1,A2,B1), the loss is 0. Now shift B1 to [0.5,0.5] (much closer to A1). Compute the new d(A,N) and loss. What does this tell you about the role of negative sample positioning?
With n=1000 training samples and 10 classes (100 samples per class), how many valid triplets exist? Roughly what share of them would you expect to be easy (loss=0) early in training, when embeddings are essentially random, and why does that share grow as training progresses?
NT-Xent (SimCLR loss) can be seen as triplet loss with all other samples in the batch as negatives simultaneously. For a batch of 256 images, how many negatives does each anchor have in NT-Xent vs standard triplet mining? Why is this beneficial?
A face recognition model trained with triplet loss achieves 99.5% verification accuracy. The embedding space has 128 dimensions. An engineer argues they should reduce to 32 dimensions for deployment efficiency. What effect would this dimensionality reduction likely have on the triplet loss landscape and on verification accuracy?
