
Loss Function vs Cost Function

Jul 1, 2026 · 6 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The terms "loss function" and "cost function" are used interchangeably in most textbooks, tutorials, and framework documentation. That ambiguity hides a distinction that matters when you are reasoning about batch size, mini-batch SGD, and what exactly the optimizer is minimizing at each step.

This post establishes precise definitions and makes the distinction concrete with numbers.


Precise Definitions

Loss function L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾): the error for a single training sample i. It measures how wrong the prediction ŷ⁽ⁱ⁾ is compared to the true label y⁽ⁱ⁾.

Cost function J(W, b): the aggregate error across the entire training set:

J = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

The cost function is a function of the model parameters (W, b) because the predictions ŷ depend on those parameters. The loss function is a function of a single prediction-label pair.
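A minimal sketch of the two definitions, assuming a single-feature logistic model. The feature values `X` and the parameter settings passed to `cost` are hypothetical and chosen only for illustration. `loss` sees one prediction-label pair, while `cost` is evaluated at a parameter setting (W, b).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat_i, y_i):
    """Binary cross-entropy for ONE prediction-label pair."""
    return -(y_i * np.log(y_hat_i) + (1 - y_i) * np.log(1 - y_hat_i))

def cost(W, b, X, y):
    """Mean of the per-sample losses; a function of the parameters (W, b)."""
    y_hat = sigmoid(X * W + b)
    return np.mean([loss(p, t) for p, t in zip(y_hat, y)])

# Hypothetical single-feature data, for illustration only
X = np.array([0.2, 1.5, -0.7, 1.1, -0.1])
y = np.array([0, 1, 0, 1, 0])

print(loss(0.9, 1))                    # one pair in, one number out
print(cost(W=0.0, b=0.0, X=X, y=y))    # every prediction is 0.5, so J = log(2)
print(cost(W=2.0, b=-1.0, X=X, y=y))   # same data, different parameters, different J
```

The data stays fixed across the two `cost` calls; only the parameters change, which is why J is written as J(W, b).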

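Objective function: what you actually minimize during training. It can be the cost function plus regularization penalty terms such as L1 or L2. (Dropout is also a regularizer, but it works by randomly dropping units during training rather than by adding a term to the cost.) Regularization penalties are added to the cost, not to the individual loss.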
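As a rough illustration of where the penalty enters, the sketch below adds an L2 term of the form (λ/2) Σ w² once to the cost. The per-sample losses, the weights `w`, and the strength `lam` are all hypothetical placeholder values.

```python
import numpy as np

losses = np.array([0.8, 0.2, 0.4])   # placeholder per-sample losses
w = np.array([0.5, -1.0, 2.0])       # hypothetical weights
lam = 0.01                           # hypothetical regularization strength

cost = losses.mean()                       # J: mean of the per-sample losses
l2_penalty = 0.5 * lam * np.sum(w ** 2)    # depends on the weights only, not on any sample
objective = cost + l2_penalty              # the quantity the optimizer minimizes

print(f"J = {cost:.4f}, penalty = {l2_penalty:.5f}, objective = {objective:.4f}")
```

The penalty is computed once per parameter setting rather than once per sample, which is why it belongs with the cost.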

The distinction the field ignores: most papers and documentation write "loss function" when they mean cost function. The precise names are most useful when reasoning about the difference between full-batch gradient descent (each step uses the gradient of J over all m samples), mini-batch SGD (each step approximates J using k samples), and online learning (each step uses the loss L of a single sample).


Anchor Computation

This section uses the 5-sample churn dataset from the ANN post. After one forward pass through the network, the predictions are:

```text
y     = [0,     1,     0,     1,     0    ]   (true labels)
ŷ     = [0.541, 0.823, 0.312, 0.791, 0.458]   (predictions)
```

Binary cross-entropy loss per sample, L = −[y log(ŷ) + (1 − y) log(1 − ŷ)], using the natural logarithm:

Sample 1 (y=0): L₁ = −log(1 − 0.541) = −log(0.459) = 0.779

Sample 2 (y=1): L₂ = −log(0.823) = 0.195

Sample 3 (y=0): L₃ = −log(1 − 0.312) = −log(0.688) = 0.374

Sample 4 (y=1): L₄ = −log(0.791) = 0.234

Sample 5 (y=0): L₅ = −log(1 − 0.458) = −log(0.542) = 0.612

Cost J = (0.779 + 0.195 + 0.374 + 0.234 + 0.612) / 5 = 2.194 / 5 = 0.439

Sample 1 has the highest individual loss (0.779): the network predicts a 54.1% churn probability, but the true label is 0 (no churn). It is the sample most in need of correction.


Why the Distinction Matters for Optimization

Full-batch gradient descent computes the gradient of J over all 5 samples, then applies one weight update:

ΔW = −η × (1/5) Σᵢ ∂L⁽ⁱ⁾/∂W

Online learning (batch size = 1) updates weights after every single sample:

ΔW = −η × ∂L⁽¹⁾/∂W, then ΔW = −η × ∂L⁽²⁾/∂W, etc.
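The two update rules above can be sketched as follows, assuming a single-feature logistic model with BCE loss, for which ∂L/∂w = (ŷ − y)·x and ∂L/∂b = (ŷ − y). The features `X`, the learning rate `eta`, and the zero initialization are hypothetical choices, not values from the churn network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, b, X, y):
    """dL_i/dw and dL_i/db for a logistic model with BCE loss."""
    err = sigmoid(X * w + b) - y      # (y_hat - y) for each sample
    return err * X, err

# Hypothetical single-feature data and learning rate, for illustration only
X = np.array([0.2, 1.5, -0.7, 1.1, -0.1])
y = np.array([0, 1, 0, 1, 0])
eta = 0.1

# Full-batch: average the 5 per-sample gradients, then update once
w, b = 0.0, 0.0
gw, gb = per_sample_grads(w, b, X, y)
w_full, b_full = w - eta * gw.mean(), b - eta * gb.mean()

# Online: update after every sample, so each gradient sees the latest weights
w_on, b_on = 0.0, 0.0
for xi, yi in zip(X, y):
    gw_i, gb_i = per_sample_grads(w_on, b_on, np.array([xi]), np.array([yi]))
    w_on, b_on = w_on - eta * gw_i[0], b_on - eta * gb_i[0]

print(f"full-batch (1 update):  w={w_full:.4f}, b={b_full:.4f}")
print(f"online     (5 updates): w={w_on:.4f}, b={b_on:.4f}")
```

After one pass over the data the two procedures end at different weights: the full-batch step follows the gradient of J once, while the online loop takes five steps, each on a single L⁽ⁱ⁾.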

Mini-batch SGD uses k samples (typically 32, 64, or 128) at a time. The mini-batch gradient approximates the full gradient: the estimate gets closer to the true gradient as k grows, while small k gives a noisier estimate that is cheaper to compute per step.
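To get a feel for how the noise shrinks with k, here is a small simulation. It does not use a real network: the per-sample gradients for one weight are synthetic (drawn from a normal distribution), and the mini-batches are sampled with replacement, both of which are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-sample gradients for a single weight (hypothetical values)
m = 10_000
per_sample_grad = rng.normal(loc=1.0, scale=2.0, size=m)
full_grad = per_sample_grad.mean()    # gradient of the cost J

stds = {}
for k in (1, 32, 128, 1024):
    # Draw 2,000 random mini-batches of size k and average each one
    batches = rng.choice(per_sample_grad, size=(2000, k))
    estimates = batches.mean(axis=1)
    stds[k] = estimates.std()
    print(f"k={k:5d}  mean estimate={estimates.mean():.3f}  "
          f"std of estimate={stds[k]:.3f}")

print(f"full-batch gradient = {full_grad:.3f}")
```

In this setup the mean estimate stays near the full-batch gradient for every k, while the spread of the estimate falls roughly as 1/√k.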

The noise from small batches is not purely bad. It is often observed to act as implicit regularization: slightly different gradients each step make it harder for the optimizer to settle into sharp, narrow minima, which tend to generalize poorly. The flatter minima that mini-batch SGD tends to find are widely reported to generalize better to test data, although this is an empirical tendency rather than a guarantee.

[Figure: Cost Landscape: Full-Batch GD vs Mini-Batch SGD. From the start point toward min J, full-batch GD follows a smooth path while mini-batch SGD follows a noisy but fast one. Each mini-batch step approximates the gradient using k samples, not all m.]

Code

```python
import numpy as np

y     = np.array([0, 1, 0, 1, 0])
y_hat = np.array([0.541, 0.823, 0.312, 0.791, 0.458])

# Per-sample BCE losses; the 1e-8 term guards against log(0)
losses = -(y * np.log(y_hat + 1e-8) + (1 - y) * np.log(1 - y_hat + 1e-8))
# Cost J: mean of the per-sample losses
cost = losses.mean()

print("Per-sample losses (L per sample):")
for i, (yi, yhi, li) in enumerate(zip(y, y_hat, losses), 1):
    print(f"  Sample {i}: y={yi}, ŷ={yhi:.3f}, L={li:.4f}")
print(f"\nCost J (mean of all losses): {cost:.4f}")
```

```text
Per-sample losses (L per sample):
  Sample 1: y=0, ŷ=0.541, L=0.7787
  Sample 2: y=1, ŷ=0.823, L=0.1948
  Sample 3: y=0, ŷ=0.312, L=0.3740
  Sample 4: y=1, ŷ=0.791, L=0.2345
  Sample 5: y=0, ŷ=0.458, L=0.6125

Cost J (mean of all losses): 0.4389
```

Where this builds from: Backpropagation computes the gradient of the loss with respect to each weight. That gradient is then summed (or averaged) over a batch to form the gradient of the cost, which drives the weight update. The chain rule and the backpropagation post established how single-sample gradients are computed.

Where this leads: Regression losses and classification losses (next posts) are specific formulas for L. Optimizers (section 5) differ in how they use the cost gradient — vanilla gradient descent, SGD, momentum, Adam. Each represents a different strategy for navigating the cost landscape shown above.


Honest Limitations

The cost function (mean of losses) weights all training samples equally. In class-imbalanced datasets (for example 95% class 0, 5% class 1) the cost is dominated by the majority class. The minority class's losses are averaged in but carry little total weight. Class-weighted losses address this by amplifying the contribution of minority-class samples to the cost; focal loss takes a related approach by down-weighting easy, well-classified samples so that hard ones dominate.
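A minimal sketch of class weighting on the five anchor samples. That dataset is not heavily imbalanced, so this only shows the mechanics. The weight `pos_weight = 3.0` is a hypothetical value, and dividing by the sum of weights is one of several possible normalizations.

```python
import numpy as np

y     = np.array([0, 1, 0, 1, 0])
y_hat = np.array([0.541, 0.823, 0.312, 0.791, 0.458])

losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical weighting: class 1 samples count 3x as much as class 0 samples
pos_weight = 3.0
weights = np.where(y == 1, pos_weight, 1.0)

plain_cost = losses.mean()
weighted_cost = np.sum(weights * losses) / np.sum(weights)   # weighted mean

# Share of the total that comes from class 1 samples, before and after weighting
share_plain = losses[y == 1].sum() / losses.sum()
share_weighted = (weights * losses)[y == 1].sum() / (weights * losses).sum()

print(f"plain J    = {plain_cost:.4f}, class-1 share = {share_plain:.1%}")
print(f"weighted J = {weighted_cost:.4f}, class-1 share = {share_weighted:.1%}")
```

The per-sample losses are unchanged; only their contribution to the aggregate changes, which in turn changes the gradient the optimizer follows.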

Minimizing the cost on the training set is not the same as minimizing error on the test set. The cost can approach zero (a near-perfect training fit) while test error stays high: this is overfitting. The relationship between the cost landscape and generalization is what regularization is designed to address.


Test Your Understanding

  1. A dataset has 1,000 samples. You run full-batch gradient descent (batch size = 1,000) and also mini-batch SGD with batch size = 32. Full-batch takes 5 seconds per step; mini-batch takes 0.02 seconds per step. How many weight updates per second does each approach give? Which is faster for reducing the cost by a fixed amount, assuming the noisy mini-batch gradient points in approximately the right direction?

  2. In the anchor example, Sample 1 has loss L₁ = 0.779 and Sample 2 has loss L₂ = 0.195. During full-batch gradient descent, both gradients are weighted equally (1/m) in the weight update. During online learning (batch size = 1), if Sample 1 arrives first, what happens to the weights relative to processing Sample 2 first?

  3. The cost function J = (1/m) Σ L is differentiable with respect to the weights W. The individual loss L⁽ⁱ⁾ is also differentiable. Show that ∂J/∂W = (1/m) Σᵢ ∂L⁽ⁱ⁾/∂W — that is, the gradient of the cost is the average of the gradients of the individual losses.

  4. A practitioner uses batch size = 1 and observes that the training loss oscillates wildly rather than decreasing monotonically. They switch to batch size = 32 and observe smoother but still noisy convergence. At batch size = 1000 (full batch), the loss decreases smoothly. Why does larger batch size produce smoother loss trajectories, and what is the theoretical trade-off?

  5. The objective function in L2 regularization is J_reg = J + λ/2 × Σ w². This is the cost function plus a penalty term. When computing the gradient ∂J_reg/∂wⱼ, what does the regularization term contribute? How does this differ from computing ∂L⁽ⁱ⁾/∂wⱼ for a single sample?
