Random Variables

Apr 11, 2026 • 11 min read • By Mohammed Vasim

Statistics, Math, Data Science

Before you run a batch of API requests, the number of errors you will get is unknown. It could be 0, 1, 2, or more. After you run the batch, you have a specific number. Before that moment, it is a random variable. Every model metric you compute — accuracy, loss, error count, latency — is a random variable before you observe it.

Random variables are the formal tool for reasoning about uncertain quantities. Without them, you cannot write down what a probability distribution means, what an expected value is, or why the Central Limit Theorem applies to your model's batch performance.

The Anchor Datasets

This post requires two anchors: one discrete, one continuous.

Discrete anchor — number of model API errors in a 10-request batch, observed over 20 batches:

python

n_errors = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0]
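# Working PMF used in this post (rounded; 0.05 is set aside for a ≥4 tail):
# P(X=0) ≈ 0.50, P(X=1) ≈ 0.25, P(X=2) ≈ 0.15, P(X=3) ≈ 0.05, P(X≥4) ≈ 0.05
# Raw frequencies in these 20 batches: 11/20, 5/20, 3/20, 1/20, 0/20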
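As a quick check, the raw relative frequencies can be counted straight from the list. This is a minimal sketch using only the standard library; the name `empirical_pmf` is introduced here for illustration. The raw share of zero-error batches comes out at 0.55 and no batch has four or more errors, so the rounded values in the comment above are an approximation of this sample rather than an exact count.

```python
from collections import Counter

n_errors = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0]

# Count how often each error count occurs, then divide by the number of batches.
counts = Counter(n_errors)
empirical_pmf = {k: counts[k] / len(n_errors) for k in sorted(counts)}

for k, p in empirical_pmf.items():
    print(f"P(X={k}) = {p:.2f}  ({counts[k]}/{len(n_errors)} batches)")
```

With only 20 batches, each batch moves a probability by 0.05, which is why the estimates are coarse.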

Continuous anchor — model inference latency in ms:

python

# response_time ~ Normal(mu=45, sigma=8)
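# The Normal model allows any real value; real latencies are positive, and
# P(X < 0) is negligible here because 0 lies more than 5 sigma below the mean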

What Is a Random Variable?

A random variable is a function that maps outcomes of a random experiment to numbers:

$X : \Omega \to \mathbb{R}$

The "random" refers to the process producing the outcome, not the variable itself. X is a precise rule: given the outcome of a batch (which requests arrived, which were edge cases), it tells you the error count.

X vs x: The random variable X is the abstract rule. A lowercase x is one observed value. After batch 1, you observed x₁ = 0. After batch 9, x₉ = 3. The variable X remains the same rule throughout — it is the 20 observations x₁ through x₂₀ that change.
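To make the "X is a rule, x is a value" distinction concrete, here is a small sketch. It assumes an outcome is represented as a tuple of 10 failure flags and that each request fails independently with probability `p_fail = 0.07`; both the representation and the number are illustrative choices, not something measured in this post.

```python
import random

def error_count(outcome):
    """The random variable X: a fixed rule mapping one batch outcome to a number."""
    return sum(outcome)

rng = random.Random(0)   # seeded so the sketch is repeatable
p_fail = 0.07            # hypothetical per-request failure probability

# One outcome: a tuple of 10 flags, True where a request failed.
outcome = tuple(rng.random() < p_fail for _ in range(10))

# One realized value x of the random variable X.
x = error_count(outcome)
print(f"outcome = {outcome}")
print(f"x = {x}")
```

The function `error_count` never changes. Only the outcome fed into it does, and that is where the randomness lives.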

Discrete Random Variables

A discrete random variable takes values in a countable set, typically the integers.

The Probability Mass Function (PMF) assigns a probability to each possible value:

$P(X = k) \geq 0 \quad \text{and} \quad \sum_{k} P(X = k) = 1$

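From the 20 observed batches, an approximate PMF for n_errors. The probabilities are rounded: the raw share of zero-error batches is 11/20 = 0.55 and no observed batch had four or more errors, so the 0.05 in the last row can be read as an allowance for a tail this sample has not shown yet: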

k	P(X = k)	Meaning
0	0.50	Half of batches have no errors
1	0.25	Quarter of batches have one error
2	0.15	15% have two errors
3	0.05	5% have three errors
≥4	0.05	5% have four or more errors

Verify: 0.50 + 0.25 + 0.15 + 0.05 + 0.05 = 1.00 ✓

P(X ≤ 1) = P(X=0) + P(X=1) = 0.50 + 0.25 = 0.75

Expected Value:

$E [X] = \sum_{k} k \cdot P (X = k)$

$= 0 \times 0.50 + 1 \times 0.25 + 2 \times 0.15 + 3 \times 0.05 = 0 + 0.25 + 0.30 + 0.15 = 0.70$

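This sum covers only k = 0 to 3. Counting the ≥4 bucket as exactly 4 errors, the smallest value it can hold, adds 0.05 × 4 = 0.20, so E[X] ≈ 0.90 errors per batch (a lower bound if larger counts occur). The expected value does not have to be an achievable value: you cannot have 0.9 errors in a single batch.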

Variance:

k	P(X=k)	k·P(X=k)	k²·P(X=k)
0	0.50	0.00	0.00
1	0.25	0.25	0.25
2	0.15	0.30	0.60
3	0.05	0.15	0.45
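Sum (k ≤ 3)	0.95	E[X] = 0.70	E[X²] = 1.30

$Var(X) = E[X^{2}] - (E[X])^{2} = 1.30 - (0.70)^{2} = 1.30 - 0.49 = 0.81$

These figures use only the rows k = 0 to 3, whose probabilities sum to 0.95. Including the ≥4 bucket would raise E[X], E[X²], and the variance.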
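The variance table stops at k = 3. A minimal sketch, assuming the ≥4 bucket is counted as exactly 4 errors (a lower bound chosen for illustration, not a figure from the data), shows how much the moments move once the tail is included. For comparison it also computes the sample mean and population variance of the raw 20 observations; the helper name `moments` is hypothetical.

```python
from statistics import fmean, pvariance

n_errors = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0]

def moments(pmf):
    """Return (E[X], E[X^2], Var(X)) for a dict {value: probability}."""
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    ex = sum(k * p for k, p in pmf.items())
    ex2 = sum(k**2 * p for k, p in pmf.items())
    return ex, ex2, ex2 - ex**2

# Assumption: every batch in the ">=4" bucket has exactly 4 errors.
pmf_with_tail = {0: 0.50, 1: 0.25, 2: 0.15, 3: 0.05, 4: 0.05}
ex, ex2, var = moments(pmf_with_tail)
print(f"With tail: E[X]={ex:.2f}, E[X^2]={ex2:.2f}, Var(X)={var:.2f}")

# Sample mean and population variance of the raw observations.
raw_mean, raw_var = fmean(n_errors), pvariance(n_errors)
print(f"Raw data:  mean={raw_mean:.2f}, variance={raw_var:.2f}")
```

With the tail counted as 4, the moments become E[X] = 0.90, E[X²] = 2.10, and Var(X) = 1.29. The raw data give a mean of 0.70 and a variance of 0.81, which match the truncated-table figures.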

Continuous Random Variables

A continuous random variable takes any real value in a range. Inference latency can be 44.7ms, 44.73ms, 44.731ms — the precision is limited only by the clock.

The key difference: P(X = x) = 0 for any exact value. The probability that latency is exactly 45.0000 ms is zero, not because values near 45 never occur, but because a single point on the real line has zero width and therefore carries no probability under a continuous distribution. You can only ask about intervals.

Probability Density Function (PDF):

$P (a \leq X \leq b) = \int_{a}^{b} f (x) d x$

Three important properties:

f(x) ≥ 0 everywhere
∫ f(x) dx = 1 over the full support
f(x) is not a probability. It is a density — a rate of probability per unit of x. f(x) can be greater than 1.

For response_time ~ Normal(μ=45, σ=8), the PDF peak at x=45 is f(45) = 1/(8√(2π)) ≈ 0.0499. That is a density value, not a probability.
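A short sketch can make the density-versus-probability point tangible. It uses the standard library's `statistics.NormalDist` to stay dependency-free; the 0.1 ms window and the second distribution with sigma = 0.1 are illustrative choices, not values from the anchor dataset.

```python
from statistics import NormalDist

latency = NormalDist(mu=45, sigma=8)
peak_density = latency.pdf(45)        # density at the mean, about 0.0499 per ms

# For a narrow window, probability is roughly density times width.
width = 0.1
window_prob = latency.cdf(45 + width / 2) - latency.cdf(45 - width / 2)
approx_prob = peak_density * width

# A tighter distribution has a peak density above 1 and still integrates to 1.
narrow = NormalDist(mu=0, sigma=0.1)
narrow_peak = narrow.pdf(0)           # about 3.99

print(f"f(45) = {peak_density:.4f}")
print(f"P(44.95 <= X <= 45.05) = {window_prob:.5f} vs f(45) * width = {approx_prob:.5f}")
print(f"peak density with sigma=0.1: {narrow_peak:.2f}")
```

The window probability shrinks toward zero as the window narrows, which is the numerical face of P(X = x) = 0.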

Cumulative Distribution Function (CDF)

The CDF F(x) = P(X ≤ x) works for both discrete and continuous random variables and is the single most useful function for computing probabilities.

Properties of F(x):

F(x) is non-decreasing — probability never goes backward as x increases
As x → −∞, F(x) → 0
As x → +∞, F(x) → 1

Discrete CDF: Step Function

For n_errors, the CDF jumps at each integer value:

F(0) = P(X ≤ 0) = P(X=0) = 0.50
F(1) = P(X ≤ 1) = 0.50 + 0.25 = 0.75
F(2) = P(X ≤ 2) = 0.75 + 0.15 = 0.90
F(3) = P(X ≤ 3) = 0.90 + 0.05 = 0.95

P(X > 1) = 1 - F(1) = 1 - 0.75 = 0.25 — one in four batches has more than one error.
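The step shape follows directly from accumulating the PMF. Below is a minimal sketch built on the k = 0 to 3 rows of the PMF table; the function name `cdf` and the use of `bisect` are implementation choices made here for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

values = [0, 1, 2, 3]
probs = [0.50, 0.25, 0.15, 0.05]       # PMF rows for k = 0..3
cum_probs = list(accumulate(probs))    # running totals: F(0), F(1), F(2), F(3)

def cdf(x):
    """Step-function CDF F(x) = P(X <= x), flat between integers.

    Only meaningful for x < 4: the remaining 0.05 sits in the >=4 bucket,
    whose exact values are unknown.
    """
    i = bisect_right(values, x)        # number of support points <= x
    return cum_probs[i - 1] if i else 0.0

for x in (-1, 0, 0.5, 1, 1.5, 2, 3):
    print(f"F({x}) = {cdf(x):.2f}")
print(f"P(X > 1) = {1 - cdf(1):.2f}")
```

Note that F(0.5) equals F(0) and F(1.5) equals F(1): nothing happens between integers, and all the probability arrives in jumps.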

Continuous CDF: S-Curve

For response_time ~ Normal(45, 8), F(x) = P(X ≤ x) is a smooth S-shaped curve:

F(45) = 0.50 (by symmetry of the Normal — half the latencies are below the mean)
F(50) ≈ 0.734
F(40) ≈ 0.266

P(40 ≤ X ≤ 50) = F(50) − F(40) = 0.734 − 0.266 = 0.468
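The values 0.734 and 0.266 come from standardizing: subtract the mean, divide by the standard deviation, and look up the standard Normal CDF. A sketch of that step, writing the standard Normal CDF with the error function from Python's `math` module (the helper names are introduced here for illustration):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi(z), the standard Normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 45, 8

def latency_cdf(x):
    # Standardize first: z counts how many standard deviations x is from the mean.
    return std_normal_cdf((x - mu) / sigma)

f40, f50 = latency_cdf(40), latency_cdf(50)   # z = -0.625 and z = +0.625
print(f"F(40) = {f40:.3f}, F(50) = {f50:.3f}, P(40 <= X <= 50) = {f50 - f40:.3f}")
```

Because 40 and 50 sit symmetrically around the mean, F(40) and F(50) add up to 1.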

Python Example

python

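from scipy import stats

# Discrete anchor: PMF rows for k = 0..3. The >=4 bucket (0.05) is left out,
# so these moments are computed over probabilities that sum to 0.95.
k_vals = [0, 1, 2, 3]
probs  = [0.50, 0.25, 0.15, 0.05]

E_X  = sum(k * p for k, p in zip(k_vals, probs))       # first moment
E_X2 = sum(k**2 * p for k, p in zip(k_vals, probs))    # second moment
Var_X = E_X2 - E_X**2
F_1 = sum(probs[:2])                                   # F(1) = P(X=0) + P(X=1)

# Continuous anchor: response_time ~ Normal(mu=45, sigma=8)
mu, sigma = 45, 8
dist = stats.norm(mu, sigma)
prob_40_50  = dist.cdf(50) - dist.cdf(40)
prob_above_60 = dist.sf(60)                            # survival function, 1 - CDF

print(f"Discrete: E[X]={E_X:.3f}, Var[X]={Var_X:.4f}")
print(f"Discrete: F(1) = {F_1:.2f}")
print(f"Continuous: P(40 <= X <= 50) = {prob_40_50:.4f}")
print(f"Continuous: P(X > 60) = {prob_above_60:.4f}")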

text

Discrete: E[X]=0.700, Var[X]=0.8100
Discrete: F(1) = 0.75
Continuous: P(40 <= X <= 50) = 0.4680
Continuous: P(X > 60) = 0.0304

Calculation Trace

Phase	Formula	Values	Result
E[X] (discrete)	$\sum k \cdot P (X = k)$	0(0.5)+1(0.25)+2(0.15)+3(0.05)	0.70
E[X²] (discrete)	$\sum k^{2} \cdot P (X = k)$	0+0.25+0.60+0.45	1.30
Var(X)	$E [X^{2}] - (E [X])^{2}$	1.30 − 0.49	0.81
F(1) (discrete CDF)	P(X≤1)	0.50+0.25	0.75
P(40≤X≤50)	F(50)−F(40)	0.734−0.266	0.468

PMF vs PDF vs CDF

Property	PMF (Discrete)	PDF (Continuous)	CDF (Both)
Definition	P(X = k)	density f(x)	P(X ≤ x)
Point probability	P(X = k) ≥ 0	P(X = x) = 0	Jump size F(x) − F(x⁻)
Total	∑ P(X=k) = 1	∫ f(x)dx = 1	F(∞) = 1
Shape	Bar chart	Smooth curve	Step / S-curve
Anchor	n_errors	response_time	Both

Three Common Misconceptions

"The PDF value is a probability." Wrong. f(45) ≈ 0.0499 for response_time — that is a density, not a probability. P(X = 45) = 0 exactly. Probabilities come from integrating f(x) over an interval.

"All random variables are normally distributed." Wrong. Error counts follow a discrete distribution (Poisson is common). Click rates follow Bernoulli/Binomial. Latency is often right-skewed or log-normal, not Normal. Assuming Normality without checking is a modeling error.

"Variance and standard deviation are the same thing." Wrong. Var(X) is in squared units (ms² for latency). SD = √Var(X) brings it back to original units (ms). Both carry information: variance is algebraically cleaner for calculations; SD is interpretable alongside the mean.

DS/ML Framing

A model's output score (its estimate of the probability of class 1) is a continuous RV on [0, 1]: it approximates P(Y=1|X=x) and varies with the random input x.
A class label is a discrete RV: takes value 0 or 1 in binary classification.
Batch loss is a continuous RV — the Central Limit Theorem explains why batch averages approximate Normal.
If you sample learning rate uniformly from [1e-5, 1e-2] in a hyperparameter search, learning rate is a continuous RV with a Uniform distribution.
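The learning-rate example from the list above can be sketched in a few lines. The seed and the sample size of 10,000 are arbitrary choices for illustration; the mean formula (a + b) / 2 is the standard expected value of a Uniform(a, b) variable.

```python
import random

rng = random.Random(42)               # seeded for repeatability
low, high = 1e-5, 1e-2

# Each draw is one realized value of the continuous RV "learning rate".
samples = [rng.uniform(low, high) for _ in range(10_000)]

sample_mean = sum(samples) / len(samples)
theoretical_mean = (low + high) / 2   # E[X] for Uniform(low, high)

print(f"sample mean      = {sample_mean:.6f}")
print(f"theoretical E[X] = {theoretical_mean:.6f}")
```

The sample mean lands close to the theoretical one but not exactly on it, which is the same gap between a distribution and its observations that runs through the rest of this post.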

The previous posts established variable types (nominal, ordinal, discrete, continuous). Random variables are the probability-theoretic formalization of those types. Once you have a random variable, you have a distribution — and from a distribution you get expected value, variance, and the CDFs used in hypothesis testing and confidence intervals. The next post on histograms shows the empirical counterpart: what the distribution of observed values actually looks like, as an approximation to the theoretical PDF.

When This Framework Breaks Down

E[X] and Var(X) are properties of the theoretical distribution, not guarantees about finite samples. From 20 batches, the empirical PMF is a rough estimate — a different 20 batches would give different probabilities. The distribution is only as stable as the process generating it: if the model degrades over time (model drift), the error-count distribution shifts and the E[X] computed from historical batches becomes invalid.
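To see how rough a 20-batch estimate is, you can simulate many alternative sets of 20 batches and watch the estimated P(X = 0) move around. This sketch assumes the post's working PMF is the true distribution and lets the value 4 stand in for the ≥4 bucket; both are simplifications for illustration, and `estimate_p0` is a hypothetical helper.

```python
import random

rng = random.Random(7)                    # seeded for repeatability
values = [0, 1, 2, 3, 4]                  # 4 stands in for the ">=4" bucket (assumption)
probs = [0.50, 0.25, 0.15, 0.05, 0.05]    # treated as the true PMF here

def estimate_p0(n_batches=20):
    """Draw n_batches error counts and return the observed share of zero-error batches."""
    draws = rng.choices(values, weights=probs, k=n_batches)
    return draws.count(0) / n_batches

# Repeat the 20-batch experiment 1000 times.
estimates = [estimate_p0() for _ in range(1000)]
mean_estimate = sum(estimates) / len(estimates)

print(f"P(X=0) estimates range from {min(estimates):.2f} to {max(estimates):.2f}")
print(f"average estimate: {mean_estimate:.3f}")
```

The average sits near the true 0.50, but individual 20-batch estimates spread widely around it, so a single sample of 20 is not enough to pin the probability down.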

Test Your Understanding

From n_errors = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0], compute F(2) — the probability that a randomly chosen batch has 2 or fewer errors.
For response_time ~ Normal(45, 8), compute P(X > 53) using the fact that 53 = 45 + 1σ. What property of the Normal distribution tells you this directly?
Why is P(X = 45.0) = 0 for a continuous random variable, even though you have observed latencies close to 45ms? What does this imply for how you should compute the probability of a specific latency window?
A monitoring system triggers an alert if P(n_errors ≥ 3) > 0.10. From the empirical PMF above, does the alert trigger? Show your calculation using the complement of the CDF.
You observe batch error counts over 100 days: most days have 0 or 1 errors, but occasionally 10 or more. Would a Poisson distribution or a Normal distribution better model this random variable? Explain why based on the properties of each distribution.
