Back to blog
← View series: statistics
Statistics & Probability

~/blog

Counting Principles

Jun 21, 202612 min readBy Mohammed Vasim
StatisticsMathData Science

For outcomes that are equally likely, probability is a fraction:

P(event) = number of favorable outcomes / total possible outcomes

Both the numerator and denominator require counting. If you cannot count the total outcomes, you cannot compute the probability. Counting principles are the engine that makes discrete probability computable.

Concrete question: If you randomly pick 3 hyperparameter configurations from a grid search of 9 configurations to evaluate first, how many possible sets of 3 could you choose? The answer is 84 — computed by combinations. Without that, P(chosen set includes the optimal config) has no denominator.

Anchor: A model evaluation setup with:

  • Learning rates: {0.001, 0.01, 0.1} — 3 choices
  • Batch sizes: {32, 64, 128} — 3 choices
  • 10 engineers who need to be assigned to review results

The Fundamental Counting Principle

If task A can be done in m ways and task B can be done in n ways independently, then both together can be done in m × n ways.

For the grid search anchor: 3 learning rates × 3 batch sizes = 9 total configurations.

Add a third hyperparameter (dropout: 2 choices): 3 × 3 × 2 = 18 configurations.

General form: |A₁ × A₂ × ... × Aₖ| = |A₁| × |A₂| × ... × |Aₖ|

The tree below shows why: each path from root to leaf is one unique combination of choices.

start

lr=0.001 lr=0.01 lr=0.1

0.001 0.01 0.1

32 64 128 32 64 128 32 64 128

(0.001,32) (0.001,64) (0.001,128) (0.01,32) (0.01,64) (0.01,128) (0.1,32) (0.1,64) (0.1,128)

3 learning rates × 3 batch sizes = 9 configurations Each path from root to leaf is one unique (lr, batch) pair


Factorials

Factorials are the building block for both permutations and combinations.

Definition: n! = n × (n−1) × (n−2) × ... × 2 × 1

Examples:

1! = 1 2! = 2 × 1 = 2 3! = 3 × 2 × 1 = 6 4! = 4 × 3 × 2 × 1 = 24 5! = 5 × 4 × 3 × 2 × 1 = 120 10! = 3,628,800

Why 0! = 1: The empty product convention — a product of zero terms equals the multiplicative identity, 1. More concretely, C(n, 0) = 1 (there is exactly one way to choose nothing), and the formula C(n,0) = n!/(0! × n!) = 1 requires 0! = 1.

Factorials grow so fast that even moderate n makes exhaustive enumeration infeasible:

Factorial Growth (log scale)

1 10² 10⁴ 10⁶

1! 2! 3! 4! 5! 6! 7! 8! 10!

Stirling's approximation — for large n where factorial computation overflows, the log-factorial is approximated as:

n! ≈ √(2πn) × (n/e)ⁿ ln(n!) ≈ n·ln(n) − n + ½·ln(2πn)

This appears in Bayesian statistics and MLE derivations where you need log-likelihoods involving large n (e.g., ln C(n, k) = ln(n!) - ln(k!) - ln((n-k)!)).


Permutations — Order Matters

How many ways can we arrange r items chosen from n distinct items, where the order of selection matters?

Deriving from first principles:

  • Position 1: n choices
  • Position 2: n−1 choices (one already used)
  • Position 3: n−2 choices
  • ...
  • Position r: n−r+1 choices

Total: n × (n−1) × ... × (n−r+1) = n! / (n−r)!

Formula: P(n, r) = n! / (n−r)!

Concrete ML example: Assign 3 different roles (lead engineer, reviewer, documenter) to 3 of 10 engineers. Each role is distinct — order matters.

P(10, 3) = 10! / (10−3)! = 10! / 7! = 10 × 9 × 8 = 720

720 different ordered assignments exist.

All permutations (r = n): P(n, n) = n!

Ordering 5 CV fold evaluation runs: P(5, 5) = 5! = 120 possible orderings.

When does order matter? Order matters when positions are distinguishable — first place vs second place vs third place are different outcomes. For the engineer assignment: "Alice leads, Bob reviews" is different from "Bob leads, Alice reviews."


Combinations — Order Does Not Matter

How many ways can we choose r items from n distinct items, when we only care which items are chosen, not their sequence?

Deriving from permutations: P(n, r) counts every ordering of each r-item group. Since we don't care about order, we divide by the number of ways to arrange the r chosen items: r!

Formula: C(n, r) = P(n, r) / r! = n! / (r! × (n−r)!)

Also written ⁿCᵣ or (n choose r).

Concrete ML example: Choosing 3 configurations from 9 to evaluate first (we care which 3, not in what order):

C(9, 3) = 9! / (3! × 6!) = (9 × 8 × 7) / (3 × 2 × 1) = 504 / 6 = 84

84 possible subsets of 3 configurations.

Feature selection example: Choosing 4 features from 10 for a feature subset:

C(10, 4) = 10! / (4! × 6!) = (10 × 9 × 8 × 7) / 24 = 5040 / 24 = 210

The SVG below shows why dividing by r! is correct — the same 3 items appear in all 6 orderings as a single combination:

Permutations (order matters) P(3,3) = 6 distinct sequences

A B C A C B B A C B C A C A B C B A

All 6 are different

÷ 3! = 6

Combination (order ignored) C(3,3) = 1 unique set

{A, B, C} same set regardless of order

All 6 orderings collapse to 1

Symmetry property: C(n, r) = C(n, n−r)

Choosing 3 from 9 is equivalent to choosing which 6 to leave out — same result:

C(9, 3) = 9! / (3! × 6!) = 84 C(9, 6) = 9! / (6! × 3!) = 84 ✓

Pascal's Triangle and Pascal's Identity

Pascal's identity: C(n, r) = C(n−1, r−1) + C(n−1, r)

Intuition: consider whether item n is included. If yes, choose r−1 from the remaining n−1. If no, choose r from n−1. The two cases are mutually exclusive and exhaustive.

Pascal's triangle puts this into visual form — each entry is the sum of the two entries above it, and the row-n entries are C(n, 0), C(n, 1), ..., C(n, n):

Row 0: 1 Row 1: 1 1 Row 2: 1 2 1 Row 3: 1 3 3 1 Row 4: 1 4 6 4 1

Row 4 reads: C(4,0)=1, C(4,1)=4, C(4,2)=6, C(4,3)=4, C(4,4)=1.

These are the binomial coefficients — they appear directly in the Binomial distribution formula.


Binomial Coefficient → Binomial Distribution

The payoff for learning combinations: C(n, k) counts the number of sequences with exactly k successes in n trials. Multiply by the probability of any one such sequence:

P(X = k) = C(n, k) × pᵏ × (1−p)ⁿ⁻ᵏ

Concrete example: What is the probability that exactly 4 out of 6 CV folds score above 0.85, assuming each fold has a 50% chance independently? (p=0.5 used for illustration — in practice you'd estimate p from data.)

C(6, 4) = 6! / (4! × 2!) = (6 × 5) / (2 × 1) = 15 P(X = 4) = 15 × 0.5⁴ × 0.5² = 15 × 0.0625 × 0.25 = 0.234

There are 15 ways to pick which 4 folds score above 0.85. Each such pattern has probability 0.5⁶ = 0.015625. Total: 15 × 0.015625 = 0.234.

Without C(6, 4) there is no way to count the 15 favorable sequences — the probability formula has no numerator.


Permutations with Repetition

When items can be reused — n choices at each of r positions:

Formula: n^r

  • Hex color codes (16 characters 0-9 and A-F, 6 positions): 16⁶ = 16,777,216
  • Hidden state sequences of length 5 from a vocabulary of 10 tokens: 10⁵ = 100,000
  • Password with 26 lowercase letters, length 8: 26⁸ ≈ 208 billion

Combinations with Repetition (Stars and Bars)

Choosing r items from n categories where repetition is allowed and order does not matter:

Formula: C(n + r − 1, r)

ML example: Allocating 10 attention heads across 3 transformer layers (a layer can have 0 heads):

C(3 + 10 − 1, 10) = C(12, 10) = C(12, 2) = (12 × 11) / 2 = 66

There are 66 distinct ways to distribute 10 identical heads across 3 layers. The "stars and bars" intuition: picture 10 stars (heads) separated by 2 bars (dividers between 3 layers) — count the arrangements of 12 symbols with 2 bars.


Combinatorial Explosion

The practical lesson for ML:

  • Feature selection: 30 features → 2³⁰ ≈ 1 billion possible subsets
  • Hyperparameter search: 5 parameters × 10 values each → 10⁵ = 100,000 configurations
  • Neural architecture search: even a modest 20-layer network with 5 choices per layer → 5²⁰ ≈ 95 trillion

This is why greedy forward selection, genetic algorithms, random search, and Bayesian optimization all exist — they navigate combinatorially large spaces without checking every point.

python
from math import comb, perm

n = 30
print(f"Subsets of {n} features (2^n):      {2**n:>20,}")
print(f"Permutations P({n}, 10):              {perm(n, 10):>20,}")
print(f"Combinations C({n}, 10):              {comb(n, 10):>20,}")
print()
# Grid search: 5 hyperparameters, 10 values each
configs = 10**5
print(f"Grid search (5 params × 10 values): {configs:>20,}")

# Binomial coefficient example
print(f"\nC(6, 4) = {comb(6, 4)}")
print(f"P(exactly 4 of 6 folds above 0.85, p=0.5) = {comb(6,4) * 0.5**6:.4f}")

# Combinations with repetition (stars and bars)
print(f"\nAllocate 10 heads across 3 layers: C(12, 10) = {comb(12, 10)}")
Subsets of 30 features (2^n): 1,073,741,824 Permutations P(30, 10): 109,027,350,432,000 Combinations C(30, 10): 30,045,015 Grid search (5 params × 10 values): 100,000 C(6, 4) = 15 P(exactly 4 of 6 folds above 0.85, p=0.5) = 0.2344 Allocate 10 heads across 3 layers: C(12, 10) = 66

Counting principles are the discrete foundation of probability theory. Once you can count favorable and total outcomes, you can compute exact probabilities — leading directly to the Binomial distribution (probability of k successes in n trials), the Hypergeometric distribution (sampling without replacement), and the Multinomial distribution (k categories in n trials). Permutations appear in ranking problems and ordered hypothesis testing. Combinations appear in cross-validation fold selection, feature subset enumeration, and network architecture counting. Stirling's approximation connects combinatorics to information theory — log-likelihoods involve sums of log-factorials.

When This Breaks Down

Counting formulas assume all items are distinguishable and selections are uniform (each outcome equally likely). If items are identical, overcounting occurs — the formula changes. If items have different probabilities of being selected, weighted counting is needed. For very large n, exact factorial computation overflows standard floating-point (10^309 for float64); use math.lgamma (log-gamma) or scipy.special.comb(n, k, exact=True) instead. Finally: counting gives you the denominator of a probability — it does not tell you whether the equally-likely assumption is valid.

Test Your Understanding

  1. A grid search has 4 hyperparameters: learning rate (5 options), batch size (3 options), dropout (4 options), optimizer (2 options). How many total configurations exist? If you randomly evaluate 1 configuration, what is the probability it has the optimal learning rate?

  2. You have 8 candidate models and need to choose 3 to present to stakeholders. How many possible presentation sets exist? If you also need to decide which of the 3 to present first, second, and third, how many ordered presentations exist?

  3. Derive the symmetry property C(n, r) = C(n, n−r) algebraically from the formula. Then show numerically for C(7, 2) and C(7, 5).

  4. Using Pascal's identity C(n, r) = C(n−1, r−1) + C(n−1, r), compute C(5, 2) without using the factorial formula. Show each step.

  5. A model predicts correctly on each of 10 independent test cases with probability p = 0.8. Compute P(X = 8) — the probability of exactly 8 correct predictions. What counting principle makes this calculation possible?

Comments (0)

No comments yet. Be the first to comment!

Leave a comment