Convolution Operation
The convolution operation is the core computation in every convolutional neural network. A convolutional layer slides a small matrix (the filter, also called the kernel) across a larger matrix (the input) and computes an element-wise product and sum at each position. That's it. Everything else (multiple filters, multiple channels, stacked layers) is repetition of this one operation.
Anchor: 5×5 input image, 3×3 kernel, stride=1, no padding.
Input (5×5):
 10  20  30  40  50
 60  70  80  90 100
110 120 130 140 150
160 170 180 190 200
210 220 230 240 250

Kernel (3×3):
1 0 -1
1 0 -1
1 0 -1

Phase 1 — Position the Filter
Place the 3×3 kernel over the top-left 3×3 patch of the input. Element-wise multiply each overlapping pair, then sum.
Position (0,0) — top-left corner:
Patch:           Kernel:       Product:
 10  20  30      1 0 -1         10  0  -30
 60  70  80   ×  1 0 -1    =    60  0  -80
110 120 130      1 0 -1        110  0 -130

Sum = 10 + 0 + (−30) + 60 + 0 + (−80) + 110 + 0 + (−130) = −60
Output[0,0] = −60
Phase 2 — Slide Across
Move the kernel one step to the right (stride=1). The new patch is columns 1–3.
Position (0,1):
Patch:           Kernel:       Product:
 20  30  40      1 0 -1         20  0  -40
 70  80  90   ×  1 0 -1    =    70  0  -90
120 130 140      1 0 -1        120  0 -140

Sum = 20 + (−40) + 70 + (−90) + 120 + (−140) = −60
Output[0,1] = −60
Position (1,0) — move down one row:
Patch:
 60  70  80
110 120 130
160 170 180

Sum = (60−80) + (110−130) + (160−180) = −20 + (−20) + (−20) = −60
Output[1,0] = −60
Position (1,1):
Patch:
 70  80  90
120 130 140
170 180 190

Sum = (70−90) + (120−140) + (170−190) = −20 + (−20) + (−20) = −60
Output[1,1] = −60
This kernel is a vertical edge detector — it detects left−right intensity differences. Since this particular input is a gradient from left to right (all values increase rightward), the kernel fires with −60 everywhere. A different image would show positive responses where left > right and negative where right > left.
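To make the sign behavior concrete, here is a minimal sketch that applies the same kernel to an image with a hard vertical edge. The image `step` and the helper `conv2d_valid` are made up for this illustration and are not part of the anchor example.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide k over x with stride 1 and no padding (illustrative helper)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

K = np.array([[1, 0, -1]] * 3, dtype=float)

# Hypothetical image: bright on the left (100), dark on the right (0).
step = np.array([[100, 100, 0, 0, 0]] * 5, dtype=float)

bright_to_dark = conv2d_valid(step, K)           # left > right near the edge
dark_to_bright = conv2d_valid(step[:, ::-1], K)  # mirrored image: right > left

print(bright_to_dark)
print(dark_to_bright)
```

Under these assumptions the first map is +300 wherever the patch straddles the edge and 0 in the flat region, and the mirrored image gives −300 at the edge. A flat region produces no response in either case.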
Phase 3 — Compute All 9 Output Values
Sliding the 3×3 kernel across a 5×5 input with stride=1 produces (5−3)/1 + 1 = 3 positions per dimension → 3×3 = 9 output values.
| Position | Left and right columns of the patch | Calculation (left − right) | Output |
|---|---|---|---|
| (0,0) | col0=[10,60,110], col2=[30,80,130] | (10−30)+(60−80)+(110−130) | −60 |
| (0,1) | col1=[20,70,120], col3=[40,90,140] | (20−40)+(70−90)+(120−140) | −60 |
| (0,2) | col2=[30,80,130], col4=[50,100,150] | (30−50)+(80−100)+(130−150) | −60 |
| (1,0) | col0=[60,110,160], col2=[80,130,180] | (60−80)+(110−130)+(160−180) | −60 |
| (1,1) | col1=[70,120,170], col3=[90,140,190] | (70−90)+(120−140)+(170−190) | −60 |
| (1,2) | col2=[80,130,180], col4=[100,150,200] | (80−100)+(130−150)+(180−200) | −60 |
| (2,0) | col0=[110,160,210], col2=[130,180,230] | (110−130)+(160−180)+(210−230) | −60 |
| (2,1) | col1=[120,170,220], col3=[140,190,240] | (120−140)+(170−190)+(220−240) | −60 |
| (2,2) | col2=[130,180,230], col4=[150,200,250] | (130−150)+(180−200)+(230−250) | −60 |
Full output feature map:
−60 −60 −60
−60 −60 −60
−60 −60 −60

All outputs are −60. This makes sense: the input is a ramp (values increase by 10 each column), so in every patch the left column is 20 less than the right column in each of the three rows, and the vertical edge kernel measures 3 × (−20) = −60 at every position.
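The nine values can also be reproduced in one shot instead of position by position. This is a sketch, assuming NumPy 1.20 or newer for `sliding_window_view`; the names `windows` and `feature_map` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# The anchor input: 10, 20, ..., 250 laid out row by row.
X = np.arange(10, 260, 10, dtype=float).reshape(5, 5)
K = np.array([[1, 0, -1]] * 3, dtype=float)

# windows[i, j] is the 3x3 patch whose top-left corner is at (i, j).
windows = sliding_window_view(X, K.shape)   # shape (3, 3, 3, 3)

# Multiply every patch by K element-wise and sum over the patch axes.
feature_map = np.einsum("ijab,ab->ij", windows, K)
print(feature_map)
```

The result should match the full output feature map above, with −60 in every cell.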
Output Size Formula
Given:
- Input: n×n
- Kernel: k×k
- Padding: p
- Stride: s
Output size = ⌊(n − k + 2p) / s⌋ + 1
For this anchor: ⌊(5 − 3 + 0) / 1⌋ + 1 = 2 + 1 = 3
Examples:
| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 5×5 | 3×3 | 0 | 1 | 3×3 |
| 5×5 | 3×3 | 1 | 1 | 5×5 |
| 6×6 | 3×3 | 0 | 1 | 4×4 |
| 224×224 | 7×7 | 3 | 2 | 112×112 |
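The formula is easy to check against the table. A small sketch follows; the function name `conv_output_size` is an assumption made for this example.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one dimension: floor((n - k + 2p) / s) + 1."""
    if n + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (n - k + 2 * p) // s + 1

# (input, kernel, padding, stride) for each row of the table above
rows = [(5, 3, 0, 1), (5, 3, 1, 1), (6, 3, 0, 1), (224, 7, 3, 2)]
sizes = [conv_output_size(n, k, p, s) for n, k, p, s in rows]
print(sizes)
```

Integer division (`//`) plays the role of the floor in the formula, which matters whenever the stride does not divide `n − k + 2p` evenly, as in the last row.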
Multiple Filters
One filter produces one feature map. A conv layer uses multiple filters to detect multiple features simultaneously.
8 filters applied to the 5×5 input → 8 separate 3×3 feature maps → stacked output: 3×3×8
Filter 1 might detect vertical edges, filter 2 horizontal edges, filter 3 diagonal edges, etc. The network learns which filters are useful during training.
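The stacking step can be sketched as follows. Only the first filter is the vertical edge kernel from the anchor; the other seven are random placeholders standing in for whatever a trained network would learn, and `conv2d_valid` is an illustrative helper.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide k over x with stride 1 and no padding (illustrative helper)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

X = np.arange(10, 260, 10, dtype=float).reshape(5, 5)  # the anchor input

rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 3, 3))   # 8 placeholder filters
filters[0] = [[1, 0, -1]] * 3              # filter 1: vertical edge kernel

# One 3x3 feature map per filter, stacked along a new last axis.
feature_maps = np.stack([conv2d_valid(X, f) for f in filters], axis=-1)
print(feature_maps.shape)
```

The output shape is 3×3×8: the spatial size depends only on input size, kernel size, padding and stride, while the depth equals the number of filters.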
Receptive field: each output value is computed from a k×k patch. After two conv layers with 3×3 kernels, each output value is influenced by a 5×5 patch of the original input. This grows with depth — in very deep networks, every output neuron eventually "sees" the entire input.
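Receptive field growth can be computed layer by layer. The sketch below uses the standard recurrence in which each layer adds (k − 1) times the current spacing between neighboring outputs, measured in input pixels. The function name and the `(kernel, stride)` layer format are assumptions for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    rf = 1     # a single input pixel sees only itself
    jump = 1   # spacing between adjacent outputs, in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))          # one 3x3 layer
print(receptive_field([(3, 1), (3, 1)]))  # two stacked 3x3 layers
print(receptive_field([(7, 2), (3, 1)]))  # a strided layer widens later growth
```

With stride 1 everywhere, each extra 3×3 layer adds 2 pixels per dimension; a stride greater than 1 in an earlier layer makes every later layer add more.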
Code
import numpy as np

# Anchor: 5×5 input and 3×3 vertical edge kernel
X = np.array([
    [ 10,  20,  30,  40,  50],
    [ 60,  70,  80,  90, 100],
    [110, 120, 130, 140, 150],
    [160, 170, 180, 190, 200],
    [210, 220, 230, 240, 250],
], dtype=float)

K = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

def conv2d(X, K, stride=1):
    """No-padding 2D convolution; assumes a square input and a square kernel."""
    n, k = X.shape[0], K.shape[0]
    out_size = (n - k) // stride + 1  # output size formula with p = 0
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Patch whose top-left corner sits at (i*stride, j*stride)
            patch = X[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * K)  # element-wise product, then sum
    return out

output = conv2d(X, K)
print("Input shape:", X.shape)
print("Kernel shape:", K.shape)
print(f"Output shape: {output.shape}")
print("\nOutput feature map:")
print(output)
print(f"\nOutput size formula: ({X.shape[0]} - {K.shape[0]}) / 1 + 1 = {output.shape[0]}")

# Manual check of position (0,0)
patch_00 = X[:3, :3]
val_00 = np.sum(patch_00 * K)
print(f"\nManual check position (0,0): {val_00}")

Output:
Input shape: (5, 5)
Kernel shape: (3, 3)
Output shape: (3, 3)

Output feature map:
[[-60. -60. -60.]
 [-60. -60. -60.]
 [-60. -60. -60.]]

Output size formula: (5 - 3) / 1 + 1 = 3

Manual check position (0,0): -60.0

Related Concepts
Where this builds from: Image representation (previous post) — the input is a tensor and the kernel is a small matrix. Understanding why pixels are numbers is prerequisite to understanding why element-wise products make sense.
Where this leads: Padding (next post) addresses the shrinking output size problem — every conv layer without padding reduces spatial dimensions. After padding, stride controls the rate of downsampling.
Honest Limitations
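Convolution applies the same filter at every position, which makes it translation-equivariant: shift the input and the feature map shifts by the same amount. This is often loosely called translation invariance, but true invariance only appears once pooling or a global summary discards position. Sharing one filter across positions suits textures and object parts, but it can fail when the meaning of a pattern depends on where it appears. A digit classifier is fine with discarding position; a model that must report where an object is (localization) has to keep the positional information in its feature maps rather than pool it away.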
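The "same filter at every position" property can be checked numerically. In this sketch a random image is shifted one pixel to the right, and away from the border the feature map shifts by exactly one pixel too. The image, the shift and the helper `conv2d_valid` are all illustrative assumptions.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide k over x with stride 1 and no padding (illustrative helper)."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
K = np.array([[1, 0, -1]] * 3, dtype=float)

# Shift the image one pixel to the right, filling the new left column with zeros.
shifted = np.zeros_like(img)
shifted[:, 1:] = img[:, :-1]

out = conv2d_valid(img, K)
out_shifted = conv2d_valid(shifted, K)

# Ignoring the first output column (which touches the zero fill),
# the shifted output is the original output moved one column right.
same_after_shift = np.allclose(out_shifted[:, 1:], out[:, :-1])
print(same_after_shift)
```

The response moves with the input rather than staying put, so position is preserved by convolution itself and only lost by later pooling or flattening steps.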
Larger kernels (5×5, 7×7) capture more context but are more expensive. A 5×5 kernel requires 25 multiplications per position vs 9 for a 3×3 kernel. Modern architectures prefer two stacked 3×3 conv layers (effective receptive field = 5×5) over one 5×5 layer: same receptive field, fewer parameters (2 × 9 = 18 weights vs 25 per input-output channel pair), and an extra nonlinearity.
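The parameter saving is simple arithmetic. The sketch below assumes the same channel count going in and out of every layer (64 is an arbitrary choice) and leaves biases out; `conv_weights` is a hypothetical helper.

```python
def conv_weights(k, c_in, c_out):
    """Weight count of one k x k conv layer, biases excluded."""
    return k * k * c_in * c_out

c = 64  # hypothetical channel count, same in and out

one_5x5 = conv_weights(5, c, c)
two_3x3 = 2 * conv_weights(3, c, c)

print(one_5x5, two_3x3, two_3x3 / one_5x5)
```

Under these assumptions the stacked pair uses 72% of the weights of the single 5×5 layer (18/25), and the ratio does not depend on the channel count as long as it is the same for all layers.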
Test Your Understanding
- Apply the kernel [[0,1,0],[0,1,0],[0,1,0]] to position (0,0) of the anchor 5×5 input. Show the patch, element-wise products, and sum. What does this kernel detect?
- The anchor input is a left-to-right ramp (values increase by 10 each column). What would the output look like if you applied the horizontal edge detector [[1,1,1],[0,0,0],[−1,−1,−1]] to the same input? Compute at least positions (0,0) and (1,0).
- A 32×32 image is processed by: Conv(3×3, stride=1, padding=0) → Conv(3×3, stride=1, padding=0) → Conv(3×3, stride=1, padding=0). What is the output spatial size after each layer? After 3 layers, how many positions in the original input does each position in the final feature map correspond to?
- Using the output size formula, determine the output size for a 224×224 input with a 7×7 kernel, stride=2, and padding=3. This is the configuration of the first conv layer in ResNet; verify that it produces a 112×112 output.
- A conv layer has 64 filters of size 3×3 applied to a single-channel input. How many learnable parameters does this layer have (include biases)? If the same input had 3 channels (RGB), how many parameters would the layer have? Show your calculation.