Convolution Operation

Jul 1, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The convolution operation is the core computation in every convolutional neural network. A convolutional layer slides a small matrix (the filter, also called the kernel) across a larger matrix (the input) and, at each position, computes an element-wise product followed by a sum. That's it. Everything else in a conv layer (multiple filters, multiple channels, stacked layers) is repetition of this one operation.

Anchor: 5×5 input image, 3×3 kernel, stride=1, no padding.

text
Input (5×5):             Kernel (3×3):
10  20  30  40  50      1   0  -1
60  70  80  90 100      1   0  -1
110 120 130 140 150      1   0  -1
160 170 180 190 200
210 220 230 240 250

Phase 1 — Position the Filter

Place the 3×3 kernel over the top-left 3×3 patch of the input. Element-wise multiply each overlapping pair, then sum.

Position (0,0) — top-left corner:

text
Patch:           Kernel:          Product:
10   20   30     1    0   -1      10    0   -30
60   70   80  ×  1    0   -1  =   60    0   -80
110  120  130    1    0   -1     110    0  -130

Sum = 10 + 0 + (−30) + 60 + 0 + (−80) + 110 + 0 + (−130) = −60

Output[0,0] = −60
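The single step above can be reproduced in a few lines of NumPy. This is only a sketch of the Phase 1 arithmetic; the variable names (`patch`, `kernel`, `product`, `output_00`) are chosen here for illustration.

```python
import numpy as np

# Top-left 3×3 patch of the anchor input and the vertical-edge kernel.
patch = np.array([
    [ 10,  20,  30],
    [ 60,  70,  80],
    [110, 120, 130],
], dtype=float)

kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

# `*` on NumPy arrays is element-wise, not a matrix product.
product = patch * kernel

# Collapse the nine products into the single output value.
output_00 = product.sum()

print(product)
print(output_00)  # -60.0
```

Note that `patch @ kernel` would compute a matrix product instead, which is not what a convolution step does.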

[Figure: Convolution with the filter at position (0,0). The yellow 3×3 patch of the 5×5 input is the active receptive field, blue marks the kernel weights, and the sum of products is the output value: Output[0,0] = −60.]

Phase 2 — Slide Across

Move the kernel one step to the right (stride=1). The new patch is columns 1–3.

Position (0,1):

text
Patch:           Kernel:          Product:
20   30   40     1    0   -1      20    0   -40
70   80   90  ×  1    0   -1  =   70    0   -90
120  130  140    1    0   -1     120    0  -140

Sum = 20 + (−40) + 70 + (−90) + 120 + (−140) = −60

Output[0,1] = −60

Position (1,0) — move down one row:

text
Patch:
60   70   80
110  120  130
160  170  180

Sum = (60−80) + (110−130) + (160−180) = −20 + (−20) + (−20) = −60

Output[1,0] = −60

Position (1,1):

text
Patch:
70   80   90
120  130  140
170  180  190

Sum = (70−90) + (120−140) + (170−190) = −20 + (−20) + (−20) = −60

Output[1,1] = −60
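Sliding comes down to index arithmetic: output position (i, j) reads the patch whose top-left corner sits at row i·stride, column j·stride. A minimal sketch, assuming the anchor input and a helper named `patch_at` (a name introduced here, not part of any library):

```python
import numpy as np

# The anchor input: 10, 20, ..., 250 laid out row by row in a 5×5 grid.
X = np.arange(10, 260, 10, dtype=float).reshape(5, 5)
K = np.array([[1, 0, -1]] * 3, dtype=float)

def patch_at(X, i, j, k=3, stride=1):
    """Return the k×k patch that produces output position (i, j)."""
    top, left = i * stride, j * stride
    return X[top:top + k, left:left + k]

# The four positions worked through by hand above.
values = {}
for i, j in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    values[(i, j)] = float(np.sum(patch_at(X, i, j) * K))
    print(f"Output[{i},{j}] = {values[(i, j)]}")
```

Each printed value should match the hand calculation of −60.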

This kernel is a vertical edge detector — it detects left−right intensity differences. Since this particular input is a gradient from left to right (all values increase rightward), the kernel fires with −60 everywhere. A different image would show positive responses where left > right and negative where right > left.
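To see the sign behaviour on something other than a ramp, here is a sketch using two made-up 3×6 images (`bright_left` and `bright_right` are hypothetical test inputs, not from the anchor example). Each contains one sharp vertical edge.

```python
import numpy as np

K = np.array([[1, 0, -1]] * 3, dtype=float)

def conv2d_valid(X, K):
    """Slide K over X with stride 1 and no padding."""
    k = K.shape[0]
    rows, cols = X.shape[0] - k + 1, X.shape[1] - k + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

# Hypothetical images: bright on the left half, then mirrored.
bright_left = np.array([[100, 100, 100, 0, 0, 0]] * 3, dtype=float)
bright_right = bright_left[:, ::-1]

resp_left = conv2d_valid(bright_left, K)
resp_right = conv2d_valid(bright_right, K)

print(resp_left)   # responses 0, 300, 300, 0: positive where left > right
print(resp_right)  # responses 0, -300, -300, 0: negative where right > left
```

The response is zero over flat regions and nonzero only where the window straddles the edge, which is what makes the kernel an edge detector rather than a brightness detector.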


Phase 3 — Compute All 9 Output Values

Sliding the 3×3 kernel across a 5×5 input with stride=1 produces (5−3)/1 + 1 = 3 positions per dimension → 3×3 = 9 output values.

| Position | Patch columns used (left, right) | Calculation | Output |
|---|---|---|---|
| (0,0) | col0=[10,60,110], col2=[30,80,130] | (10−30)+(60−80)+(110−130) | −60 |
| (0,1) | col1=[20,70,120], col3=[40,90,140] | (20−40)+(70−90)+(120−140) | −60 |
| (0,2) | col2=[30,80,130], col4=[50,100,150] | (30−50)+(80−100)+(130−150) | −60 |
| (1,0) | col0=[60,110,160], col2=[80,130,180] | (60−80)+(110−130)+(160−180) | −60 |
| (1,1) | col1=[70,120,170], col3=[90,140,190] | (70−90)+(120−140)+(170−190) | −60 |
| (1,2) | col2=[80,130,180], col4=[100,150,200] | (80−100)+(130−150)+(180−200) | −60 |
| (2,0) | col0=[110,160,210], col2=[130,180,230] | (110−130)+(160−180)+(210−230) | −60 |
| (2,1) | col1=[120,170,220], col3=[140,190,240] | (120−140)+(170−190)+(220−240) | −60 |
| (2,2) | col2=[130,180,230], col4=[150,200,250] | (130−150)+(180−200)+(230−250) | −60 |

Full output feature map:

text
−60  −60  −60
−60  −60  −60
−60  −60  −60

All outputs are −60. This makes sense: the input is a ramp (values increase by 10 each column) and the vertical edge kernel consistently measures the same left−right difference.
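The nine positions can also be computed in one shot without explicit loops. The sketch below assumes NumPy 1.20 or newer, where `numpy.lib.stride_tricks.sliding_window_view` is available; it is an alternative to the loop-based code later in this post, not a replacement for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(10, 260, 10, dtype=float).reshape(5, 5)  # the anchor input
K = np.array([[1, 0, -1]] * 3, dtype=float)

# windows[i, j] is the 3×3 patch at output position (i, j).
windows = sliding_window_view(X, K.shape)
print(windows.shape)  # (3, 3, 3, 3)

# Multiply every patch by K element-wise and sum over the patch axes.
feature_map = np.einsum("ijkl,kl->ij", windows, K)
print(feature_map)
```

The first two axes of `windows` index the output position and the last two index cells inside the patch, so the `einsum` sums over exactly the cells the kernel covers.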


Output Size Formula

Given:

  • Input: n×n
  • Kernel: k×k
  • Padding: p
  • Stride: s

Output size = ⌊(n − k + 2p) / s⌋ + 1

For this anchor: ⌊(5 − 3 + 0) / 1⌋ + 1 = 2 + 1 = 3

Examples:

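| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 5×5 | 3×3 | 0 | 1 | 3×3 |
| 5×5 | 3×3 | 1 | 1 | 5×5 |
| 6×6 | 3×3 | 0 | 1 | 4×4 |
| 224×224 | 7×7 | 3 | 2 | 112×112 |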
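The formula translates directly into integer arithmetic. A small sketch, with `conv_output_size` as an illustrative helper name, checked against the rows of the table above:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output side length for an n×n input, k×k kernel, padding p, stride s."""
    if k > n + 2 * p:
        raise ValueError("kernel is larger than the padded input")
    # Integer floor division implements the floor in the formula.
    return (n - k + 2 * p) // s + 1

# (input, kernel, padding, stride) for each row of the examples table.
examples = [(5, 3, 0, 1), (5, 3, 1, 1), (6, 3, 0, 1), (224, 7, 3, 2)]
sizes = [conv_output_size(n, k, p, s) for n, k, p, s in examples]
print(sizes)  # [3, 5, 4, 112]
```

The last row is the only one where the floor matters: (224 − 7 + 6) / 2 = 111.5, which floors to 111 before the +1.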

Multiple Filters

One filter produces one feature map. A conv layer uses multiple filters to detect multiple features simultaneously.

8 filters applied to the 5×5 input → 8 separate 3×3 feature maps → stacked output: 3×3×8

Filter 1 might detect vertical edges, filter 2 horizontal edges, filter 3 diagonal edges, etc. The network learns which filters are useful during training.
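A minimal sketch of a bank of 8 filters applied to the anchor input. The first two filters are the vertical and horizontal edge detectors; the remaining six are random stand-ins for learned filters, which is purely an assumption for illustration. The names `conv_layer` and `filters` are likewise chosen here.

```python
import numpy as np

X = np.arange(10, 260, 10, dtype=float).reshape(5, 5)  # the anchor input

vertical = np.array([[1, 0, -1]] * 3, dtype=float)
horizontal = vertical.T  # [[1,1,1],[0,0,0],[-1,-1,-1]]

# Six random 3×3 filters standing in for whatever training would learn.
rng = np.random.default_rng(0)
random_filters = rng.standard_normal((6, 3, 3))

filters = np.concatenate([np.stack([vertical, horizontal]), random_filters])

def conv_layer(X, filters):
    """Apply every filter to X (stride 1, no padding); stack results on the last axis."""
    num_filters, k, _ = filters.shape
    size = X.shape[0] - k + 1
    out = np.zeros((size, size, num_filters))
    for i in range(size):
        for j in range(size):
            patch = X[i:i + k, j:j + k]
            # Broadcasting multiplies the patch by all filters at once.
            out[i, j, :] = np.sum(filters * patch, axis=(1, 2))
    return out

feature_maps = conv_layer(X, filters)
print(feature_maps.shape)     # (3, 3, 8)
print(feature_maps[..., 0])   # vertical-edge map: all -60
print(feature_maps[..., 1])   # horizontal-edge map: all -300
```

The horizontal map is constant as well because each row of the anchor input is exactly 50 larger than the row above it, so top row minus bottom row of any patch is 3 × (−100).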

Receptive field: each output value is computed from a k×k patch. After two conv layers with 3×3 kernels, each output value is influenced by a 5×5 patch of the original input. This grows with depth — in very deep networks, every output neuron eventually "sees" the entire input.
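For stride-1 layers with k×k kernels, each extra layer widens the receptive field by k − 1, so L layers see a patch of side 1 + L·(k − 1). The sketch below states that formula and then checks the two-layer case empirically with an all-ones kernel (chosen so that no contributions cancel). The helper names are illustrative.

```python
import numpy as np

def receptive_field(num_layers, k=3):
    """Receptive field side length for stacked k×k convs with stride 1."""
    return 1 + num_layers * (k - 1)

def conv2d_valid(X, K):
    """Slide K over X with stride 1 and no padding."""
    k = K.shape[0]
    size = X.shape[0] - k + 1
    out = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

print([receptive_field(n) for n in (1, 2, 3)])  # [3, 5, 7]

# Empirical check: two 3×3 layers reduce a 5×5 input to a single value,
# and changing a far-corner pixel changes that value.
ones = np.ones((3, 3))
X = np.zeros((5, 5))
base = conv2d_valid(conv2d_valid(X, ones), ones)

X_bumped = X.copy()
X_bumped[4, 4] = 1.0
bumped = conv2d_valid(conv2d_valid(X_bumped, ones), ones)

print(base[0, 0], bumped[0, 0])  # 0.0 1.0
```

The formula only covers stride 1 without dilation; strided or dilated layers grow the receptive field faster.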

[Figure: Multiple filters → stacked feature maps. A 5×5×1 input passes through 8 filters (3×3 each), giving feature maps FM 1, FM 2, FM 3, FM 4, ... stacked into a 3×3×8 output. Each feature map is one filter's response to the full input. Filter 1 detects vertical edges ([[1,0,-1],[1,0,-1],[1,0,-1]]); filter 2 detects horizontal edges ([[1,1,1],[0,0,0],[-1,-1,-1]]). Each filter is one learned feature detector; n filters produce n feature maps stacked along the depth axis.]

Code

python
import numpy as np

# Anchor
X = np.array([
    [ 10,  20,  30,  40,  50],
    [ 60,  70,  80,  90, 100],
    [110, 120, 130, 140, 150],
    [160, 170, 180, 190, 200],
    [210, 220, 230, 240, 250],
], dtype=float)

K = np.array([
    [ 1,  0, -1],
    [ 1,  0, -1],
    [ 1,  0, -1],
], dtype=float)

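def conv2d(X, K, stride=1):
    """Sliding-window product-and-sum with no padding.

    Assumes a square input X (n×n) and a square kernel K (k×k).
    """
    n, k = X.shape[0], K.shape[0]
    out_size = (n - k) // stride + 1  # output size formula with p = 0
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # The patch's top-left corner moves by `stride` per output step
            patch = X[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * K)  # element-wise product, then sum
    return out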

output = conv2d(X, K)
print("Input shape:", X.shape)
print("Kernel shape:", K.shape)
print(f"Output shape: {output.shape}")
print("\nOutput feature map:")
print(output)
print(f"\nOutput size formula: ({X.shape[0]} - {K.shape[0]}) / 1 + 1 = {output.shape[0]}")

# Manual position (0,0)
patch_00 = X[:3, :3]
val_00 = np.sum(patch_00 * K)
print(f"\nManual check position (0,0): {val_00}")
text
Input shape: (5, 5)
Kernel shape: (3, 3)
Output shape: (3, 3)

Output feature map:
[[-60. -60. -60.]
 [-60. -60. -60.]
 [-60. -60. -60.]]

Output size formula: (5 - 3) / 1 + 1 = 3

Manual check position (0,0): -60.0

Where this builds from: Image representation (previous post) — the input is a tensor and the kernel is a small matrix. Understanding why pixels are numbers is prerequisite to understanding why element-wise products make sense.

Where this leads: Padding (next post) addresses the shrinking output size problem — every conv layer without padding reduces spatial dimensions. After padding, stride controls the rate of downsampling.


Honest Limitations

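The convolution operation shares one set of weights across the whole image: the same filter is applied at every position, so a pattern produces the same response wherever it appears, only at a correspondingly shifted location in the feature map. Strictly speaking this is translation equivariance; the translation invariance usually attributed to CNNs comes from combining it with pooling or other aggregation. The assumption is appropriate for textures and object parts but can fail for problems where absolute position matters. A digit classifier is fine with it; a model whose answer depends on where in the image a pattern sits cannot rely on weight sharing alone to encode that position.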
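The "same filter at every position" property can be checked directly: shifting the input by one column shifts the feature map by one column and changes nothing else. The sketch below uses a made-up 8×8 image with a small blob kept away from the borders, so that `np.roll` acts as a plain shift with nothing wrapping around.

```python
import numpy as np

K = np.array([[1, 0, -1]] * 3, dtype=float)

def conv2d_valid(X, K):
    """Slide K over X with stride 1 and no padding."""
    k = K.shape[0]
    rows, cols = X.shape[0] - k + 1, X.shape[1] - k + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

# Hypothetical input: a 2×2 bright blob on a dark background.
image = np.zeros((8, 8))
image[3:5, 2:4] = 1.0

# Move the blob one column to the right.
shifted = np.roll(image, shift=1, axis=1)

out = conv2d_valid(image, K)
out_shifted = conv2d_valid(shifted, K)

# The response pattern is identical, just moved one column right.
same_up_to_shift = np.array_equal(out_shifted, np.roll(out, shift=1, axis=1))
print(same_up_to_shift)  # True
```

The output still records where the blob is. Position information is only discarded once the feature map is pooled or summed over space.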

Larger kernels (5×5, 7×7) capture more context but are more expensive. A 5×5 kernel requires 25 multiplications per position vs 9 for a 3×3 kernel. Modern architectures prefer two stacked 3×3 conv layers (effective receptive field = 5×5) over one 5×5 layer — same receptive field, fewer parameters, and an extra nonlinearity.
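The parameter saving is easy to quantify. A rough sketch, assuming a hypothetical layer with 64 input and 64 output channels and ignoring biases (`conv_weights` is an illustrative helper):

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k×k conv layer, biases excluded."""
    return k * k * c_in * c_out

c = 64  # hypothetical channel count, same in and out

one_5x5 = conv_weights(5, c, c)      # a single 5×5 layer
two_3x3 = 2 * conv_weights(3, c, c)  # two stacked 3×3 layers

print(one_5x5)            # 102400
print(two_3x3)            # 73728
print(two_3x3 / one_5x5)  # 0.72
```

The ratio 18/25 = 0.72 holds for any channel count as long as both designs use the same width, since the channel factors cancel.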


Test Your Understanding

  1. Apply the kernel [[0,1,0],[0,1,0],[0,1,0]] to position (0,0) of the anchor 5×5 input. Show the patch, element-wise products, and sum. What does this kernel detect?

  2. The anchor input is a left-to-right ramp (values increase by 10 each column). What would the output look like if you applied the horizontal edge detector [[1,1,1],[0,0,0],[−1,−1,−1]] to the same input? Compute at least position (0,0) and (1,0).

  3. A 32×32 image is processed by: Conv(3×3, stride=1, padding=0) → Conv(3×3, stride=1, padding=0) → Conv(3×3, stride=1, padding=0). What is the output spatial size after each layer? After 3 layers, how many positions of the original input does each position in the final feature map depend on (that is, what is its receptive field)?

  4. Using the output size formula, determine: for a 224×224 input with a 7×7 kernel, stride=2, and padding=3, what is the output size? This is exactly the first conv layer in ResNet — verify that it produces 112×112 output.

  5. A conv layer has 64 filters of size 3×3 applied to a single-channel input. How many learnable parameters does this layer have (include biases)? If the same input had 3 channels (RGB), how many parameters would the layer have? Show your calculation.
