Padding in CNN

Jul 1, 2026 · 8 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

Every convolution layer without padding shrinks the spatial dimensions. A 5×5 input with a 3×3 kernel becomes 3×3. Apply another 3×3 conv to the 3×3 output and it becomes 1×1. Two conv layers with no padding turn a 5×5 input into a single value. The image is gone before deep features can be learned.

Padding is the solution. It adds border pixels (usually zeros) around the input before convolution, giving you control over whether the output grows, stays the same, or shrinks.

Anchor: 5×5 input, 3×3 kernel, stride=1.


The Shrinking Problem

Without padding, spatial dimensions shrink at every conv layer:

| Layer | Input size | Kernel | Output size |
|-------|------------|--------|-------------|
| 1     | 5×5        | 3×3    | 3×3         |
| 2     | 3×3        | 3×3    | 1×1         |

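After just 2 conv layers, the 5×5 image has been compressed to a single value. Larger inputs last longer but still erode: each unpadded 3×3 layer removes 2 pixels per dimension, so a 50-layer stack would shrink a 224×224 image to 124×124, and any input smaller than 101×101 would vanish before the last layer.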
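The shrinkage is easy to trace in code. The sketch below is a minimal illustration, assuming square inputs, stride 1, and no padding; `valid_sizes` is a hypothetical helper name, not a library function.

```python
def valid_sizes(n, k, num_layers):
    """Spatial size after each stride-1, unpadded k×k conv layer (hypothetical helper)."""
    sizes = []
    for _ in range(num_layers):
        n = n - k + 1  # (n - k) / 1 + 1 with p = 0
        if n < 1:
            break  # the kernel no longer fits, so nothing is left to convolve
        sizes.append(n)
    return sizes

print(valid_sizes(5, 3, 5))         # anchor input: [3, 1]
print(valid_sizes(224, 3, 50)[-1])  # 224 - 2*50 = 124
```

The anchor input runs out after two layers even though five were requested, while a 224×224 input survives 50 layers but loses 100 pixels per dimension.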

Additionally, corner pixels appear in only 1 position (when the kernel aligns exactly over them), while center pixels appear in up to k² positions. Without padding, edge information contributes far less to the output than center information.


Valid Padding (No Padding)

No zeros added. Output shrinks.

Output size = (5 − 3) / 1 + 1 = 3×3

The 5×5 input is used as-is. The kernel is positioned at 3×3 = 9 locations. Corner pixel (0,0) of the input appears in exactly 1 convolution computation. Center pixel (2,2) appears in 9 computations (the kernel reaches it from all 9 positions).

This asymmetry means the model learns less about edge regions — edge pixels contribute fewer times to any output value.
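The coverage asymmetry can be counted directly. This is a small sketch, assuming stride 1 and a square input; `coverage_counts` is a name introduced here for illustration.

```python
import numpy as np

def coverage_counts(n, k):
    """Count how many stride-1 kernel windows touch each pixel of an n×n input."""
    counts = np.zeros((n, n), dtype=int)
    out = n - k + 1  # kernel positions per dimension with valid padding
    for i in range(out):
        for j in range(out):
            counts[i:i + k, j:j + k] += 1  # every pixel under this window is used once
    return counts

counts = coverage_counts(5, 3)
print(counts)
print("corner:", counts[0, 0], "center:", counts[2, 2])  # corner: 1 center: 9
```

For the anchor case the counts fall off from 9 at the center to 1 at each corner, which is the imbalance described above.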

Use valid padding when you intentionally want to reduce spatial dimensions — for example, before a pooling layer when you want both operations to downsample together.


Same Padding

Adds p = (k − 1) / 2 zeros on each side (for odd k and stride 1). For k=3: p = (3−1)/2 = 1 zero per side.

The padded input is (5 + 2×1) × (5 + 2×1) = 7×7.

The convolution on the 7×7 padded input: (7 − 3) / 1 + 1 = 5×5 — same as the original input.
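The same arithmetic can be checked for several kernel sizes. This is a minimal sketch, assuming odd kernels and stride 1; `same_padding` is a hypothetical helper, and an even kernel would need unequal padding on the two sides, which this sketch rejects rather than handles.

```python
def same_padding(k):
    """Per-side zero padding that keeps the size at stride 1 (hypothetical helper, odd k only)."""
    if k % 2 == 0:
        raise ValueError("symmetric same padding needs an odd kernel size")
    return (k - 1) // 2

n = 5  # anchor input size
results = []
for k in (1, 3, 5, 7):
    p = same_padding(k)
    out = (n + 2 * p - k) + 1  # (padded size - k) / 1 + 1
    results.append((k, p, out))
    print(f"k={k}: p={p}, padded={n + 2 * p}, output={out}")
```

Every row reports an output of 5, so the padded size grows with the kernel while the output stays at the input size.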

Here is the anchor 5×5 input with 1-pixel zero padding applied (producing 7×7):

```text
Zero-padded 7×7 input:
  0    0    0    0    0    0    0
  0   10   20   30   40   50    0
  0   60   70   80   90  100    0
  0  110  120  130  140  150    0
  0  160  170  180  190  200    0
  0  210  220  230  240  250    0
  0    0    0    0    0    0    0
```

Row 0 and row 6 are all zeros. Column 0 and column 6 are all zeros. The original data occupies rows 1–5, columns 1–5.

With same padding, the kernel can now be centered on every pixel of the original input, including corner pixels. Output[0,0] corresponds to the kernel centered on input pixel (0,0), which now has zeros around it.
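To make the corner case concrete, the sketch below computes Output[0,0] by hand on the anchor input. It assumes the vertical-edge kernel from this post's code section; any 3×3 kernel would do.

```python
import numpy as np

# Anchor 5×5 input (10, 20, ..., 250) and a vertical-edge kernel used for illustration.
X = (np.arange(25).reshape(5, 5) + 1) * 10.0
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]], dtype=float)

X_padded = np.pad(X, 1, mode="constant", constant_values=0)

# The window for Output[0,0] is the top-left 3×3 patch of the padded input.
# Its center, padded pixel (1,1), is input pixel (0,0).
window = X_padded[0:3, 0:3]
print(window)

out_00 = np.sum(window * K)
print(out_00)  # -(20 + 70) = -90.0
```

Five of the nine values in this window are padding zeros, so the corner output is computed from only four real pixels.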

[Figure: Valid vs Same Padding on the anchor 5×5 input. Valid (no padding): the 5×5 input gives a 3×3 output (shrinks). Same padding (p=1): the 7×7 padded input gives a 5×5 output (preserved). The green border marks the added zeros.]

Other Padding Types

Reflect padding: instead of zeros, mirrors the input values at the border. Pixel (0,1) of the padded input gets the value of input pixel (1,0), the pixel one row inside the edge. This avoids the zero-value bias at edges because the border statistics match the nearby image content.

Replicate (clamp) padding: repeats the edge pixel value. The leftmost column of padding gets the same values as the input's leftmost column. Useful in style transfer and image generation tasks.
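Both modes are available in NumPy's `np.pad`, where replicate padding is called `edge`. The sketch below applies them to the anchor input and prints a few border rows and columns.

```python
import numpy as np

X = (np.arange(25).reshape(5, 5) + 1) * 10  # anchor values 10, 20, ..., 250

reflect = np.pad(X, 1, mode="reflect")   # mirror about the edge pixel, edge not repeated
replicate = np.pad(X, 1, mode="edge")    # repeat the edge pixel

print(reflect[0])       # [ 70  60  70  80  90 100  90]  mirrors input row 1
print(replicate[0])     # [10 10 20 30 40 50 50]         copies input row 0
print(replicate[:, 0])  # [ 10  10  60 110 160 210 210]  copies input column 0
```

In the reflect result, padded pixel (0,1) holds 60, which is input pixel (1,0).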


Summary Table

| Padding | p value | Output size (5×5 input, 3×3 kernel) | Use case |
|---------|---------|--------------------------------------|----------|
| Valid   | 0       | 3×3                                  | Final conv before pooling (want reduction) |
| Same    | 1       | 5×5                                  | Intermediate layers (preserve spatial dims) |
| Full    | 2       | 7×7                                  | Signal processing (all overlapping positions) |
| Reflect | 1       | 5×5                                  | Style transfer, image generation (border fidelity) |

Typical CNN pattern: same padding throughout the convolutional stack → pooling (or strided conv) to downsample. This way, the spatial dimensions only shrink when you want them to (at pooling layers), not continuously through every conv layer.
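The three zero-padding rows of the table can be reproduced with one function. This is a sketch under stated assumptions: a single channel, stride 1, and a box-filter kernel picked only for illustration; `conv2d_padded` is a hypothetical helper built on NumPy's `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_padded(X, K, p):
    """Stride-1 cross-correlation after zero-padding X by p on every side."""
    Xp = np.pad(X, p, mode="constant", constant_values=0)
    windows = sliding_window_view(Xp, K.shape)  # shape: (out, out, k, k)
    return np.einsum("ijkl,kl->ij", windows, K)

X = (np.arange(25).reshape(5, 5) + 1) * 10.0  # anchor input
K = np.ones((3, 3)) / 9.0                     # box filter, chosen only for illustration

shapes = {name: conv2d_padded(X, K, p).shape
          for name, p in [("valid", 0), ("same", 1), ("full", 2)]}
print(shapes)  # {'valid': (3, 3), 'same': (5, 5), 'full': (7, 7)}
```

Only the padding amount changes between the three modes; the convolution itself is identical.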


Code

```python
import numpy as np

def pad_same(X, k):
    """Zero-pad X so a stride-1 k×k conv keeps the spatial size (odd k)."""
    p = (k - 1) // 2
    return np.pad(X, p, mode='constant', constant_values=0)

def conv2d(X, K, stride=1):
    """Valid cross-correlation of a square input X with a square kernel K."""
    n, k = X.shape[0], K.shape[0]
    out = (n - k) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Elementwise product of the k×k window with the kernel, then sum
            result[i, j] = np.sum(X[i*stride:i*stride+k, j*stride:j*stride+k] * K)
    return result

X = (np.arange(25).reshape(5, 5) + 1) * 10.0  # anchor input: 10, 20, ..., 250
K = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)  # vertical-edge kernel

X_padded = pad_same(X, k=3)
print("Original (5×5):")
print(X.astype(int))
print("\nSame-padded (7×7, p=1):")
print(X_padded.astype(int))
print(f"\nOutput size (valid): {(5-3)//1+1}×{(5-3)//1+1}")
print(f"Output size (same):  {(5-3+2)//1+1}×{(5-3+2)//1+1}")

out_valid = conv2d(X, K)
out_same  = conv2d(X_padded, K)
print(f"\nValid conv output shape: {out_valid.shape}")
print(f"Same conv output shape:  {out_same.shape}")
```

```text
Original (5×5):
[[ 10  20  30  40  50]
 [ 60  70  80  90 100]
 [110 120 130 140 150]
 [160 170 180 190 200]
 [210 220 230 240 250]]

Same-padded (7×7, p=1):
[[  0   0   0   0   0   0   0]
 [  0  10  20  30  40  50   0]
 [  0  60  70  80  90 100   0]
 [  0 110 120 130 140 150   0]
 [  0 160 170 180 190 200   0]
 [  0 210 220 230 240 250   0]
 [  0   0   0   0   0   0   0]]

Output size (valid): 3×3
Output size (same):  5×5

Valid conv output shape: (3, 3)
Same conv output shape:  (5, 5)
```

Convolving the 7×7 same-padded input produces a 5×5 output, matching the original input's spatial dimensions exactly.


Where this builds from: The convolution output size formula (previous post) — (n − k + 2p) / s + 1 — is where p=0 gives valid and p=(k−1)/2 gives same.

Where this leads: Stride works together with padding to control spatial downsampling. With same padding and stride=2, the output is half the input size — a common pattern used instead of pooling in modern architectures like ResNet.
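The halving can be checked with the output size formula. The sketch below assumes floor division when the stride does not divide evenly, which the formula as written above does not spell out; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, k, p=0, s=1):
    """floor((n - k + 2p) / s) + 1, with floor division assumed for uneven strides."""
    return (n - k + 2 * p) // s + 1

k, p = 3, 1  # 3×3 kernel with same padding
for n in (6, 32, 224):
    print(n, "->", conv_output_size(n, k, p, s=2))  # 3, 16, 112
```

For even input sizes the output is exactly half. For an odd size such as the 5×5 anchor, the result rounds up to 3.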


Honest Limitations

Zero padding introduces artificial border values that can affect features near edges. The CNN sees zeros at the border that don't represent real image content. This can cause artifacts in feature detection near image boundaries — features "near" the zero border differ from the same features in the image center. In practice, this effect is small for large images but noticeable for small inputs or shallow networks.
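A constant image makes the border artifact visible. The sketch below is an illustration under simple assumptions (single channel, stride 1, a vertical-edge kernel); `correlate_valid` is a hypothetical helper. The image has no edges, so any nonzero response must come from the padding.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(X, K):
    """Stride-1 cross-correlation with no padding."""
    return np.einsum("ijkl,kl->ij", sliding_window_view(X, K.shape), K)

X = np.ones((5, 5))                          # constant image: no real edges
K = np.array([[1, 0, -1]] * 3, dtype=float)  # vertical-edge kernel

out_zero = correlate_valid(np.pad(X, 1, mode="constant"), K)
out_reflect = correlate_valid(np.pad(X, 1, mode="reflect"), K)

print(out_zero)     # nonzero only in the first and last columns
print(out_reflect)  # all zeros
```

With zero padding the kernel reports a vertical edge along the left and right borders where none exists. With reflect padding the padded image is still constant, so the response is zero everywhere.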

Same padding increases computation vs valid. The 7×7 padded input requires 25 kernel positions (5×5) vs 9 (3×3) for valid. On large images the per-layer gap is small (224×224 = 50,176 positions vs 222×222 = 49,284, about 2% more), but it compounds with depth because valid feature maps keep shrinking while same-padded ones do not. Some architectures (especially on edge devices) use valid padding with carefully planned output sizes to reduce cost.
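A rough count of multiply-accumulates shows how the gap grows with depth. This is a back-of-the-envelope sketch that assumes a single channel, stride 1, and no pooling; `stack_macs` is a hypothetical helper, and real networks multiply these numbers by their channel counts.

```python
def stack_macs(n, k, num_layers, same):
    """Multiply-accumulates for a stack of single-channel k×k stride-1 convs."""
    total = 0
    for _ in range(num_layers):
        out = n if same else n - k + 1  # output positions per dimension
        total += out * out * k * k      # k*k multiplies per output position
        n = out
    return total

print(stack_macs(5, 3, 1, same=False), stack_macs(5, 3, 1, same=True))    # 81 225
print(stack_macs(28, 3, 10, same=False), stack_macs(28, 3, 10, same=True))  # 28980 70560
```

For ten 3×3 layers on a 28×28 input, the same-padded stack does roughly 2.4 times the work of the valid stack under these assumptions.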


Test Your Understanding

  1. A network applies 10 consecutive 3×3 conv layers with valid padding to a 28×28 input (no pooling). What is the spatial size of the output after each layer? After 10 layers, does any spatial information remain? What padding would you use to prevent this?

  2. For a 7×7 kernel with same padding, compute p. What is the padded input size for a 6×6 original? Verify that the padded convolution produces a 6×6 output using the output size formula.

  3. You apply same padding then stride=2. For a 6×6 input with a 3×3 kernel, compute: (a) the padding amount, (b) the padded input size, (c) the output size with stride=2. Compare to applying valid padding with stride=2. Which is used in ResNet's first layer, and why?

  4. Zero padding adds artificial values that do not come from the image itself. Propose an alternative that doesn't introduce artificial values. How would you implement this in NumPy? What are its disadvantages vs zero padding?

  5. A CNN trained on 224×224 images with same padding throughout is tested on 112×112 images. Would it work? Would the output size change? What if the same model were tested on 448×448 images? How does this scale differently for valid vs same padding?
