Every convolution layer without padding shrinks the spatial dimensions. A 5×5 input with a 3×3 kernel becomes 3×3. Apply another 3×3 conv to that 3×3 output and it becomes 1×1. Two conv layers with no padding reduce a 5×5 input to a single value. The image is gone before deep features can be learned.
Padding is the solution. It adds border pixels (usually zeros) around the input before convolution, giving you control over whether the output grows, stays the same, or shrinks.
Anchor: 5×5 input, 3×3 kernel, stride=1.
The Shrinking Problem
Without padding, spatial dimensions shrink at every conv layer:
| Layer | Input size | Kernel | Output size |
|---|---|---|---|
| 1 | 5×5 | 3×3 | 3×3 |
| 2 | 3×3 | 3×3 | 1×1 |
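After just 2 conv layers, the 5×5 image has been compressed to a single value, and a third 3×3 kernel no longer fits. Real inputs are larger, but the loss per layer is the same: each 3×3 valid conv removes 2 pixels per dimension, so a 50-layer stack would trim 100 pixels from each dimension of the input.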
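The shrinkage in the table can be traced with a few lines of Python. This is a minimal sketch; the helper names `valid_output_size` and `trace_valid_stack` are illustrative and not from any library, and stride=1 with a square input is assumed.

```python
def valid_output_size(n, k, stride=1):
    """Spatial size after one unpadded (valid) convolution."""
    return (n - k) // stride + 1

def trace_valid_stack(n, k, max_layers):
    """List the size after each valid conv layer, stopping once the kernel no longer fits."""
    sizes = [n]
    for _ in range(max_layers):
        if sizes[-1] < k:
            break
        sizes.append(valid_output_size(sizes[-1], k))
    return sizes

print(trace_valid_stack(5, 3, 50))        # [5, 3, 1]
print(trace_valid_stack(224, 3, 50)[-1])  # 124
```

The 5×5 anchor is exhausted after two layers, while a 224×224 input survives 50 layers but loses 100 pixels per dimension along the way.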
Additionally, corner pixels appear in only 1 position (when the kernel aligns exactly over them), while center pixels appear in up to k² positions. Without padding, edge information contributes far less to the output than center information.
Valid Padding (No Padding)
No zeros added. Output shrinks.
Output size per side = (5 − 3) / 1 + 1 = 3, so the output is 3×3.
The 5×5 input is used as-is. The kernel is positioned at 3×3 = 9 locations. Corner pixel (0,0) of the input appears in exactly 1 convolution computation. Center pixel (2,2) appears in 9 computations (the kernel reaches it from all 9 positions).
This asymmetry means the model learns less about edge regions — edge pixels contribute fewer times to any output value.
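The coverage asymmetry can be checked by counting, for each input pixel, how many kernel windows include it. The sketch below assumes the anchor setup (5×5 input, 3×3 kernel, stride=1); `coverage_counts` is a hypothetical helper written for this illustration.

```python
import numpy as np

def coverage_counts(n, k, stride=1):
    """Count how many kernel windows of a valid conv include each input pixel."""
    out = (n - k) // stride + 1
    counts = np.zeros((n, n), dtype=int)
    for i in range(out):
        for j in range(out):
            # Every pixel under this window position is used once more
            counts[i*stride:i*stride+k, j*stride:j*stride+k] += 1
    return counts

counts = coverage_counts(5, 3)
print(counts)
print("corner:", counts[0, 0])  # corner: 1
print("center:", counts[2, 2])  # center: 9
```

The counts fall off from 9 at the center to 1 at each corner, which is the imbalance that padding is meant to reduce.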
Use valid padding when you intentionally want to reduce spatial dimensions — for example, before a pooling layer when you want both operations to downsample together.
Same Padding
Adds p = (k − 1) / 2 zeros on each side (this assumes an odd kernel size and stride=1). For k=3: p = (3−1)/2 = 1 zero per side.
The padded input is (5 + 2×1) × (5 + 2×1) = 7×7.
The convolution on the 7×7 padded input gives (7 − 3) / 1 + 1 = 5 per side, so the output is 5×5, the same as the original input.
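The same arithmetic can be checked for other odd kernel sizes. This is a small sketch with illustrative helper names (`same_padding`, `output_size`); it assumes stride=1 and rejects even kernels, since k − 1 zeros cannot be split evenly between two sides when k is even.

```python
def same_padding(k):
    """Zeros per side that keep the output size equal to the input (odd k, stride 1)."""
    if k % 2 == 0:
        raise ValueError("symmetric same padding needs an odd kernel size")
    return (k - 1) // 2

def output_size(n, k, p, stride=1):
    """General conv output size per side: (n - k + 2p) / s + 1."""
    return (n - k + 2 * p) // stride + 1

for k in (3, 5):
    p = same_padding(k)
    print(f"k={k}: p={p}, padded={5 + 2 * p}, output={output_size(5, k, p)}")
# k=3: p=1, padded=7, output=5
# k=5: p=2, padded=9, output=5
```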
Here is the anchor 5×5 input with 1-pixel zero padding applied (producing 7×7):
Zero-padded 7×7 input:
0 0 0 0 0 0 0
0 10 20 30 40 50 0
0 60 70 80 90 100 0
0 110 120 130 140 150 0
0 160 170 180 190 200 0
0 210 220 230 240 250 0
0 0 0 0 0 0 0
Row 0 and row 6 are all zeros. Column 0 and column 6 are all zeros. The original data occupies rows 1–5, columns 1–5.
With same padding, the kernel can now be centered on every pixel of the original input, including corner pixels. Output[0,0] corresponds to the kernel centered on input pixel (0,0), which now has zeros around it.
Other Padding Types
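Reflect padding: instead of zeros, mirrors the input values across the border (here, mirroring about the edge pixel without repeating it, as in NumPy's `reflect` mode). With 1-pixel padding, pixel (0,1) of the padded input takes the value of input pixel (1,0), because the padding row above the top edge mirrors input row 1. This avoids the zero-value bias at edges: the border statistics match the nearby image content.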
Replicate (clamp) padding: repeats the edge pixel value. The leftmost column of padding gets the same values as the input's leftmost column. Useful in style transfer and image generation tasks.
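The three border strategies can be compared side by side with `np.pad`, which supports `constant`, `reflect`, and `edge` modes. The 3×3 input below is an illustrative stand-in chosen so each value is easy to trace; it is not the anchor input.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

zero = np.pad(X, 1, mode='constant', constant_values=0)  # zero padding
reflect = np.pad(X, 1, mode='reflect')                   # mirror without repeating the edge
edge = np.pad(X, 1, mode='edge')                         # replicate (clamp) the edge pixel

print(zero[0])     # [0 0 0 0 0]
print(reflect[0])  # [5 4 5 6 5]
print(edge[0])     # [1 1 2 3 3]
```

In the reflected result, padded pixel (0,1) holds 4, which is input pixel (1,0). In the replicated result, the left padding column repeats the input's leftmost column.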
Summary Table
| Padding | p value | Output size (5×5 input, 3×3 kernel) | Use case |
|---|---|---|---|
| Valid | 0 | 3×3 | Final conv before pooling (want reduction) |
| Same | 1 | 5×5 | Intermediate layers (preserve spatial dims) |
| Full | 2 | 7×7 | Signal processing (all overlapping positions) |
| Reflect | 1 | 5×5 | Style transfer, image generation (border fidelity) |
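The valid, same, and full sizes in the table have a direct 1D analog in `np.convolve`, whose `mode` argument uses the same three names. The sketch below uses a 5-element signal and a 3-element kernel as stand-ins for one row of the anchor. Note that `np.convolve` flips the kernel (true convolution), whereas CNN layers compute cross-correlation; the output lengths are unaffected.

```python
import numpy as np

signal = np.array([10., 20., 30., 40., 50.])  # length 5, like one side of the anchor input
kernel = np.array([1., 0., -1.])              # length 3

lengths = {mode: len(np.convolve(signal, kernel, mode=mode))
           for mode in ("valid", "same", "full")}
print(lengths)  # {'valid': 3, 'same': 5, 'full': 7}
```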
Typical CNN pattern: same padding throughout the convolutional stack → pooling (or strided conv) to downsample. This way, the spatial dimensions only shrink when you want them to (at pooling layers), not continuously through every conv layer.
Code
```python
import numpy as np

def pad_same(X, k):
    """Zero-pad X so a stride-1 conv with an odd k×k kernel preserves its size."""
    p = (k - 1) // 2
    return np.pad(X, p, mode='constant', constant_values=0)

def conv2d(X, K, stride=1):
    """Valid (unpadded) cross-correlation of a square input X with a square kernel K."""
    n, k = X.shape[0], K.shape[0]
    out = (n - k) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Elementwise product of the k×k window with the kernel, then sum
            result[i, j] = np.sum(X[i*stride:i*stride+k, j*stride:j*stride+k] * K)
    return result

# Stand-in values 10..34 (not the 10..250 anchor values); only the shapes matter here
X = np.arange(25).reshape(5, 5).astype(float) + 10
K = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
X_padded = pad_same(X, k=3)

print("Original (5×5):")
print(X.astype(int))
print("\nSame-padded (7×7, p=1):")
print(X_padded.astype(int))
print(f"\nOutput size (valid): {(5-3)//1+1}×{(5-3)//1+1}")
print(f"Output size (same): {(5-3+2)//1+1}×{(5-3+2)//1+1}")

out_valid = conv2d(X, K)         # no padding: 5×5 -> 3×3
out_same = conv2d(X_padded, K)   # valid conv over the padded 7×7 input: -> 5×5
print(f"\nValid conv output shape: {out_valid.shape}")
print(f"Same conv output shape: {out_same.shape}")
```

```
Original (5×5):
[[10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]
 [25 26 27 28 29]
 [30 31 32 33 34]]

Same-padded (7×7, p=1):
[[ 0  0  0  0  0  0  0]
 [ 0 10 11 12 13 14  0]
 [ 0 15 16 17 18 19  0]
 [ 0 20 21 22 23 24  0]
 [ 0 25 26 27 28 29  0]
 [ 0 30 31 32 33 34  0]
 [ 0  0  0  0  0  0  0]]

Output size (valid): 3×3
Output size (same): 5×5

Valid conv output shape: (3, 3)
Same conv output shape: (5, 5)
```

Convolving the 7×7 same-padded input produces a 5×5 output, matching the original input's spatial dimensions exactly.
Related Concepts
Where this builds from: The convolution output size formula (previous post) — (n − k + 2p) / s + 1 — is where p=0 gives valid and p=(k−1)/2 gives same.
Where this leads: Stride works together with padding to control spatial downsampling. With same padding and stride=2, the output is half the input size — a common pattern used instead of pooling in modern architectures like ResNet.
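The halving behaviour can be checked with the output size formula. This is a sketch assuming a 3×3 kernel with p = (k − 1) / 2 and stride=2; `output_size` is an illustrative helper. For odd input sizes the result rounds up, so "half" here means ceil(n / 2).

```python
import math

def output_size(n, k, p, stride):
    """Conv output size per side: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // stride + 1

k = 3
p = (k - 1) // 2
for n in (5, 8, 224):
    out = output_size(n, k, p, stride=2)
    assert out == math.ceil(n / 2)
    print(f"{n} -> {out}")
# 5 -> 3
# 8 -> 4
# 224 -> 112
```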
Honest Limitations
Zero padding introduces artificial border values that can affect features near edges. The CNN sees zeros at the border that don't represent real image content. This can cause artifacts in feature detection near image boundaries — features "near" the zero border differ from the same features in the image center. In practice, this effect is small for large images but noticeable for small inputs or shallow networks.
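One way to see the border artifact is to convolve a perfectly flat image with an edge-detecting kernel. A flat image has no edges, so every response should be zero. The sketch below uses the vertical-edge kernel from the Code section and a constant 5×5 image (an illustrative input, not the anchor); `conv2d_valid` is a minimal helper written for this example.

```python
import numpy as np

def conv2d_valid(X, K):
    """Stride-1 valid cross-correlation of X with a square kernel K."""
    k = K.shape[0]
    out_h, out_w = X.shape[0] - k + 1, X.shape[1] - k + 1
    result = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            result[i, j] = np.sum(X[i:i+k, j:j+k] * K)
    return result

flat = np.full((5, 5), 100.0)                   # constant image: no real edges
K = np.array([[1, 0, -1]] * 3, dtype=float)     # vertical-edge kernel

zero_out = conv2d_valid(np.pad(flat, 1, mode='constant'), K)
edge_out = conv2d_valid(np.pad(flat, 1, mode='edge'), K)

print(zero_out[2])  # [-300.    0.    0.    0.  300.]
print(edge_out[2])  # [0. 0. 0. 0. 0.]
```

With zero padding, the kernel reports strong edges along the left and right borders that exist only because of the padding. Replicate padding gives the expected all-zero response for this input.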
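Same padding increases computation vs valid. In the anchor example, the 7×7 padded input has 25 kernel positions (5×5) vs 9 (3×3) for valid. That ratio is exaggerated by the tiny input: for a 224×224 image and a 3×3 kernel, one same-padded layer has about 1.8% more positions than a valid one (224² vs 222²). The gap widens with depth, because a valid stack keeps shrinking while a same-padded stack does not. Some architectures (especially on edge devices) use valid padding with carefully planned output sizes to reduce cost.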
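The position counts can be compared directly. This sketch counts kernel positions for a single 3×3 layer at stride=1; `conv_positions` is an illustrative helper, and the 224×224 size is an assumed example of a larger input.

```python
def conv_positions(n, k, p=0, stride=1):
    """Number of kernel positions (output pixels) for a square input."""
    out = (n - k + 2 * p) // stride + 1
    return out * out

for n in (5, 224):
    valid = conv_positions(n, 3, p=0)
    same = conv_positions(n, 3, p=1)
    print(f"n={n}: valid={valid}, same={same}, ratio={same / valid:.3f}")
# n=5: valid=9, same=25, ratio=2.778
# n=224: valid=49284, same=50176, ratio=1.018
```

Each position costs k² multiply-adds per channel pair, so the ratio of positions is also the ratio of arithmetic for that layer.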
Test Your Understanding
- A network applies 10 consecutive 3×3 conv layers with valid padding to a 28×28 input (no pooling). What is the spatial size of the output after each layer? After 10 layers, does any spatial information remain? What padding would you use to prevent this?
- For a 7×7 kernel with same padding, compute p. What is the padded input size for a 6×6 original? Verify that the padded convolution produces a 6×6 output using the output size formula.
- You apply same padding then stride=2. For a 6×6 input with a 3×3 kernel, compute: (a) the padding amount, (b) the padded input size, (c) the output size with stride=2. Compare to applying valid padding with stride=2. Which is used in ResNet's first layer, and why?
- Zero padding adds artificial values that are not part of the real image. Propose an alternative that doesn't introduce artificial values. How would you implement this in NumPy? What are its disadvantages vs zero padding?
- A CNN trained on 224×224 images with same padding throughout is tested on 112×112 images. Would it work? Would the output size change? What if the same model were tested on 448×448 images? How does this scale differently for valid vs same padding?