Max, Min and Average Pooling
After a convolution layer produces a feature map, pooling reduces its spatial dimensions while keeping the most useful information. Without pooling (or strided convolutions), feature maps stay at nearly the same spatial size as the input all the way to the fully connected layers — requiring enormous FC layers with millions of parameters. Pooling solves this.
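To put rough numbers on that claim, the sketch below counts the weights of a first fully connected layer with and without downsampling. The sizes (a 32×32 feature map with 64 channels, a 256-unit FC layer, three rounds of 2×2 stride-2 pooling) are illustrative assumptions, not values from any particular network.

```python
def fc_weights(height, width, channels, units):
    """Weights in a dense layer fed by a flattened (height, width, channels) map; biases ignored."""
    return height * width * channels * units

# Hypothetical sizes chosen only for illustration.
H, W, C, UNITS = 32, 32, 64, 256

no_pooling = fc_weights(H, W, C, UNITS)

# Three rounds of 2x2, stride-2 pooling halve each spatial side three times: 32 -> 16 -> 8 -> 4.
pooled_side = H // 2 // 2 // 2
with_pooling = fc_weights(pooled_side, pooled_side, C, UNITS)

print(f"without pooling: {no_pooling:,} weights")    # 16,777,216
print(f"with pooling:    {with_pooling:,} weights")  # 262,144
print(f"reduction:       {no_pooling // with_pooling}x")  # 64x
```

Each 2×2, stride-2 pooling step cuts the number of spatial positions by 4, so three steps shrink the FC input by a factor of 64.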
Anchor: 4×4 feature map (output from a previous conv layer). Pool window: 2×2, stride=2.
Feature map (4×4):
12 20 3 7
8 15 11 4
6 9 14 18
2 10 5 16
Max Pooling
Take the maximum value in each non-overlapping 2×2 window.
With 2×2 window and stride=2, the 4×4 map divides into 4 non-overlapping quadrants:
Top-left quadrant: rows 0–1, cols 0–1, values = [12, 20, 8, 15] → max = 20
Top-right quadrant: rows 0–1, cols 2–3, values = [3, 7, 11, 4] → max = 11
Bottom-left quadrant: rows 2–3, cols 0–1, values = [6, 9, 2, 10] → max = 10
Bottom-right quadrant: rows 2–3, cols 2–3, values = [14, 18, 5, 16] → max = 18
Max pool output: [[20, 11], [10, 18]]
The 4×4 map is compressed to 2×2 — 4× fewer values. The strongest activations are preserved. If a filter detected a strong edge in the top-left quadrant (value 20), that detection is retained regardless of exactly which of the 4 pixels had the highest response.
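For non-overlapping windows, the same result can be obtained without explicit loops by reshaping the map into blocks. This is a minimal sketch that assumes the stride equals the window size and that the window divides the map evenly; `max_pool_blocks` is an illustrative name, not a library function.

```python
import numpy as np

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def max_pool_blocks(x, size=2):
    """Max pooling with stride == size; assumes both sides of x are multiples of size."""
    h, w = x.shape
    # After the reshape, axis 1 indexes rows inside a window and axis 3 indexes columns inside it.
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

pooled = max_pool_blocks(fm)
print(pooled)        # [[20 11]
                     #  [10 18]]
print(pooled.shape)  # (2, 2)
```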
Average Pooling
Take the mean of each 2×2 window.
Top-left: (12 + 20 + 8 + 15) / 4 = 55 / 4 = 13.75
Top-right: (3 + 7 + 11 + 4) / 4 = 25 / 4 = 6.25
Bottom-left: (6 + 9 + 2 + 10) / 4 = 27 / 4 = 6.75
Bottom-right: (14 + 18 + 5 + 16) / 4 = 53 / 4 = 13.25
Average pool output: [[13.75, 6.25], [6.75, 13.25]]
Average pooling retains the overall activation level of each region, not just the peak. Useful when you care about the "average presence" of a feature in a region, not whether a single pixel had a very strong response.
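One way to read average pooling is as a strided convolution whose kernel weights all equal 1 / (window area). The sketch below checks this on the anchor map, with no padding assumed; `avg_pool_as_conv` is a hypothetical helper written for this illustration.

```python
import numpy as np

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]], dtype=float)

def avg_pool_as_conv(x, size=2, stride=2):
    """Average pooling written as a strided correlation with a uniform kernel."""
    kernel = np.full((size, size), 1.0 / (size * size))  # every weight is 0.25 for a 2x2 window
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = np.sum(window * kernel)
    return out

avg = avg_pool_as_conv(fm)
print(avg)  # [[13.75  6.25]
            #  [ 6.75 13.25]]
```

Unlike a learned convolution, the kernel here is fixed, so average pooling adds no trainable parameters.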
Min Pooling
Take the minimum of each 2×2 window.
Top-left: min(12, 20, 8, 15) = 8
Top-right: min(3, 7, 11, 4) = 3
Bottom-left: min(6, 9, 2, 10) = 2
Bottom-right: min(14, 18, 5, 16) = 5
Min pool output: [[8, 3], [2, 5]]
Min pooling keeps the weakest activation in each region. That can help when the signal of interest is a low value, such as dark spots or shadows in a raw intensity map, or when you want to know where a feature is absent. It is uncommon in practice; max pooling is the usual default.
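Min pooling is tied to max pooling through the identity min(x) = -max(-x), so it can be built from a max pooling operator when no dedicated min operator is at hand. A small check on the anchor map, assuming non-overlapping windows that tile the map exactly (`pool_blocks` is an illustrative helper):

```python
import numpy as np

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def pool_blocks(x, reduce_fn, size=2):
    """Apply reduce_fn over each non-overlapping size x size window of x."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    return reduce_fn(blocks, axis=(1, 3))

min_direct = pool_blocks(fm, np.min)
min_via_max = -pool_blocks(-fm, np.max)  # negate, max pool, negate back

print(min_direct)                               # [[8 3]
                                                #  [2 5]]
print(np.array_equal(min_direct, min_via_max))  # True
```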
Why Max Pooling Gives Translation Invariance
A feature that fires at pixel (0,0) of the top-left quadrant, or slightly shifted at pixel (0,1), produces the same max pool output, because the maximum of the quadrant does not depend on which pixel inside it responded most strongly. The invariance is local and approximate: it holds only for shifts that keep the activation inside the same pool window.
To see this, move the strong activation 20 from position (0,1) to position (1,0) in the top-left quadrant, swapping it with the 8. The max pool output is still 20. The detector is "position-agnostic" within each pool region.
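The shift described above can be checked directly. This sketch swaps the 20 at (0,1) with the 8 at (1,0) and compares the pooled outputs; `max_pool_2x2` is a small helper defined here for illustration and assumes even side lengths.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

# Move the peak of the top-left window from (0, 1) to (1, 0) by swapping it with the 8.
shifted = fm.copy()
shifted[0, 1], shifted[1, 0] = 8, 20

print(max_pool_2x2(fm))       # [[20 11]
                              #  [10 18]]
print(max_pool_2x2(shifted))  # identical, the peak stayed inside its window
print(np.array_equal(max_pool_2x2(fm), max_pool_2x2(shifted)))  # True
```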
Global Average Pooling (GAP)
GAP is the extreme case: pool window = entire feature map → one scalar per channel.
For the anchor 4×4 feature map:
GAP = mean of all 16 values = (12+20+3+7+8+15+11+4+6+9+14+18+2+10+5+16) / 16 = 160/16 = 10
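With several channels, GAP is a mean over the two spatial axes. The sketch below stacks the anchor map with a made-up second channel (a constant map of ones, purely hypothetical) and assumes a (channels, height, width) layout.

```python
import numpy as np

anchor = np.array([[12, 20,  3,  7],
                   [ 8, 15, 11,  4],
                   [ 6,  9, 14, 18],
                   [ 2, 10,  5, 16]], dtype=float)

# Hypothetical second channel with a constant activation of 1.
second = np.ones((4, 4))
features = np.stack([anchor, second])  # shape (2, 4, 4): channels, height, width

def global_avg_pool(x):
    """Average over the spatial axes, leaving one value per channel."""
    return x.mean(axis=(1, 2))

gap = global_avg_pool(features)
print(gap.shape)  # (2,)
print(gap)        # channel 0 -> 10.0, channel 1 -> 1.0
```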
GAP replaces the flatten step in front of the classifier in modern CNNs (ResNet, MobileNet). For a CNN whose last convolutional stage has 512 channels, GAP produces a 512-dimensional vector, one value per channel, which then feeds a small FC layer.
Benefits: dramatically fewer parameters than flatten+FC (no weights per spatial position), built-in regularization (averaging is smoother than selecting individual feature values), model can accept any input size.
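A rough parameter count makes the first benefit concrete, and a shape check illustrates the last one. The sizes below (a 7×7×512 final feature map and 1000 output classes) are assumptions chosen to loosely resemble an ImageNet-style classifier, not figures quoted from a specific model.

```python
import numpy as np

def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs

# Illustrative sizes (assumptions).
H, W, C, CLASSES = 7, 7, 512, 1000

flatten_fc = dense_params(H * W * C, CLASSES)  # one weight per spatial position per channel
gap_fc = dense_params(C, CLASSES)              # one weight per channel

print(f"flatten + FC: {flatten_fc:,}")  # 25,089,000
print(f"GAP + FC:     {gap_fc:,}")      # 513,000

# The GAP output length depends only on the channel count, so the classifier
# input keeps the same size when the spatial size changes.
for h, w in [(7, 7), (10, 13)]:
    pooled = np.zeros((C, h, w)).mean(axis=(1, 2))
    print((h, w), "->", pooled.shape)   # (512,) in both cases
```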
Comparison Table
| Pooling | Retains | Best use | Limitation |
|---|---|---|---|
| Max | Strongest activation | Feature detection (most common) | Discards exact position |
| Average | Overall activation level | Global context, GAP, classification head | Dilutes strong activations |
| Min | Weakest activation | Anomaly detection, dark patterns | Rarely used |
| Global Avg (GAP) | Mean per channel | ResNet, MobileNet final stage | Loses all spatial info |
Code
```python
import numpy as np

# Anchor 4x4 feature map used throughout the post.
fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def pool2d(fm, size=2, stride=2, mode='max'):
    """Pool a 2D feature map with a square window and no padding."""
    fn = {'max': np.max, 'avg': np.mean, 'min': np.min}[mode]
    # Number of window positions that fit along each axis.
    out_h = (fm.shape[0] - size) // stride + 1
    out_w = (fm.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fn(fm[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

print("Feature map:")
print(fm)
print("\nMax pool:")
print(pool2d(fm, mode='max'))
print("\nAvg pool:")
print(pool2d(fm, mode='avg'))
print("\nMin pool:")
print(pool2d(fm, mode='min'))
print(f"\nGAP: {fm.mean():.4f}")
```

```
Feature map:
[[12 20  3  7]
 [ 8 15 11  4]
 [ 6  9 14 18]
 [ 2 10  5 16]]

Max pool:
[[20. 11.]
 [10. 18.]]

Avg pool:
[[13.75  6.25]
 [ 6.75 13.25]]

Min pool:
[[8. 3.]
 [2. 5.]]

GAP: 10.0000
```

Related Concepts
Where this builds from: Convolution produces feature maps (post 04). Pooling operates on those feature maps to reduce their spatial dimensions.
Where this leads: After pooling, the feature maps are small enough to be flattened and fed into a fully connected classification head (next post). Global average pooling is an alternative to flatten+FC that appears in modern architectures (post 09, full CNN pipeline).
Honest Limitations
Max pooling discards the exact position of a feature within each pool window. It reports that a feature exists, not where. For image classification (does the image contain a cat?) this is fine. For object detection (where is the cat?) or semantic segmentation (which pixels belong to the cat?), max pooling loses critical spatial information. Capsule networks were proposed in part to address this.
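To make the discarded information explicit, the sketch below returns the position of each maximum alongside the pooled value. Plain max pooling keeps only the first output; the second is the part a localization task would need. `max_pool_with_positions` is a hypothetical helper and assumes non-overlapping windows.

```python
import numpy as np

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def max_pool_with_positions(x, size=2):
    """Return pooled maxima plus the (row, col) in x where each maximum was found."""
    out_h, out_w = x.shape[0] // size, x.shape[1] // size
    pooled = np.empty((out_h, out_w), dtype=x.dtype)
    positions = {}
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            # argmax works on the flattened window; convert back to (row, col).
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            positions[(i, j)] = (i * size + int(r), j * size + int(c))
    return pooled, positions

pooled, positions = max_pool_with_positions(fm)
print(pooled)     # [[20 11]
                  #  [10 18]]
print(positions)  # {(0, 0): (0, 1), (0, 1): (1, 2), (1, 0): (3, 1), (1, 1): (2, 3)}
```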
Aggressive pooling (large stride or window) loses too much spatial resolution. Modern segmentation models (U-Net, DeepLab) use skip connections to reintroduce fine-grained spatial information that pooling discards. Excessive downsampling before the output is a common cause of blurry segmentation masks.
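How quickly resolution disappears follows from the output-size formula for unpadded pooling, floor((n - window) / stride) + 1. The sketch applies it repeatedly to a hypothetical 64×64 input; the input size and the number of layers are assumptions for illustration.

```python
def pooled_size(n, window=2, stride=2):
    """Output length along one axis for pooling without padding."""
    return (n - window) // stride + 1

# Hypothetical 64x64 input passed through four 2x2, stride-2 pooling layers.
side = 64
sizes = [side]
for _ in range(4):
    side = pooled_size(side)
    sizes.append(side)

print(sizes)  # [64, 32, 16, 8, 4]
```

After four such layers, each remaining value summarizes a 16×16 region of the input (ignoring any convolutions in between), which is the detail skip connections are meant to restore.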
Average pooling can dilute strong activations. In the bottom-right window, the peak response of 18 is averaged with 14, 5, and 16, producing 13.25 instead of 18. In the top-right window, the peak of 11 is averaged with 3, 7, and 4 and drops to 6.25. If the task depends on detecting rare, strong activations, max pooling is safer.
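The size of the dilution can be measured as the ratio of window mean to window maximum. The sketch computes it for the anchor map and for a made-up sparse window (a single strong response among zeros, purely hypothetical) where the effect is most pronounced.

```python
import numpy as np

fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]], dtype=float)

blocks = fm.reshape(2, 2, 2, 2)   # 2x2 grid of non-overlapping 2x2 windows
peak = blocks.max(axis=(1, 3))
mean = blocks.mean(axis=(1, 3))
print(np.round(mean / peak, 3))   # fraction of each peak that survives averaging

# Hypothetical sparse window: one strong response among zeros.
sparse = np.array([[0.0,  0.0],
                   [0.0, 18.0]])
print(sparse.max(), sparse.mean())  # 18.0 4.5
```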
Test Your Understanding
- Apply 3×3 max pooling (stride=3, no overlap) to the anchor 4×4 feature map. How many pool positions fit? What is the output? Compare to the 2×2 pooling output: which loses more information?
- The translation invariance argument for max pooling says a feature shifted within a pool window produces the same output. Show a case where this fails: construct a feature map where shifting a strong activation by 1 pixel changes the max pool output.
- Global average pooling over a 4×4 feature map produces 1 value. GAP over a 4×4×8 feature map (8 channels) produces 8 values. How many parameters does a FC layer (GAP output → 10 classes) require? Compare this to a FC layer (flatten 4×4×8 = 128 → 10 classes). What is the parameter reduction ratio?
- Average pooling is used as the final pooling step before the classifier in some architectures (e.g., GoogLeNet). What is the intuition for why average pooling might generalize better than max pooling at the final stage (just before the classification head)?
- You are designing a CNN for medical image segmentation. The model must output a pixel-wise mask at the same resolution as the input (256×256). If you use three rounds of 2×2 max pooling with stride=2, what is the feature map size after each pooling layer? Why is this a problem for segmentation, and what architectural solution would you apply?