Max, Min and Average Pooling

Jul 1, 2026 • 8 min read • By Mohammed Vasim

deep-learning, neural-networks, machine-learning, representation-learning

After a convolution layer produces a feature map, pooling reduces its spatial dimensions while keeping a summary statistic (maximum, mean, or minimum) of each local region. Without pooling (or strided convolutions), feature maps stay at nearly the same spatial size as the input all the way to the fully connected layers, which then need millions of parameters. Pooling is one way to avoid this.

Anchor: 4×4 feature map (output from a previous conv layer). Pool window: 2×2, stride=2.

text

Feature map (4×4):
12  20   3   7
 8  15  11   4
 6   9  14  18
 2  10   5  16

Max Pooling

Take the maximum value in each non-overlapping 2×2 window.

With 2×2 window and stride=2, the 4×4 map divides into 4 non-overlapping quadrants:

Top-left quadrant: rows 0–1, cols 0–1 values = [12, 20, 8, 15] → max = 20

Top-right quadrant: rows 0–1, cols 2–3 values = [3, 7, 11, 4] → max = 11

Bottom-left quadrant: rows 2–3, cols 0–1 values = [6, 9, 2, 10] → max = 10

Bottom-right quadrant: rows 2–3, cols 2–3 values = [14, 18, 5, 16] → max = 18

Max pool output: [[20, 11], [10, 18]]

The 4×4 map is compressed to 2×2 — 4× fewer values. The strongest activations are preserved. If a filter detected a strong edge in the top-left quadrant (value 20), that detection is retained regardless of exactly which of the 4 pixels had the highest response.
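The quadrant-by-quadrant calculation above can also be written as a single reshape, which is a common NumPy idiom for non-overlapping windows. This is a minimal sketch that assumes the map's height and width are exact multiples of the window size; the variable names are illustrative.

```python
import numpy as np

# Anchor feature map from the post.
fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

# View the 4x4 map as a 2x2 grid of 2x2 blocks:
# axis 0 = block row, axis 1 = row inside the block,
# axis 2 = block column, axis 3 = column inside the block.
blocks = fm.reshape(2, 2, 2, 2)

# Reduce over the two "inside the block" axes: one value per quadrant.
max_pooled = blocks.max(axis=(1, 3))
print(max_pooled)
```

The result matches the hand calculation, [[20, 11], [10, 18]]. The reshape only works when windows do not overlap (stride equal to window size).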

Average Pooling

Take the mean of each 2×2 window.

Top-left: (12 + 20 + 8 + 15) / 4 = 55 / 4 = 13.75

Top-right: (3 + 7 + 11 + 4) / 4 = 25 / 4 = 6.25

Bottom-left: (6 + 9 + 2 + 10) / 4 = 27 / 4 = 6.75

Bottom-right: (14 + 18 + 5 + 16) / 4 = 53 / 4 = 13.25

Average pool output: [[13.75, 6.25], [6.75, 13.25]]

Average pooling retains the overall activation level of each region, not just the peak. It is useful when you care about the "average presence" of a feature in a region rather than whether a single pixel had a very strong response.
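One way to see what "overall activation level" means: with equal-size, non-overlapping windows, the mean of the pooled values equals the mean of the whole map, so average pooling redistributes no activation mass. The sketch below checks this on the anchor map, again assuming the map divides evenly into 2×2 windows.

```python
import numpy as np

# Anchor feature map from the post.
fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

# Group into 2x2 blocks and average inside each block.
avg_pooled = fm.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(avg_pooled)

# The mean of the four window means equals the mean of all 16 values.
print(avg_pooled.mean(), fm.mean())
```

Max pooling does not have this property: the mean of its output is always at least as large as the mean of the input.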

Min Pooling

Take the minimum of each 2×2 window.

Top-left: min(12, 20, 8, 15) = 8

Top-right: min(3, 7, 11, 4) = 3

Bottom-left: min(6, 9, 2, 10) = 2

Bottom-right: min(14, 18, 5, 16) = 5

Min pool output: [[8, 3], [2, 5]]

Min pooling identifies the weakest activation in each region, which can be useful for detecting dark spots, shadows, or regions where a feature is absent. It is less common in practice; max pooling is the default choice.
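Because min pooling is rarely used, a framework may offer max pooling but no dedicated min-pooling layer. In that case the identity min(x) = -max(-x) gives min pooling for free. The sketch below illustrates the identity in NumPy; `max_pool_2x2` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

# Anchor feature map from the post.
fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Negate, take the max, negate back: the result is the minimum of each window.
min_pooled = -max_pool_2x2(-fm)
print(min_pooled)
```

The output agrees with the hand calculation, [[8, 3], [2, 5]].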

Why Max Pooling Gives Translation Invariance

Max pooling is invariant only to small shifts. If a feature's strongest response moves from pixel (0,0) to (0,1) inside the top-left quadrant, the maximum of that quadrant is unchanged, and so is the pooled output. A shift that carries the response across a window boundary does change the output, so the invariance is local and approximate.

To see this, move the strong activation 20 from position (0,1) to position (1,0) in the top-left quadrant, swapping it with the 8: the max pool output is still 20. The detector is "position-agnostic" within each pool region.
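The shift described above is easy to check numerically. The sketch below swaps the values at (0,1) and (1,0) in the anchor map, which keeps the 20 inside the top-left window, and compares the pooled outputs. `max_pool_2x2` is a hypothetical helper for this example.

```python
import numpy as np

# Anchor feature map from the post.
fm = np.array([[12, 20,  3,  7],
               [ 8, 15, 11,  4],
               [ 6,  9, 14, 18],
               [ 2, 10,  5, 16]])

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Move the 20 from (0, 1) to (1, 0) by swapping it with the 8.
shifted = fm.copy()
shifted[0, 1], shifted[1, 0] = fm[1, 0], fm[0, 1]

print(max_pool_2x2(fm))
print(max_pool_2x2(shifted))
print(np.array_equal(max_pool_2x2(fm), max_pool_2x2(shifted)))
```

Both pooled maps are identical because the swap never leaves the top-left 2×2 window.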

Global Average Pooling (GAP)

GAP is the extreme case: pool window = entire feature map → one scalar per channel.

For the anchor 4×4 feature map:

GAP = mean of all 16 values = (12+20+3+7+8+15+11+4+6+9+14+18+2+10+5+16) / 16 = 160/16 = 10.0

GAP replaces the flatten step before the classifier in modern CNNs (ResNet, MobileNet). For a CNN whose final stage has 512 channels, GAP produces a 512-dimensional vector, one value per channel, which then feeds into a small FC layer.

Benefits: dramatically fewer parameters than flatten+FC (no weights per spatial position), built-in regularization (averaging is smoother than selecting individual feature values), model can accept any input size.
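To make the parameter saving concrete, the sketch below applies GAP to a multi-channel tensor and counts the weights of the classifier that follows. The 7×7 spatial size and the 1000 output classes are assumptions chosen for illustration, and the activations are random placeholders, not values from a real network.

```python
import numpy as np

# Hypothetical final-stage activations: 512 channels of 7x7 (assumed sizes).
channels, height, width = 512, 7, 7
num_classes = 1000  # assumed number of output classes

rng = np.random.default_rng(0)
features = rng.standard_normal((channels, height, width))

# GAP: average over the two spatial axes, leaving one value per channel.
gap_vector = features.mean(axis=(1, 2))
print(gap_vector.shape)

# Parameter counts (weights + biases) of the FC layer after each option.
flatten_fc_params = channels * height * width * num_classes + num_classes
gap_fc_params = channels * num_classes + num_classes
print(flatten_fc_params, gap_fc_params)
```

Under these assumptions the flatten+FC head needs 25,089,000 parameters and the GAP+FC head 513,000, roughly a 49× reduction. The GAP head's size also does not depend on `height` or `width`, which is why the input size can vary.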

Comparison Table

Pooling	Retains	Best use	Limitation
Max	Strongest activation	Feature detection (most common)	Discards exact position
Average	Overall activation level	Global context, GAP, classification head	Dilutes strong activations
Min	Weakest activation	Anomaly detection, dark patterns	Rarely used
Global Avg (GAP)	Mean per channel	ResNet, MobileNet final stage	Loses all spatial info

Code

python

import numpy as np

fm = np.array([[12, 20, 3, 7],
               [ 8, 15,11, 4],
               [ 6,  9,14,18],
               [ 2, 10, 5,16]])

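def pool2d(fm, size=2, stride=2, mode='max'):
    """Pool a 2D feature map with a size x size window and no padding."""
    # Pick the reduction applied to each window.
    fn = {'max': np.max, 'avg': np.mean, 'min': np.min}[mode]
    # Number of window positions that fit along each axis.
    out_h = (fm.shape[0] - size) // stride + 1
    out_w = (fm.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))  # float array, so results print as e.g. 20.
    for i in range(out_h):
        for j in range(out_w):
            # Window whose top-left corner is (i*stride, j*stride).
            out[i,j] = fn(fm[i*stride:i*stride+size, j*stride:j*stride+size])
    return out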

print("Feature map:")
print(fm)
print("\nMax pool:")
print(pool2d(fm, mode='max'))
print("\nAvg pool:")
print(pool2d(fm, mode='avg'))
print("\nMin pool:")
print(pool2d(fm, mode='min'))
print(f"\nGAP: {fm.mean():.4f}")

text

Feature map:
[[12 20  3  7]
 [ 8 15 11  4]
 [ 6  9 14 18]
 [ 2 10  5 16]]

Max pool:
[[20. 11.]
 [10. 18.]]

Avg pool:
[[13.75  6.25]
 [ 6.75 13.25]]

Min pool:
[[8. 3.]
 [2. 5.]]

GAP: 10.0000

Related Concepts

Where this builds from: Convolution produces feature maps (post 04). Pooling operates on those feature maps to reduce their spatial dimensions.

Where this leads: After pooling, the feature maps are small enough to be flattened and fed into a fully connected classification head (next post). Global average pooling is an alternative to flatten+FC that appears in modern architectures (post 09, full CNN pipeline).

Honest Limitations

Max pooling discards the exact position of a feature within each pool window. It reports that a feature exists, not where. For image classification (does the image contain a cat?) this is fine. For object detection (where is the cat?) or semantic segmentation (which pixels belong to the cat?), max pooling loses critical spatial information. Capsule networks were proposed in part to address this.

Aggressive pooling (large stride or window) loses too much spatial resolution. Modern segmentation models (U-Net, DeepLab) use skip connections to reintroduce fine-grained spatial information that pooling discards. Excessive downsampling before the output is a common cause of coarse or blurry segmentation masks.
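How quickly resolution disappears follows from the output-size formula used in the post's `pool2d` function, `(n - window) // stride + 1`. The sketch below applies it repeatedly; the 224-pixel input and the five pooling rounds are assumptions for illustration, and `pooled_size` is a hypothetical helper that ignores padding.

```python
def pooled_size(n, window=2, stride=2):
    """Output length along one axis for pooling without padding."""
    return (n - window) // stride + 1

size = 224  # assumed input side length
sizes = [size]
for _ in range(5):  # five rounds of 2x2, stride-2 pooling
    size = pooled_size(size)
    sizes.append(size)

print(sizes)
```

Each round halves the side length, so the sequence is 224, 112, 56, 28, 14, 7: after five rounds each output position summarizes a 32×32 patch of the input.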

Average pooling can dilute strong activations. In the bottom-right window, the peak response 18 is averaged with 14, 5, and 16, producing 13.25 instead of 18; a lone strong response surrounded by weak ones would be diluted even more. If the task depends on detecting rare, strong activations, max pooling is safer.
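The size of the dilution depends on how sparse the window is. The sketch below compares the anchor map's bottom-right window with a hypothetical window in which the same peak sits among zeros; the second window is made up for illustration.

```python
import numpy as np

# Bottom-right window of the anchor feature map.
dense_window = np.array([[14, 18],
                         [ 5, 16]])

# Hypothetical window: the same peak, surrounded by silent responses.
sparse_window = np.array([[0,  0],
                          [0, 18]])

for name, window in [("dense", dense_window), ("sparse", sparse_window)]:
    print(name, "max =", window.max(), "avg =", window.mean())
```

Max pooling reports 18 in both cases, while the average drops from 13.25 to 4.5 when the peak is isolated.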

Test Your Understanding

1. Apply 3×3 max pooling (stride=3, no overlap) to the anchor 4×4 feature map. How many pool positions fit? What is the output? Compare it to the 2×2 pooling output: which loses more information?
2. The translation invariance argument for max pooling says a feature shifted within a pool window produces the same output. Show a case where this fails: construct a feature map where shifting a strong activation by 1 pixel changes the max pool output.
3. Global average pooling over a 4×4 feature map produces 1 value. GAP over a 4×4×8 feature map (8 channels) produces 8 values. How many parameters does an FC layer (GAP output → 10 classes) require? Compare this to an FC layer (flatten 4×4×8 = 128 → 10 classes). What is the parameter reduction ratio?
4. Average pooling is used as the final pooling step before the classifier in some architectures (e.g., GoogLeNet). What is the intuition for why average pooling might generalize better than max pooling at the final stage (just before the classification head)?
5. You are designing a CNN for medical image segmentation. The model must output a pixel-wise mask at the same resolution as the input (256×256). If you use three rounds of 2×2 max pooling with stride=2, what is the feature map size after each pooling layer? Why is this a problem for segmentation, and what architectural solution would you apply?
