Flattening and Fully Connected Layers
Convolution and pooling layers produce feature maps — 3D tensors with spatial structure (height × width × channels). A classifier needs to output "class 0, class 1, or class 2" — a fixed-size vector of scores, one per class. Nothing in a spatial tensor knows how to become that vector on its own. Flattening is the operation that bridges the two: it takes the spatial output of the conv/pool stack and reshapes it into a 1D vector that a standard fully connected (FC) layer can consume.
Anchor: after a conv and pool stack, the network has produced 2 feature maps, each 2×2:
fm1:        fm2:
0.8 0.4     0.1 0.9
0.2 0.6     0.5 0.3

Task: classify into 3 classes.
What Flattening Does
Flattening takes a tensor, stored here channels-first with shape (channels, height, width) = (2, 2, 2), and unrolls it into a single vector by reading off every value in a fixed order (channel by channel, row by row):
flatten(fm1, fm2) = [fm1₀₀, fm1₀₁, fm1₁₀, fm1₁₁, fm2₀₀, fm2₀₁, fm2₁₀, fm2₁₁]
Substituting the anchor values:
flatten = [0.8, 0.4, 0.2, 0.6, 0.1, 0.9, 0.5, 0.3]
8 values in, 8 values out — the numbers don't change, only the shape does. What's lost is the 2D layout: the FC layer that consumes this vector has no idea that 0.8 and 0.4 were neighbors in fm1, or that fm1 and fm2 came from different filters. It just sees 8 numbers.
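The reading order can be written as an index formula. The sketch below assumes NumPy's default row-major order and the channels-first layout used in this post; the helper name `flat_index` is made up for illustration.

```python
import numpy as np

# Anchor feature maps, stacked channels-first: shape (channels, height, width)
fm1 = np.array([[0.8, 0.4], [0.2, 0.6]])
fm2 = np.array([[0.1, 0.9], [0.5, 0.3]])
feature_maps = np.stack([fm1, fm2])
n_ch, h, w = feature_maps.shape

def flat_index(c, i, j):
    """Position of element (channel c, row i, col j) in the row-major flattened vector."""
    return c * h * w + i * w + j

flat = feature_maps.reshape(-1)  # same 8 values, shape (8,)

# Every element lands where the formula says it does.
for c in range(n_ch):
    for i in range(h):
        for j in range(w):
            assert flat[flat_index(c, i, j)] == feature_maps[c, i, j]

# fm2's top-right value sits at index 5 and is 0.9
print(flat_index(1, 0, 1), flat[flat_index(1, 0, 1)])
```

Reshaping back with `flat.reshape(2, 2, 2)` recovers the original tensor, which is another way to see that flattening changes only the shape.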
The FC Classification Head
The flattened vector feeds a fully connected layer. With 8 inputs and 3 output classes, the layer needs a weight matrix W of shape (3, 8) — one row per class, one column per input — plus a bias vector b of length 3:
parameters = 8 × 3 + 3 = 27
The forward pass computes a logit per class: z = W·x + b, where x is the flattened vector.
Using the seeded random weights below (illustrative — in a trained network these come from backprop):
W (3×8):
[ 0.2484 -0.0691 0.3238 0.7615 -0.1171 -0.1171 0.7896 0.3837]
[-0.2347 0.2713 -0.2317 -0.2329 0.1210 -0.9566 -0.8625 -0.2811]
[-0.5064 0.1571 -0.4540 -0.7062 0.7328 -0.1129 0.0338 -0.7124]

z₀ = 0.2484(0.8) + (-0.0691)(0.4) + 0.3238(0.2) + 0.7615(0.6) + (-0.1171)(0.1) + (-0.1171)(0.9) + 0.7896(0.5) + 0.3837(0.3) = 1.0856
z₁ = -1.6298, z₂ = -1.0819 (same dot-product pattern, using the second and third rows of W).
logits = [1.0856, -1.6298, -1.0819]
Softmax converts logits to probabilities: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
probs = [0.8470, 0.0561, 0.0970] → predicted class = 0
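The softmax step can be checked directly from the rounded logits. This is a minimal sketch. Subtracting the largest logit before exponentiating is a common numerical-stability trick, and it does not change the result, because softmax is unaffected when the same constant is added to every logit.

```python
import numpy as np

def softmax(z):
    """Softmax with the max subtracted first to avoid overflow in exp."""
    shifted = z - z.max()
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([1.0856, -1.6298, -1.0819])  # rounded values from the text
probs = softmax(logits)

print(np.round(probs, 3))
print("sum:", probs.sum(), "predicted class:", int(np.argmax(probs)))

# Shifting all logits by a constant leaves the probabilities unchanged.
assert np.allclose(softmax(logits + 100.0), probs)
```

Because the inputs here are the rounded logits, the last digit of a probability can differ slightly from the full-precision result.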
Why FC Comes After Flatten
By the time feature maps reach the FC head, several conv+pool layers have already extracted and compressed the relevant patterns — the spatial dimensions are small (here, 2×2), and what's left is a compact summary of "which features fired, roughly where." The FC layer's job is not to look for spatial patterns anymore — convolution already did that — it's to combine the extracted features into class evidence. That combination step doesn't need 2D structure, it needs every feature to be able to influence every class, which is exactly what a fully connected layer's dense weight matrix does. Flatten is what makes that dense connection possible: a matrix multiply needs a vector, not a tensor.
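One way to see that the dense layer does not rely on adjacency is to shuffle the flattened vector and shuffle the columns of W the same way: the logits come out identical. The sketch below uses arbitrary seeded weights (not the W shown earlier) and a permutation chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily for this sketch

x = np.array([0.8, 0.4, 0.2, 0.6, 0.1, 0.9, 0.5, 0.3])  # flattened anchor
W = rng.normal(size=(3, 8))             # illustrative weights only
b = np.zeros(3)

perm = rng.permutation(8)               # any reordering of the 8 positions

z_original = W @ x + b
z_shuffled = W[:, perm] @ x[perm] + b   # same reordering applied to both

# Each logit is a sum of (weight, input) products, so the order of the pairs is irrelevant.
print(np.allclose(z_original, z_shuffled))  # True
```

The flatten order therefore has to stay consistent between training and inference, but no particular order is privileged.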
Alternative: Global Average Pooling (GAP)
Instead of flatten → FC, GAP averages each feature map down to a single scalar, one per channel — no per-pixel weights required.
For the anchor:
GAP(fm1) = (0.8 + 0.4 + 0.2 + 0.6) / 4 = 2.0 / 4 = 0.5
GAP(fm2) = (0.1 + 0.9 + 0.5 + 0.3) / 4 = 1.8 / 4 = 0.45
GAP output = [0.5, 0.45] — 2 values instead of 8.
An FC layer on top of GAP output (2 → 3 classes) needs only 2 × 3 + 3 = 9 parameters, versus 27 for flatten + FC — a 3× reduction, and the gap widens further as feature maps grow (a 7×7×512 tensor, common near the end of a deep CNN, is 25,088 flattened values vs. 512 after GAP).
GAP is standard in ResNet and MobileNet, where it replaces the flatten step: the pooled per-channel vector feeds the final classifier layer directly.
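To make the 7×7×512 comparison concrete, the following sketch counts parameters for both heads. The 1000-class output is an assumption in the style of ImageNet classifiers, and `fc_params` is a helper name invented here.

```python
def fc_params(n_inputs, n_classes):
    """Weights plus biases for one dense layer."""
    return n_inputs * n_classes + n_classes

height, width, channels = 7, 7, 512
n_classes = 1000  # assumption: ImageNet-style output size

flatten_head = fc_params(height * width * channels, n_classes)
gap_head = fc_params(channels, n_classes)

print(f"flatten + FC: {flatten_head:,}")   # 25,089,000
print(f"GAP + FC:     {gap_head:,}")       # 513,000
print(f"ratio:        {flatten_head / gap_head:.1f}x")

# The same helper reproduces the anchor's counts.
assert fc_params(8, 3) == 27 and fc_params(2, 3) == 9
```

With the channel count fixed, the ratio between the two heads is close to the number of spatial positions (49 here), since the bias terms are negligible at this scale.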
Hyperparameter Sensitivity: Feature Map Size Before Flatten
With channel count and class count held fixed, the parameter cost of flatten + FC is driven entirely by the spatial size of the feature maps reaching the flatten step. Holding channels at 2 and output classes at 3 (as in the anchor), varying the spatial size shows how fast flatten + FC parameters grow compared to GAP + FC:
channels = 2
classes = 3

# Parameter count = weights (inputs x classes) + biases (classes)
for spatial in [2, 4, 7, 14, 28]:
    flat_dim = channels * spatial * spatial
    flatten_params = flat_dim * classes + classes
    gap_params = channels * classes + classes
    print(f"{spatial}x{spatial}: flatten_dim={flat_dim:5d} "
          f"flatten+FC={flatten_params:6d} GAP+FC={gap_params}")

2x2: flatten_dim=    8 flatten+FC=    27 GAP+FC=9
4x4: flatten_dim=   32 flatten+FC=    99 GAP+FC=9
7x7: flatten_dim=   98 flatten+FC=   297 GAP+FC=9
14x14: flatten_dim=  392 flatten+FC=  1179 GAP+FC=9
28x28: flatten_dim= 1568 flatten+FC=  4707 GAP+FC=9

Flatten + FC parameters scale with the square of the spatial dimension (doubling spatial size roughly quadruples the flatten+FC parameter count), while GAP + FC stays fixed at 9 regardless of spatial size: GAP's cost depends only on channel count and class count, never on spatial resolution. This is why deep CNNs pool aggressively before the classification head: every doubling of spatial size left unpooled multiplies the FC layer's weight count by four.
Code
import numpy as np

fm1 = np.array([[0.8, 0.4], [0.2, 0.6]])
fm2 = np.array([[0.1, 0.9], [0.5, 0.3]])
feature_maps = np.stack([fm1, fm2])  # shape (2, 2, 2)

# Flatten
flat = feature_maps.flatten()
print("Flattened:", np.round(flat, 2))

# FC layer (seeded weights for reproducibility)
np.random.seed(42)
W = np.random.randn(3, 8) * 0.5
b = np.zeros(3)
z = W @ flat + b

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(z)
print("FC output (logits):", np.round(z, 4))
print("Class probabilities:", np.round(probs, 4))
print("Predicted class:", np.argmax(probs))

# GAP alternative
gap = feature_maps.mean(axis=(1, 2))
print("\nGAP output:", np.round(gap, 4))

Flattened: [0.8 0.4 0.2 0.6 0.1 0.9 0.5 0.3]
FC output (logits): [ 1.0856 -1.6298 -1.0819]
Class probabilities: [0.847  0.0561 0.097 ]
Predicted class: 0

GAP output: [0.5  0.45]

Related Concepts
Where this builds from: Convolution (post 04) and pooling (post 07) produce the spatial feature map tensors that get flattened here. The fully connected mechanics themselves are the same dense layer covered in the ANN posts (posts 02 and 03).
Where this leads: Flatten + FC (or GAP + FC) is the final stage of a complete CNN pipeline — the next post assembles convolution, pooling, and this classification head into a full worked example on RGB input. GAP specifically is the design choice behind ResNet's and MobileNet's final layers.
Honest Limitations
Flatten + a large FC layer is a major source of overfitting in CNNs when feature maps are still large. A 7×7×512 tensor flattened to 25,088 values, feeding a 1000-class FC layer, needs over 25 million weights (25,088 × 1000 = 25,088,000) in that single layer alone, which can exceed the weight count of all the convolutional layers combined. With limited training data this overfits badly; GAP or aggressive dropout on the FC layer is the standard fix.
A flatten + FC head collapses the spatial layout into a single class vector. Each position still gets its own weight, but the output says nothing about where a feature occurred, which is fine for "what class is this" but wrong for "where is it." If the task requires localization (object detection, segmentation), a flatten+FC head throws away exactly the information needed; those architectures keep spatial structure through the head instead (region proposals, fully convolutional output layers).
Flattening assumes a fixed input size. The FC layer's weight matrix has a fixed number of columns (one per flattened value), so a different input image size that changes the feature map's spatial dimensions breaks the matrix multiply. GAP sidesteps this — averaging works regardless of spatial size, which is part of why architectures using GAP can accept variable input resolutions.
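The fixed-size constraint is easy to reproduce. In this sketch the two weight matrices are random placeholders and the 4×4 maps stand in for the feature maps of a larger input image; the point is only the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, classes = 2, 3

W_flat = rng.normal(size=(classes, channels * 2 * 2))  # sized for 2x2 maps
W_gap = rng.normal(size=(classes, channels))           # sized per channel only

small = rng.random((channels, 2, 2))   # the size the heads were built for
large = rng.random((channels, 4, 4))   # feature maps from a bigger input

def flatten_head(fmaps):
    return W_flat @ fmaps.reshape(-1)

def gap_head(fmaps):
    return W_gap @ fmaps.mean(axis=(1, 2))

print(flatten_head(small).shape, gap_head(small).shape)  # both (3,)
print(gap_head(large).shape)                             # still (3,)

try:
    flatten_head(large)                # 32 inputs vs. 8 weight columns
except ValueError as err:
    print("flatten + FC failed:", err)
```

The GAP head works on both inputs because averaging always produces one value per channel, so its weight matrix never sees the spatial size.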
Test Your Understanding
- Why does flattening lose the 2D adjacency information (which values were neighbors), and why does the FC layer not need that information to make a classification decision?
- Given a feature map stack of shape (4 channels, 3×3 spatial) feeding an FC layer with 5 output classes, compute the number of parameters for (a) flatten + FC and (b) GAP + FC. What is the ratio?
- In the anchor's FC forward pass, z₀ = 1.0856 was the largest logit and led to the correct-looking prediction. If you scaled every weight in row 0 of W by 0.1 (making class 0's logit much smaller) while leaving rows 1 and 2 unchanged, would the predicted class necessarily change? Why or why not?
- A network trained with flatten + FC on 128×128 images is given a 256×256 image at inference. What breaks, and why does the same network using GAP + FC not break in this scenario?
- Suppose two different filters in fm1 and fm2 both detect "edges," but at different orientations, and their activations end up adjacent in the flattened vector purely by coincidence of channel order. Could the FC layer learn to treat them as related just because they're adjacent in the vector? What does this imply about whether flatten order matters for what the network can learn?