Human Brain vs CNN
The first CNNs were not designed from abstract principles. They were designed by looking at how the mammalian visual system processes information, then reproducing that structure in silicon. Understanding where CNNs came from reveals why their specific design choices work, and where they still fall short.
Anchor: recognizing a handwritten digit "7" in a 28×28 image.
How the Visual Cortex Processes an Image
When you see a "7", light hits your retina and triggers a cascade through at least four distinct processing stages in your visual cortex:
V1 (Primary Visual Cortex): Simple cells respond to edges at specific orientations: a vertical edge, a 45° edge, a horizontal edge. Complex cells respond to the same edges regardless of exact position within their receptive field. Simple cells are the biological counterparts of the filters in a CNN's first convolutional layer; complex cells behave more like the pooling step that follows it.
V2: Processes more complex features — corners, curves, texture borders. Combines edge information from V1. Analogous to a second convolutional layer.
V4: Shape and color selectivity. Neurons respond to specific curved forms and combinations of edges. Analogous to deeper convolutional layers with larger receptive fields.
IT (Inferotemporal Cortex): Object-level representations. "Face cells" respond selectively to faces, largely independent of size and position. This is the biological analog of the CNN's final feature maps and fully connected classification head.
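The stage-to-layer mapping above can be sketched as a tiny forward pass. This is a minimal NumPy illustration with random, untrained weights; the filter counts (8 and 16), the helper names `conv2d` and `max_pool`, and the random stand-in image are all assumptions made for the example, not part of any particular architecture.

```python
import numpy as np

def conv2d(x, kernels):
    """Valid cross-correlation. x: (C_in, H, W); kernels: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[:, i:i + k, j:j + k]  # local receptive field
            out[:, i, j] = np.tensordot(kernels, window, axes=3)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; crops rows/columns that do not fit."""
    c, h, w = x.shape
    h2, w2 = h // size, w // size
    x = x[:, :h2 * size, :w2 * size]
    return x.reshape(c, h2, size, w2, size).max(axis=(2, 4))

rng = np.random.default_rng(0)
image = rng.random((1, 28, 28))        # stand-in for a 28×28 digit image

k1 = rng.normal(size=(8, 1, 3, 3))     # "V1-like" stage: 8 small filters (assumed count)
k2 = rng.normal(size=(16, 8, 3, 3))    # "V2/V4-like" stage: combines stage-1 features

stage1 = max_pool(np.maximum(conv2d(image, k1), 0))   # conv -> ReLU -> pool
stage2 = max_pool(np.maximum(conv2d(stage1, k2), 0))  # conv -> ReLU -> pool

w_fc = rng.normal(size=(10, stage2.size))              # "IT-like" readout: 10 digit classes
logits = w_fc @ stage2.ravel()

print(stage1.shape, stage2.shape, logits.shape)
```

Each stage shrinks the spatial grid while increasing the number of feature channels, which is the same trade the ventral stream makes between position detail and feature complexity.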
Hubel and Wiesel's Discovery (1959)
David Hubel and Torsten Wiesel won the Nobel Prize in 1981 for discovering that V1 neurons in cats responded to oriented edges, not whole objects. Each neuron was tuned to a preferred orientation: a neuron that fired strongly to a vertical edge fired weakly or not at all as the edge was rotated away from vertical. Some neurons fired to an edge anywhere in their receptive field (complex cells); others fired only when the edge was at a specific location (simple cells).
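Kunihiko Fukushima's Neocognitron (1980) was the first network built explicitly on this finding, with alternating layers modeled on simple and complex cells. Yann LeCun's 1989 convolutional network, the precursor of LeNet, added end-to-end training with backpropagation. In these architectures and their successors the mapping is direct: early layers hold orientation-sensitive filters (like simple cells), pooling layers add spatial tolerance (like complex cells), and deeper layers combine those into higher-level features.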
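The simple-cell versus complex-cell distinction can be sketched in a few lines. This is a toy model under stated assumptions: a "simple cell" is one filter applied at one fixed location, and a "complex cell" is the maximum of that same filter's response over all locations (the same idea as max pooling). The function names and the 7×7 stimulus are hypothetical.

```python
import numpy as np

vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

def make_stimulus(edge_col, size=7):
    """Dark-to-bright vertical edge: columns >= edge_col are bright."""
    img = np.zeros((size, size))
    img[:, edge_col:] = 1.0
    return img

def simple_cell(img, row=2, col=2):
    """Filter response for the single 3×3 window whose top-left corner is (row, col)."""
    return float(np.sum(img[row:row + 3, col:col + 3] * vertical_edge))

def complex_cell(img):
    """Maximum of the same filter's response over every window position."""
    size = img.shape[0]
    return max(simple_cell(img, r, c)
               for r in range(size - 2) for c in range(size - 2))

for edge_col in (2, 3, 4, 5):
    img = make_stimulus(edge_col)
    print(edge_col, simple_cell(img), complex_cell(img))
```

As the edge moves across the image, the simple cell responds only when the edge falls inside its fixed window (edge columns 3 and 4 here), while the complex cell's response stays at 3.0 for every position.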
Learned Filters Visualization
In a trained CNN, the first convolutional layer's 64 filters look like what Hubel and Wiesel observed in V1: Gabor-like patterns such as edges, blobs, and color patches at various orientations.
[Visualization: each row shows one learned 3×3 filter as its weight pattern.]
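For readers who have not seen a Gabor pattern, the sketch below generates one from the standard formula: a Gaussian envelope multiplied by a cosine carrier along a rotated axis. The function name `gabor_kernel` and all default parameter values are assumptions chosen for illustration, and the 7×7 size is used only because the pattern is hard to see at 3×3.

```python
import numpy as np

def gabor_kernel(size=7, theta=0.0, wavelength=4.0, sigma=2.0, gamma=0.5, phase=0.0):
    """Real-valued Gabor kernel: Gaussian envelope times a cosine carrier.

    theta      orientation of the carrier axis, in radians
    wavelength carrier period in pixels
    sigma      width of the Gaussian envelope
    gamma      aspect ratio of the envelope
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier varies along x_r.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + phase)
    return envelope * carrier

# A small bank at four orientations: 0°, 45°, 90°, 135°.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]

np.set_printoptions(precision=2, suppress=True)
print(bank[0])
```

With `theta=0` the stripes run vertically, so the kernel acts as a vertical-edge and bar detector; rotating `theta` rotates the preferred orientation.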
Brain vs CNN — Comparison Table
| Property | Human Visual System | CNN |
|---|---|---|
| Processing stages | V1 → V2 → V4 → IT (~4 stages) | Conv layers (typically 5–50) |
| Feature hierarchy | Edges → corners → shapes → objects | Edges → textures → parts → objects |
| Parameter sharing | Complex cells respond to edges anywhere in receptive field | Same filter weights applied across all spatial positions |
| Top-down feedback | Extensive — higher areas modulate lower area processing | None — purely feedforward |
| Training signal | Evolutionary + lifetime experience | Labeled examples (supervised) or self-supervised |
| Robustness to perturbation | Handles noise, occlusion, viewpoint changes | Brittle to adversarial perturbations |
| Generalization from examples | ~10–50 examples (one-shot learning possible) | Typically 1,000–100,000 examples per class |
Three Insights CNNs Took from Neuroscience
1. Local receptive fields. V1 neurons respond only to a small patch of the visual field. CNNs use small filters (3×3, 5×5), so each output neuron sees only a local patch of the input. This biologically motivated design choice also turned out to be computationally efficient for images: a small shared filter needs far fewer weights than a fully connected layer over the same input.
2. Hierarchical feature composition. Simple features (V1 edges) combine into complex features (V4 shapes). CNNs stack conv layers to achieve the same effect: early layers detect edges, later layers detect object parts. This was not obvious from first principles — it came from studying the brain.
3. Translation tolerance. Complex cells in V1 fire regardless of where in the receptive field the edge appears. Pooling layers in CNNs serve the same function: a feature detected in the top-left of a pool window produces the same output as the same feature in the bottom-right.
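Two of the points above reduce to simple arithmetic: how few weights a local filter needs, and how stacking layers widens what each unit "sees". The sketch below uses the usual receptive-field recurrence (each layer adds `(kernel - 1) × accumulated stride` pixels); the function names and the example layer stack are assumptions for illustration.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k×k conv layer; independent of image size."""
    return k * k * c_in * c_out + c_out

def dense_params(h, w, c_in, n_out):
    """Weights plus biases of a fully connected layer over the whole image."""
    return h * w * c_in * n_out + n_out

def receptive_field(layers):
    """Side length, in input pixels, seen by one unit after the given layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the view by (k-1) input-pixel steps
        jump *= s              # strides compound the step size for later layers
    return rf

# Insight 1: 8 local 3×3 filters vs 8 fully connected units on a 28×28 image.
print(conv_params(3, 1, 8), dense_params(28, 28, 1, 8))

# Insight 2: conv3 -> pool2 -> conv3 -> pool2 (hypothetical stack).
print(receptive_field([(3, 1)]))
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))
```

The local layer needs 80 parameters where the dense one needs 6,280, and after two conv and pool stages each unit already covers a 10×10 region of the input, which is how later layers can respond to shapes rather than single edges.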
What CNNs Don't Have (But the Brain Does)
Top-down feedback. The brain sends signals from higher areas back to lower areas — IT cortex influences V1. When you look for a red apple, your brain pre-activates V4 color detectors. CNNs are purely feedforward: data flows in one direction, from input to output.
Dynamic computation. The brain decides when it has enough evidence, spending more computation on ambiguous inputs. A standard CNN applies exactly the same number of operations to every input of a given size, whether the image is crisp and unambiguous or blurry and hard to read.
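The fixed-cost point can be made concrete by counting multiply-accumulate operations (MACs) for one convolutional layer. This is a rough sketch that assumes no padding and ignores biases and activations; `conv_macs` is a hypothetical helper name.

```python
def conv_macs(h_in, w_in, c_in, c_out, k, stride=1):
    """Multiply-accumulates for one conv layer with no padding.

    The count depends only on shapes, never on pixel values.
    """
    h_out = (h_in - k) // stride + 1
    w_out = (w_in - k) // stride + 1
    return h_out * w_out * c_out * k * k * c_in

# 8 filters of size 3×3 over a single-channel 28×28 image.
print(conv_macs(28, 28, 1, 8, 3))
```

Nothing in the function's arguments describes the image content, so an easy "7" and an ambiguous scribble cost the same 48,672 MACs in this layer.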
One-shot learning. A child can recognize a new animal from 3 examples. Vanilla CNNs need thousands. Neuroscience-inspired approaches (memory-augmented networks, meta-learning) are active research areas.
Code
```python
import numpy as np

# Four hand-coded 3×3 filters, in the spirit of the orientation-tuned
# cells Hubel and Wiesel described
filters = {
    "Horizontal edge (+/−)": np.array([[ 1,  1,  1],
                                       [ 0,  0,  0],
                                       [-1, -1, -1]]),
    "Vertical edge (+/−)": np.array([[-1, 0, 1],
                                     [-1, 0, 1],
                                     [-1, 0, 1]]),
    "45° diagonal": np.array([[ 0, -1, -1],
                              [ 1,  0, -1],
                              [ 1,  1,  0]]),
    "Center blob": np.array([[-1, -1, -1],
                             [-1,  8, -1],
                             [-1, -1, -1]]),
}

# Anchor: simplified digit "7" in a 5×5 patch
patch = np.array([
    [255, 255, 255, 255, 255],
    [  0,   0,   0, 255,   0],
    [  0,   0, 255,   0,   0],
    [  0, 255,   0,   0,   0],
    [255,   0,   0,   0,   0],
], dtype=float)

for name, f in filters.items():
    # Apply the filter manually to the 3×3 window at the top-left corner
    response = np.sum(patch[:3, :3] * f)
    print(f"{name}: response at (0,0) = {response:.0f}")
```

```
Horizontal edge (+/−): response at (0,0) = 510
Vertical edge (+/−): response at (0,0) = 255
45° diagonal: response at (0,0) = -510
Center blob: response at (0,0) = -1020
```

The horizontal edge filter returns 510 at (0,0) because the top row of the 3×3 window is fully bright (3 × 255 = 765) while the bottom row contains a single bright pixel (255): bright above dark, which is the top stroke of the "7". The vertical edge filter returns a weaker 255 because the right column of the window holds two bright pixels and the left column one. The 45° diagonal filter returns −510 because its negative weights sit on the bright top row while its positive weights land on dark pixels. The center blob filter returns −1020 because the window's center pixel is dark while four of its eight neighbors are bright.
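Evaluating a filter at a single window is only one step of what a convolutional layer does. The sketch below slides the horizontal edge filter over every valid 3×3 window of the same patch; `correlate_valid` is a hypothetical helper written for this post, and like most deep learning libraries it computes cross-correlation (no kernel flip).

```python
import numpy as np

patch = np.array([
    [255, 255, 255, 255, 255],
    [  0,   0,   0, 255,   0],
    [  0,   0, 255,   0,   0],
    [  0, 255,   0,   0,   0],
    [255,   0,   0,   0,   0],
], dtype=float)

horizontal = np.array([[ 1,  1,  1],
                       [ 0,  0,  0],
                       [-1, -1, -1]], dtype=float)

def correlate_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

response_map = correlate_valid(patch, horizontal)
print(response_map)
```

The result is a 3×3 response map whose entry at [0, 0] is the top-left window. Its top row is 510 everywhere: the filter fires along the whole top stroke of the "7", which is the kind of position-by-position feature map a convolutional layer passes on to the next stage.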
Related Concepts
Where this builds from: The CNN introduction (previous post) motivated CNNs as parameter-efficient spatial learners. This post explains why the specific CNN design works — because it mirrors biological vision processing.
Where this leads: The convolution operation (next two posts) mechanically implements what V1 does biologically — sliding a filter (like a simple cell's receptive field) across the entire image. The hierarchy of features becomes concrete once you understand how multiple stacked conv layers combine to build up representations.
Honest Limitations
The brain analogy is a motivation, not a proof. CNNs were inspired by neuroscience, but decades of research have shown they diverge significantly from biological vision in their failure modes. CNNs are brittle to adversarial examples (adding imperceptible pixel noise flips predictions) — the human visual system is not. The neuroscience analogy explains the architecture but does not guarantee biological fidelity.
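The mechanics of an adversarial perturbation can be sketched on something much simpler than a CNN. The example below is a linear stand-in, not a convolutional network: the weights, the image, and the bias are random or hand-picked assumptions, and the perturbation is a single sign-of-gradient step (for a linear score, the gradient with respect to the input is just the weight vector). Pixel values are not clipped back to [0, 1], to keep the arithmetic exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 28 * 28

w = rng.normal(size=n_pixels)   # hypothetical linear "classifier" weights
x = rng.random(n_pixels)        # hypothetical image, pixel values in [0, 1)
b = 1.0 - w @ x                 # bias picked so the clean score is about +1

def score(v):
    """Positive score means class A, negative means class B."""
    return float(w @ v + b)

eps = 0.01                       # at most 1% of the pixel range per pixel
x_adv = x - eps * np.sign(w)     # nudge every pixel against the score's gradient

print("clean score:      ", score(x))
print("adversarial score:", score(x_adv))
print("largest pixel change:", np.max(np.abs(x_adv - x)))
```

Each pixel moves by only 0.01, but the score shifts by `eps` times the sum of absolute weights over all 784 pixels, which is enough to flip the sign. Many tiny, coordinated changes adding up across a high-dimensional input is one commonly given intuition for why such perturbations work.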
CNNs have no top-down feedback, which limits compositional reasoning. A CNN viewing a partially occluded object cannot use its knowledge of the whole to fill in the missing parts. Humans do this effortlessly. Capsule networks and attention mechanisms are research directions aimed at closing this gap.
The feature hierarchy metaphor breaks down in very deep networks. In a 152-layer ResNet, layer 100's features are not interpretable as named visual concepts like "corners" or "curves." The hierarchy metaphor holds for the first 5–10 layers. Beyond that, the representations become abstract and difficult to map to neuroscientific equivalents.
Test Your Understanding
- Hubel and Wiesel found that simple cells respond to edges at specific orientations and specific locations, while complex cells respond to edges at specific orientations but any location in the receptive field. Which CNN mechanism corresponds to simple cells? Which corresponds to complex cells?
- A trained CNN's first layer has 64 filters. After training on ImageNet (1.2M natural images), these filters tend to look like oriented edge detectors and color patches. If you train the same CNN on random noise images (no structure), what would you expect the learned filters to look like, and why?
- The human visual system uses top-down feedback: the brain can direct attention to specific features. In the digit "7" recognition task, how might top-down feedback help with recognizing a partially occluded "7" (only the top stroke is visible)? Would a standard CNN succeed or fail at this? Why?
- A CNN trained only on upright faces fails on upside-down faces, even though a human still recognizes an inverted face as a face almost immediately. Explain this failure in terms of (a) the CNN's feature hierarchy and (b) the absence of top-down feedback.
- The brain's V1 and V2 stages together take ~100ms to process a visual scene. A 50-layer ResNet on a GPU can process an image in ~10ms. Despite the speed advantage, CNNs require far more labeled training data than humans do. Propose one specific architecture change or training procedure that might reduce this data requirement, citing which aspect of biological visual learning it would mimic.