CNN Introduction

Jul 1, 2026•8 min read•By Mohammed Vasim

deep-learning neural-networks machine-learning representation-learning

A standard ANN with 784 inputs (a 28×28 image) and one 100-neuron hidden layer has 78,400 weights in that first layer (784 × 100, plus 100 biases). A 224×224 RGB image has 150,528 input values, so the same 100-neuron hidden layer alone would need about 15 million weights. ANNs also ignore spatial structure: every pixel is treated as an independent feature with no notion of position, so the network has no built-in way to know that pixels in the same region of an image are highly correlated while pixel (0,0) and pixel (223,223) are barely related.

CNNs solve both problems at once. They share parameters across the image (reducing parameter count by orders of magnitude) and exploit spatial locality (nearby pixels are treated as related).

Anchor: 6×6 grayscale image. Task: detect whether the image contains a horizontal edge.

text

Image (6×6 pixels):
  0   0   0   0   0   0
  0   0   0   0   0   0
255 255 255 255 255 255
255 255 255 255 255 255
  0   0   0   0   0   0
  0   0   0   0   0   0

Rows 3 and 4 are bright (255); rows 1–2 and 5–6 are dark (0). There is a clear horizontal edge at each of the two transitions.

The Three Key Ideas

1. Local connectivity. Each output neuron sees only a small region (patch) of the input — called the receptive field. A 3×3 kernel looks at 9 pixels at a time, not all 36.

2. Parameter sharing. The same filter weights are used across the entire image. The horizontal edge detector kernel applied at position (0,0) is identical to the one applied at position (3,3). This is why a 3×3 filter over 3 input channels needs only 27 weights (plus one bias) no matter how large the image is, whereas a fully connected neuron needs a separate weight for every input value (36 for the 6×6 anchor image, 150,528 for a 224×224 RGB image).

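3. Translation invariance. A horizontal edge at the top of the image and one at the bottom activate the same filter. The model doesn't need to learn "horizontal edge at position 3" and "horizontal edge at position 5" as separate concepts. Strictly speaking, the convolution itself is translation-equivariant (shift the input and the feature map shifts with it); pooling and the later layers are what turn this into approximate invariance.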
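The second and third ideas can be checked numerically. The sketch below is illustrative only: the 8×8 test images and the helper name `correlate2d_valid` are assumptions made for this example, not part of the post's anchor setup. It applies one 3×3 kernel to a bright band near the top of an image and to the same band two rows lower.

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Slide kernel over img (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

# Hypothetical 8×8 images: a bright two-row band, then the same band 2 rows lower
top = np.zeros((8, 8))
top[2:4, :] = 255
low = np.zeros((8, 8))
low[4:6, :] = 255

out_top = correlate2d_valid(top, kernel)
out_low = correlate2d_valid(low, kernel)

# Same 9 weights, same response pattern, just moved down by 2 rows
print(np.array_equal(out_top[:-2], out_low[2:]))  # True
# Taking the max over the whole map discards position entirely
print(out_top.max() == out_low.max())             # True
```

The first check shows the feature map shifting with the input; the second shows how a pooling-style maximum makes the result independent of where the band sits.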

Architecture Overview

For the 6×6 anchor image:

Step	Operation	Output shape
Input	Raw pixels	6×6×1
Conv (3×3 filter, 1 filter)	Slide filter, multiply-sum	4×4×1
ReLU	Element-wise max(0, z)	4×4×1
MaxPool (2×2, stride=2)	Take max in each 2×2 region	2×2×1
Flatten	Reshape to vector	4
FC + Softmax	Dense layer → class scores	2 (classes)
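The shapes in the table can be traced with a few lines of NumPy. This is a minimal sketch, not a trained model: the fully connected weights `W` and `b` are random placeholders chosen for illustration, and the trailing channel dimension (the "×1" in the table) is dropped for brevity.

```python
import numpy as np

# Anchor image: rows 3 and 4 (indices 2 and 3) are bright
img = np.zeros((6, 6))
img[2:4, :] = 255
kernel = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

# Conv: one 3×3 filter, no padding, stride 1 -> 4×4
conv = np.array([[np.sum(img[i:i + 3, j:j + 3] * kernel) for j in range(4)]
                 for i in range(4)])

# ReLU: element-wise max(0, z) -> 4×4
relu = np.maximum(conv, 0)

# MaxPool 2×2, stride 2: view as a 2×2 grid of 2×2 blocks, take each block's max
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))

# Flatten -> vector of length 4
flat = pooled.reshape(-1)

# FC + softmax: hypothetical untrained weights, 4 inputs -> 2 classes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(2, 4))
b = np.zeros(2)
logits = W @ flat + b
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()

for name, arr in [("conv", conv), ("relu", relu), ("pooled", pooled),
                  ("flat", flat), ("probs", probs)]:
    print(f"{name:6s} shape {arr.shape}")
```

Because the weights are untrained, the class probabilities carry no meaning here; only the shapes at each step are the point.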

What a Filter Learns

The horizontal edge detector kernel is not hand-coded — in a real CNN, filters like this are learned from data during training. But we can show what a horizontal edge detector looks like:

text

Horizontal edge detector kernel (3×3):
 [[ 1,  1,  1],
  [ 0,  0,  0],
  [-1, -1, -1]]

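Applied to the anchor image (sliding the kernel without flipping it, as CNN layers do): where the window has bright pixels under its top row and dark pixels under its bottom row (a bright-to-dark transition going down), the response is large and positive. Where dark sits above bright, the response is large and negative. A window over a uniform region produces 0, because the +1 row and the −1 row cancel.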

Early CNN layers learn kernels that look like this (edge detectors, color detectors). Middle layers combine those into textures and corners. Deep layers combine those into parts, then objects. The hierarchy is emergent — not programmed.

ANN vs CNN — Properties

Property	ANN	CNN
Parameters for 224×224 RGB → 100 neurons	15,052,800	1,728 (64 filters, 3×3×3)
Spatial awareness	None	Yes — local receptive field
Parameter reuse	No	Yes — same filter across all positions
Translation invariant	No	Yes — via parameter sharing
Best for	Tabular data	Images and spatial data
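The parameter counts are simple products, so they are easy to verify. In the sketch below, `dense_params` and `conv_params` are hypothetical helper names introduced for this example; biases are excluded by passing `bias=False` so the numbers are weight counts only.

```python
def dense_params(n_inputs, n_units, bias=True):
    """Weights (plus optional biases) of one fully connected layer."""
    return n_inputs * n_units + (n_units if bias else 0)

def conv_params(kernel_h, kernel_w, in_channels, n_filters, bias=True):
    """Weights (plus optional biases) of one conv layer; independent of image size."""
    return kernel_h * kernel_w * in_channels * n_filters + (n_filters if bias else 0)

mnist_dense = dense_params(28 * 28, 100, bias=False)      # 78,400
rgb_dense = dense_params(224 * 224 * 3, 100, bias=False)  # 15,052,800
one_filter = conv_params(3, 3, 3, 1, bias=False)          # 27
conv_64 = conv_params(3, 3, 3, 64, bias=False)            # 1,728

print(mnist_dense, rgb_dense, one_filter, conv_64)
print(f"dense / conv ratio: {rgb_dense / conv_64:.0f}x")
```

Note that the convolutional count never mentions the image height or width, which is the parameter-sharing property in numeric form.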

Code

python

import numpy as np

# Anchor 6×6 image
img = np.array([
    [  0,   0,   0,   0,   0,   0],
    [  0,   0,   0,   0,   0,   0],
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [  0,   0,   0,   0,   0,   0],
    [  0,   0,   0,   0,   0,   0],
], dtype=float)

# Horizontal edge detector
kernel = np.array([
    [ 1,  1,  1],
    [ 0,  0,  0],
    [-1, -1, -1],
], dtype=float)

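# Manual convolution (no padding, stride=1) → 4×4 output.
# Strictly this is cross-correlation: the kernel is not flipped,
# which is also what deep learning frameworks call "convolution".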
n, k = 6, 3
out = np.zeros((n-k+1, n-k+1))
for i in range(n-k+1):
    for j in range(n-k+1):
        out[i, j] = np.sum(img[i:i+k, j:j+k] * kernel)

print("Convolution output (4×4):")
print(out)

text

Convolution output (4×4):
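[[-765. -765. -765. -765.]
 [-765. -765. -765. -765.]
 [ 765.  765.  765.  765.]
 [ 765.  765.  765.  765.]]

Rows 1 and 2 (−765): dark-to-bright transition going down. The kernel's +1 row sits on dark pixels and its −1 row on bright pixels. Rows 3 and 4 (+765): bright-to-dark transition, with the +1 row on bright pixels and the −1 row on dark. No row is zero here because every 3×3 window in this small image straddles one of the two edges; a window over a uniform region would give 0. The sign distinguishes the two edges, so the filter has detected both.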
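The double loop is easy to read but slow on real images. One possible vectorized alternative, sketched here under the assumption that NumPy 1.20 or newer is available (for `sliding_window_view`), extracts every 3×3 window at once and contracts it with the kernel. The loop is repeated only to confirm that both versions agree.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

img = np.zeros((6, 6))
img[2:4, :] = 255
kernel = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)

# Loop version, same logic as the code above
loop_out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        loop_out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)

# Vectorized version: all 3×3 windows as one array of shape (4, 4, 3, 3)
windows = sliding_window_view(img, kernel.shape)
# Multiply each window by the kernel and sum over the window axes
vec_out = np.einsum("ijkl,kl->ij", windows, kernel)

print(windows.shape)
print(np.array_equal(loop_out, vec_out))
```

In practice a framework's convolution layer would do this work; the sketch is only meant to show that the sliding window is a single array operation.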

Where this builds from: The ANN (section 2, post 03) introduced fully connected layers. CNNs replace those layers with convolutional layers for the feature extraction part, keeping the fully connected layers only at the end. The parameter explosion with images is the motivating problem that CNNs solve.

Where this leads: The convolution operation in detail (next few posts) — how the sliding window works, the output size formula, padding, stride. Then pooling and the full pipeline (post 09) with an RGB example.

Honest Limitations

CNNs require large labeled image datasets. As a rough rule of thumb, a CNN trained from scratch tends to perform poorly with fewer than ~1,000 labeled images per class, because the filters need enough training signal to learn meaningful representations. Pre-trained models (for example, ImageNet weights) allow effective transfer to small datasets.

CNNs are not suitable for non-spatial data. Tabular data (age, salary, credit score) has no meaningful notion of "nearby features." Applying a convolution filter to tabular data treats adjacent columns as correlated, which they may not be. ANNs or gradient boosting are better suited.

Test Your Understanding

A 100×100 grayscale image fed into an ANN with one 200-neuron hidden layer requires how many parameters in the first weight matrix? A CNN with one 3×3 filter requires how many parameters for the same task? What is the ratio?
The horizontal edge detector kernel [[1,1,1],[0,0,0],[−1,−1,−1]] applied to the anchor image gives −765 at the dark-to-bright transition and +765 at the bright-to-dark transition (reading top to bottom). What would a vertical edge detector kernel look like? Apply it to the anchor image, where every row has the same value in every column. What is the output?
Translation invariance means the same filter detects the same feature regardless of where it appears in the image. Name one vision task where translation invariance is actually unwanted — where the position of a feature matters for the task output.
An ANN with 784 inputs (28×28 MNIST image) is trained to classify digits. The model achieves 97% accuracy. If you tile the same digit at a different position in a 56×56 image, the ANN will likely fail. Explain precisely why, using the definition of translation invariance.
A 3×3 filter applied to a 6×6 image without padding produces a 4×4 output. How many times is each input pixel used in the convolution? Count for a corner pixel, an edge pixel, and a center pixel. What does this asymmetry mean for what the model learns about image borders?
