CNN vs ANN Operations
The difference between an ANN and a CNN is not just "one is for images." They are fundamentally different computational structures. Understanding why CNN operations are efficient — and when they are not — determines which architecture to choose for a given task.
Anchor: classify a 4×4 grayscale image (16 pixels) into 2 classes.
ANN Parameter Count
ANN with 1 hidden layer: 16 inputs → 8 neurons → 2 outputs.
Hidden layer (W1): 16 inputs × 8 neurons = 128 weights + 8 biases = 136 parameters
Output layer (W2): 8 inputs × 2 outputs = 16 weights + 2 biases = 18 parameters
Total: 154 parameters
Every input pixel connects to every hidden neuron. No two connections share a weight. The network must independently learn that a bright pixel in the top-left corner and a bright pixel in the bottom-right corner are both evidence of the same feature.
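As a rough illustration of the count above, the sketch below builds the 16 → 8 → 2 network as plain NumPy arrays and runs one forward pass. The ReLU activation, the random initial values, and the name `ann_forward` are assumptions made for this example; the post does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Anchor network: 16 inputs -> 8 hidden neurons -> 2 outputs
W1 = rng.normal(size=(8, 16))   # one row of 16 independent weights per hidden neuron
b1 = np.zeros(8)
W2 = rng.normal(size=(2, 8))
b2 = np.zeros(2)

def ann_forward(image):
    """Forward pass of a 4x4 image through the dense network."""
    x = image.reshape(-1)              # flatten: the 2D layout is discarded here
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU (assumed activation)
    return W2 @ h + b2                 # two class scores

image = rng.random((4, 4))
scores = ann_forward(image)
n_params = W1.size + b1.size + W2.size + b2.size
print(scores.shape, n_params)  # (2,) 154
```

Each of the 128 entries in `W1` ties one specific pixel to one specific neuron, which is why nothing learned at one position transfers to another.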
CNN Parameter Count
CNN: one 3×3 conv with 2 filters → (optional pool) → FC.
Conv layer: One 3×3 filter has 3×3 = 9 weights. With 2 filters: 9×2 = 18 weights + 2 biases = 20 parameters
The filter is applied at every position. The same 9 weights are reused across all 4 positions (for a 4×4 input with 3×3 filter, no padding → 2×2 output). This is parameter sharing.
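The reuse of the same 9 weights can be sketched with a naive loop. This is a minimal illustration, assuming a square image, stride 1, and no padding; `conv2d_valid` and the all-ones filter are hypothetical choices for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one square kernel over a square image (stride 1, no padding)."""
    n, k = image.shape[0], kernel.shape[0]
    out_size = n - k + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel)  # the same 9 weights at every (i, j)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))           # hypothetical filter: sums each 3x3 patch
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)           # (2, 2): four positions
print(kernel.size)                 # 9 weights, shared by all four positions
```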
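After the conv layer, the output is 2×2×2 = 8 values. A 2×2 max pool would reduce each 2×2 feature map to 1 value, leaving 2 values (one per filter). The count below instead skips the pooling and flattens the 2×2×2 conv output into 8 values, which become the input to the FC layer.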
FC layer: 8 inputs × 2 outputs = 16 weights + 2 biases = 18 parameters
Total: 20 + 18 = 38 parameters
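The arithmetic above can be checked with a small helper that derives the FC input size from the conv output size. This is a sketch under stated assumptions: square inputs, stride 1, no padding, and a non-overlapping pooling window. The names `conv_out_size` and `cnn_param_count` are made up for this example.

```python
def conv_out_size(n, k, padding=0, stride=1):
    """Spatial side length after a k x k convolution on an n x n input."""
    return (n + 2 * padding - k) // stride + 1

def cnn_param_count(n, k, n_filters, n_classes, pool=None, in_ch=1):
    """Parameters of conv -> optional max pool -> flatten -> FC."""
    conv = k * k * in_ch * n_filters + n_filters   # shared weights + one bias per filter
    side = conv_out_size(n, k)
    if pool is not None:
        side = side // pool                        # pooling has no parameters of its own
    fc_in = side * side * n_filters
    fc = fc_in * n_classes + n_classes
    return conv, fc, conv + fc

print(cnn_param_count(4, 3, 2, 2))           # (20, 18, 38): flatten without pooling
print(cnn_param_count(4, 3, 2, 2, pool=2))   # (20, 6, 26): with 2x2 max pooling
```

Pooling changes only the FC layer's size; the conv layer stays at 20 parameters either way.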
Same task. 4× fewer parameters. 154 vs 38.
Why Parameter Sharing is Powerful
Scale this to a real image:
ANN first layer, 224×224 RGB → 100 neurons: 224 × 224 × 3 = 150,528 inputs × 100 neurons = 15,052,800 weights + 100 biases
CNN first layer, 64 filters of size 3×3×3: 3 × 3 × 3 × 64 = 1,728 weights + 64 biases
The CNN needs roughly 8,700× fewer weights in the first layer (about 8,400× fewer parameters once biases are counted). Each filter detects the same feature (e.g., a vertical edge) everywhere in the image. The ANN must independently learn that a vertical edge at pixel (0,0) and a vertical edge at pixel (200,200) are the same thing.
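To make "the same feature everywhere" concrete, the sketch below applies one vertical-edge filter to two images whose edge sits at different columns. The filter values (a Prewitt-style kernel), the 8×8 image size, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive sliding-window filter, stride 1, no padding."""
    k = kernel.shape[0]
    rows = image.shape[0] - k + 1
    cols = image.shape[1] - k + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Responds where brightness increases from left to right
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

def step_image(edge_col, size=8):
    """Dark on the left, bright from edge_col onward."""
    img = np.zeros((size, size))
    img[:, edge_col:] = 1.0
    return img

resp_a = conv2d_valid(step_image(3), vertical_edge)
resp_b = conv2d_valid(step_image(5), vertical_edge)
print(resp_a[0])  # peaks at output columns 1 and 2
print(resp_b[0])  # same peaks, moved to output columns 3 and 4
```

Moving the edge two columns to the right moves the response two columns to the right, with no extra weights. A dense layer has no such guarantee.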
Key Operational Differences
| Operation | ANN | CNN |
|---|---|---|
| Layer type | Dense (fully connected) | Convolutional + pooling |
| Input treatment | Flattened 1D vector | Spatial tensor (H×W×C) |
| Weight reuse | Each weight used once | Filter reused at every position |
| Spatial structure | Ignored | Preserved and exploited |
| Inductive bias | No spatial prior | Translation equivariance (invariance after pooling) |
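| Parameter growth with image size | O(n²) in side length n (linear in pixel count) | O(k² · channels) per filter, independent of image size |
"O(n²) parameter growth" is measured in the image side length n: doubling the width and height gives 4× the pixels, and so 4× the first-layer weights for an ANN with a fixed number of hidden neurons. For a CNN, the filter size is fixed, so doubling the image size has no effect on the conv layer's parameter count. Only the number of convolution computations grows. (A fully connected layer placed after flattening still grows with the feature-map size.)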
When to Use ANN vs CNN
Use ANN for tabular data — age, salary, credit score, medical measurements. These features have no meaningful spatial relationship. Pixel (row=5, col=3) of an image is the neighbor of pixel (row=5, col=4). But feature column 5 of a tabular dataset is not spatially related to feature column 4.
Use CNN for images, audio spectrograms, and any data where nearby inputs share meaningful structure. 1D CNNs work for time series. 2D CNNs work for images and spectrograms.
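The same idea in one dimension, as a minimal sketch: a two-tap filter that responds to a rising step in a time series, wherever the step occurs. The signals and the name `rise_detector` are invented for this example. `np.correlate` is used because deep-learning "convolution" is a sliding dot product without kernel flipping.

```python
import numpy as np

# Two-tap filter: positive response where the signal steps up
rise_detector = np.array([-1.0, 1.0])

early = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)  # step up at index 3
late = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)   # step up at index 5

# mode="valid" applies the filter only where it fully overlaps the signal
resp_early = np.correlate(early, rise_detector, mode="valid")
resp_late = np.correlate(late, rise_detector, mode="valid")

print(int(np.argmax(resp_early)), int(np.argmax(resp_late)))  # 2 4
```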
Hybrid: CNN for feature extraction (conv layers learn spatial features), ANN (fully connected layers) for classification head. This is the standard architecture for image classifiers: VGG, ResNet, EfficientNet all use this pattern.
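The hybrid pattern can be sketched at the scale of the 4×4 anchor example: a conv stage produces features, and a dense head turns them into class scores. This is an illustrative NumPy forward pass only, not how VGG or ResNet are implemented. The ReLU activation, the random weights, and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

filters = rng.normal(size=(2, 3, 3))   # feature extractor: 2 filters of 3x3
conv_bias = np.zeros(2)
W_fc = rng.normal(size=(2, 8))         # classification head: 8 features -> 2 classes
b_fc = np.zeros(2)

def extract_features(image):
    """Conv stage: apply every filter at every position, then ReLU."""
    k = filters.shape[1]
    out = image.shape[0] - k + 1
    fmap = np.zeros((filters.shape[0], out, out))
    for f in range(filters.shape[0]):
        for i in range(out):
            for j in range(out):
                patch = image[i:i + k, j:j + k]
                fmap[f, i, j] = np.sum(patch * filters[f]) + conv_bias[f]
    return np.maximum(0.0, fmap)

def classify(image):
    """Dense head on the flattened conv features."""
    features = extract_features(image).reshape(-1)   # 2 x 2 x 2 -> 8 values
    return W_fc @ features + b_fc

image = rng.random((4, 4))
logits = classify(image)
n_params = filters.size + conv_bias.size + W_fc.size + b_fc.size
print(logits.shape, n_params)  # (2,) 38
```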
Code
def ann_params(input_size, hidden, output):
    """Count and print the parameters of a one-hidden-layer dense network."""
    w1 = input_size * hidden
    b1 = hidden
    w2 = hidden * output
    b2 = output
    total = w1 + b1 + w2 + b2
    print(f"  W1: {input_size}×{hidden} = {w1}, b1: {b1} → {w1+b1}")
    print(f"  W2: {hidden}×{output} = {w2}, b2: {b2} → {w2+b2}")
    print(f"  Total: {total}")
    return total


def cnn_params(k, in_ch, n_filters, fc_in, output):
    """Count and print the parameters of one conv layer followed by one FC layer."""
    conv = k * k * in_ch * n_filters + n_filters
    fc = fc_in * output + output
    total = conv + fc
    print(f"  Conv: {k}×{k}×{in_ch}×{n_filters} = {k*k*in_ch*n_filters} weights + {n_filters} biases = {conv}")
    print(f"  FC: {fc_in}×{output} = {fc_in*output} weights + {output} biases = {fc}")
    print(f"  Total: {total}")
    return total


img_size = 16  # 4×4 flattened
print("ANN (16 → 8 → 2):")
ann = ann_params(img_size, 8, 2)
print("\nCNN (3×3 conv, 2 filters, 2×2×2 → FC → 2):")
cnn = cnn_params(k=3, in_ch=1, n_filters=2, fc_in=8, output=2)
print(f"\nReduction: {ann/cnn:.1f}×")

# Scale to real image
print("\n--- Scale to 224×224 RGB ---")
print("ANN first layer (150528 → 100):")
ann_real = 150528 * 100 + 100
print(f"  {150528 * 100:,} weights + 100 biases = {ann_real:,}")
print("CNN first layer (64 filters, 3×3×3):")
cnn_weights = 3 * 3 * 3 * 64
cnn_real = cnn_weights + 64  # weights plus one bias per filter
print(f"  3×3×3×64 = {cnn_weights:,} weights + 64 biases = {cnn_real:,}")
print(f"  Reduction: {ann_real/cnn_real:.0f}×")

Output:

ANN (16 → 8 → 2):
  W1: 16×8 = 128, b1: 8 → 136
  W2: 8×2 = 16, b2: 2 → 18
  Total: 154

CNN (3×3 conv, 2 filters, 2×2×2 → FC → 2):
  Conv: 3×3×1×2 = 18 weights + 2 biases = 20
  FC: 8×2 = 16 weights + 2 biases = 18
  Total: 38

Reduction: 4.1×

--- Scale to 224×224 RGB ---
ANN first layer (150528 → 100):
  15,052,800 weights + 100 biases = 15,052,900
CNN first layer (64 filters, 3×3×3):
  3×3×3×64 = 1,728 weights + 64 biases = 1,792
  Reduction: 8400×

Related Concepts
Where this builds from: ANN architecture (section 2, post 03) introduced fully connected layers. The convolution operation (post 04) is the mechanism that makes parameter sharing possible.
Where this leads: Pooling (next post) reduces spatial dimensions in the CNN, further reducing the parameter count before the FC layers. The full pipeline (post 09) with an RGB example shows all these components working together.
Honest Limitations
CNN's inductive bias — translation invariance — can hurt on non-spatial data. Applying a CNN to tabular data treats adjacent columns as spatially related, which they may not be. A feature at column 5 of a customer record is not the "neighbor" of column 4 in any meaningful sense. ANNs are better for tabular data.
Parameter sharing only pays off when a feature means the same thing wherever it appears. In passport photos, faces are always upright and centered, so position itself carries information and the assumption is weaker. If the task requires spatial precision ("where is the car's license plate in this image?"), translation invariance becomes a limitation, because pooling discards the position information the task needs.
CNNs require more architecture decisions than ANNs. Number of filters, filter size, stride, padding, pooling type: each is a hyperparameter. A plain ANN mainly requires choosing the number of layers and the neurons per layer. The richer structure of CNNs makes them more powerful but also more complex to tune.
Test Your Understanding
- An ANN with architecture 64 → 32 → 16 → 2 is used to classify 8×8 grayscale images. Compute the total parameter count. Now design a CNN with: one 3×3 conv layer with 8 filters, followed by 2×2 max pooling, followed by a fully connected layer to 2 outputs. Compute the CNN's total parameter count and the reduction ratio.
- A CNN filter learns to detect diagonal edges (top-left to bottom-right). Because of parameter sharing, this filter will detect the same diagonal edge anywhere in the image. An ANN, by contrast, would need separate weights to detect the same feature at each position. For a 10×10 image, how many separate weights would an ANN first layer need to encode this one filter's functionality?
- You are building a model to classify tabular data with 20 features (age, income, credit score, etc.). A colleague suggests using a 1D CNN. What is the main objection to this approach? Under what conditions might a 1D CNN still be appropriate for tabular data?
- ResNet-50 has ~25 million parameters but achieves 76% ImageNet accuracy. A vanilla ANN with the same number of parameters (say, 224×224×3 → hidden layers → 1000 outputs) would likely achieve much lower accuracy. Why? Name at least two reasons specific to the CNN architecture that explain this performance gap.
- A CNN's O(k² × channels) parameter count is independent of image size. But the computation (number of multiply-adds per forward pass) does scale with image size. For a 3×3 conv with 64 filters processing a 224×224 RGB image, compute the number of multiply-adds per filter at each position, and the total multiply-adds for the entire layer.