CNN vs ANN Operations

Jul 1, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The difference between an ANN and a CNN is not just "one is for images." They are fundamentally different computational structures. Understanding why CNN operations are efficient — and when they are not — determines which architecture to choose for a given task.

Anchor: classify a 4×4 grayscale image (16 pixels) into 2 classes.


ANN Parameter Count

ANN with 1 hidden layer: 16 inputs → 8 neurons → 2 outputs.

Hidden layer (W1): 16 inputs × 8 neurons = 128 weights + 8 biases = 136 parameters

Output layer (W2): 8 inputs × 2 outputs = 16 weights + 2 biases = 18 parameters

Total: 154 parameters

Every input pixel connects to every hidden neuron. No two connections share a weight. The network must independently learn that a bright pixel in the top-left corner and a bright pixel in the bottom-right corner are both evidence of the same feature.
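As a rough illustration of the count above, the sketch below builds the two weight matrices with NumPy and counts their entries. The random initialization, the ReLU activation, and the names `W1`, `b1`, `W2`, `b2`, and `dense_forward` are assumptions made for the example, not something the post prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Shapes follow the 16 -> 8 -> 2 network described above.
W1, b1 = rng.normal(size=(16, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def dense_forward(image):
    """Flatten a 4x4 image and push it through both dense layers."""
    x = image.reshape(-1)               # 4x4 -> 16; the spatial layout is discarded here
    h = np.maximum(0.0, x @ W1 + b1)    # ReLU is an assumed activation
    return h @ W2 + b2                  # 2 class scores

n_params = sum(a.size for a in (W1, b1, W2, b2))
scores = dense_forward(rng.random((4, 4)))
print(n_params, scores.shape)           # 154 (2,)
```

Each of the 128 entries in `W1` belongs to exactly one (pixel, neuron) pair, which is why nothing learned at one pixel transfers to another.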

[Figure: ANN (all connections unique) vs CNN (filter shared across positions). ANN: 16 inputs fully connected to 8 neurons, 128 unique first-layer weights, 154 parameters in total. CNN: a 3×3 filter bank (3×3×2 = 18 weights) reused at every 3×3 patch of the 4×4 input, producing 2×2×2 feature maps, then an FC layer to 2 outputs. Total: Conv 20 + FC 18 = 38 parameters, about 4× fewer.]

CNN Parameter Count

CNN: one 3×3 conv with 2 filters → flatten → FC. Pooling is left out of this count.

Conv layer: One 3×3 filter has 3×3 = 9 weights. With 2 filters: 9×2 = 18 weights + 2 biases = 20 parameters

The filter is applied at every position. The same 9 weights are reused across all 4 positions (for a 4×4 input with 3×3 filter, no padding → 2×2 output). This is parameter sharing.
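The sliding-window behavior described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming stride 1 and no padding. The helper name `conv2d_valid`, the placeholder image, and the averaging kernel are all made up for the example. Like most deep-learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over the image with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # the same 9 weights at every (i, j)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # placeholder 4x4 "image"
kernel = np.ones((3, 3)) / 9.0                      # placeholder averaging filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (2, 2): four positions, one shared kernel
```

The loop visits four positions, but `kernel` is the only set of weights involved, so the parameter count stays at 9 per filter no matter how many positions there are.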

The conv output is 2×2×2. Applying MaxPool(2×2) would leave 1 value per filter, i.e. 2 values in total. This count instead skips pooling and flattens the 2×2×2 conv output into 8 values, which is what feeds the FC layer.

FC layer: 8 inputs × 2 outputs = 16 weights + 2 biases = 18 parameters

Total: 20 + 18 = 38 parameters
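To check the total against actual arrays, here is a small NumPy sketch of the conv → flatten → FC pipeline counted above. The random weights, the missing activation function, and the names `filters`, `fc_W`, and `tiny_cnn` are assumptions for illustration only. The last lines work out what the count would be if a 2×2 max pool were inserted before the FC layer.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

filters = rng.normal(size=(2, 3, 3))   # 2 filters, 9 weights each
conv_b = np.zeros(2)
fc_W = rng.normal(size=(8, 2))         # 8 flattened values -> 2 outputs
fc_b = np.zeros(2)

def tiny_cnn(image):
    """conv (stride 1, no padding) -> flatten -> fully connected."""
    maps = np.zeros((2, 2, 2))         # (filter, row, col)
    for f in range(2):
        for i in range(2):
            for j in range(2):
                patch = image[i:i + 3, j:j + 3]
                maps[f, i, j] = np.sum(patch * filters[f]) + conv_b[f]
    flat = maps.reshape(-1)            # 2x2x2 -> 8 values
    return flat @ fc_W + fc_b

n_params = filters.size + conv_b.size + fc_W.size + fc_b.size
scores = tiny_cnn(rng.random((4, 4)))
print(n_params, scores.shape)          # 38 (2,)

# With a 2x2 max pool before the FC layer, the FC layer would see 2 inputs, not 8.
pooled_total = (filters.size + conv_b.size) + (2 * 2 + 2)
print(pooled_total)                    # 26
```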

Same task. 4× fewer parameters. 154 vs 38.


Why Parameter Sharing is Powerful

Scale this to a real image:

ANN first layer, 224×224 RGB → 100 neurons: 224 × 224 × 3 = 150,528 inputs × 100 neurons = 15,052,800 weights + 100 biases

CNN first layer, 64 filters of size 3×3×3: 3 × 3 × 3 × 64 = 1,728 weights + 64 biases

The CNN needs roughly 8,700× fewer weights in the first layer (about 8,400× once biases are included). Each filter detects the same feature (e.g., a vertical edge) everywhere in the image. The ANN must independently learn that a vertical edge at pixel (0,0) and a vertical edge at pixel (200,200) are the same thing.
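The "same feature everywhere" claim can be checked with a hand-written edge filter. In this sketch the kernel values, the 8×8 test images, and the helper names `conv2d_valid` and `image_with_edge` are all assumptions chosen for the demonstration. A trained filter would be learned, not written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, no-padding sliding window (cross-correlation)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Responds to a dark-to-bright transition going left to right.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

def image_with_edge(col, size=8):
    """Dark on the left, bright from column `col` onward."""
    img = np.zeros((size, size))
    img[:, col:] = 1.0
    return img

resp_a = conv2d_valid(image_with_edge(3), edge_kernel)
resp_b = conv2d_valid(image_with_edge(5), edge_kernel)

peak_a = int(np.argmax(resp_a[0]))   # first column with the strongest response
peak_b = int(np.argmax(resp_b[0]))
print(peak_a, peak_b)                # 1 3
```

Moving the edge two columns to the right moves the response two columns to the right, with the same strength and the same 9 weights. A dense layer would need a separate set of weights for each location.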


Key Operational Differences

| Operation | ANN | CNN |
| --- | --- | --- |
| Layer type | Dense (fully connected) | Convolutional + pooling |
| Input treatment | Flattened 1D vector | Spatial tensor (H×W×C) |
| Weight reuse | Each weight used once | Filter reused at every position |
| Spatial structure | Ignored | Preserved and exploited |
| Inductive bias | None | Translation invariance |
| Parameter growth with image size | O(n²), quadratic | O(k² · channels), constant |

"O(n²) parameter growth" means that, for an n×n image and a fixed number of hidden neurons, doubling the side length n gives 4× the first-layer weights for an ANN. For a CNN the filter size is fixed, so doubling the image size has no effect on the conv layer's parameter count. Only the number of convolution computations grows.
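The scaling behavior can be tabulated directly. The three helper functions below are hypothetical names introduced for this sketch. The defaults (3 channels, 100 hidden neurons, 64 filters of size 3×3) reuse the numbers from the 224×224 example, and the multiply-add count assumes stride 1 and no padding.

```python
def dense_first_layer_params(side, channels=3, hidden=100):
    """Weights + biases when every pixel feeds every hidden neuron."""
    return side * side * channels * hidden + hidden

def conv_first_layer_params(k=3, channels=3, n_filters=64):
    """Weights + biases of a conv layer; `side` does not appear at all."""
    return k * k * channels * n_filters + n_filters

def conv_multiply_adds(side, k=3, channels=3, n_filters=64):
    """Multiply-adds per forward pass, assuming stride 1 and no padding."""
    out_side = side - k + 1
    return out_side * out_side * k * k * channels * n_filters

print(f"{'side':>5} {'dense params':>14} {'conv params':>12} {'conv mult-adds':>15}")
for side in (32, 64, 128):
    print(f"{side:>5} {dense_first_layer_params(side):>14,} "
          f"{conv_first_layer_params():>12,} {conv_multiply_adds(side):>15,}")
```

The dense column grows by about 4× with each doubling, the conv parameter column stays fixed, and the multiply-add column is where the CNN pays for larger images.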


When to Use ANN vs CNN

Use ANN for tabular data — age, salary, credit score, medical measurements. These features have no meaningful spatial relationship. Pixel (row=5, col=3) of an image is the neighbor of pixel (row=5, col=4). But feature column 5 of a tabular dataset is not spatially related to feature column 4.

Use CNN for images, audio spectrograms, and any data where nearby inputs share meaningful structure. 1D CNNs work for time series. 2D CNNs work for images and spectrograms.
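For the 1D case, the same idea fits in a few lines: one small kernel slides along a sequence. The series and the two-weight "rising step" kernel below are invented for illustration. `np.correlate` is used because it slides the kernel without flipping it, which matches what deep-learning libraries call convolution.

```python
import numpy as np

# Made-up time series with two upward steps.
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

# Hand-written detector: fires when the next value is higher than the current one.
kernel = np.array([-1.0, 1.0])

response = np.correlate(signal, kernel, mode="valid")   # length 10 - 2 + 1 = 9
rises = np.flatnonzero(response > 0.5)
print(rises)   # [1 6]
```

Both upward steps are found with the same two weights, regardless of where in the series they occur.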

Hybrid: CNN for feature extraction (conv layers learn spatial features), ANN (fully connected layers) for classification head. This is the standard architecture for image classifiers: VGG, ResNet, EfficientNet all use this pattern.


Code

```python
def ann_params(input_size, hidden, output):
    """Count weights and biases of a one-hidden-layer fully connected network."""
    w1 = input_size * hidden   # every input connects to every hidden neuron
    b1 = hidden
    w2 = hidden * output
    b2 = output
    total = w1 + b1 + w2 + b2
    print(f"  W1: {input_size}×{hidden} = {w1}, b1: {b1} → {w1+b1}")
    print(f"  W2: {hidden}×{output} = {w2}, b2: {b2} → {w2+b2}")
    print(f"  Total: {total}")
    return total

def cnn_params(k, in_ch, n_filters, fc_in, output):
    """Count parameters of one k×k conv layer followed by one FC layer."""
    conv_w = k * k * in_ch * n_filters   # shared across all spatial positions
    conv = conv_w + n_filters            # one bias per filter
    fc = fc_in * output + output
    total = conv + fc
    print(f"  Conv: {k}×{k}×{in_ch}×{n_filters} = {conv_w} weights + {n_filters} biases = {conv}")
    print(f"  FC: {fc_in}×{output} = {fc_in*output} weights + {output} biases = {fc}")
    print(f"  Total: {total}")
    return total

img_size = 16  # 4×4 flattened
print("ANN (16 → 8 → 2):")
ann = ann_params(img_size, 8, 2)

print("\nCNN (3×3 conv, 2 filters, 2×2×2 → FC → 2):")
cnn = cnn_params(k=3, in_ch=1, n_filters=2, fc_in=8, output=2)

print(f"\nReduction: {ann/cnn:.1f}×")

# Scale to real image (first layer only)
print("\n--- Scale to 224×224 RGB ---")
print("ANN first layer (150528 → 100):")
ann_real = 150528 * 100 + 100
print(f"  {150528 * 100:,} weights + 100 biases = {ann_real:,}")
print("CNN first layer (64 filters, 3×3×3):")
cnn_real_w = 3 * 3 * 3 * 64
cnn_real = cnn_real_w + 64
print(f"  3×3×3×64 = {cnn_real_w:,} weights + 64 biases = {cnn_real:,}")
print(f"  Reduction: {ann_real/cnn_real:.0f}×")
```

```text
ANN (16 → 8 → 2):
  W1: 16×8 = 128, b1: 8 → 136
  W2: 8×2 = 16, b2: 2 → 18
  Total: 154

CNN (3×3 conv, 2 filters, 2×2×2 → FC → 2):
  Conv: 3×3×1×2 = 18 weights + 2 biases = 20
  FC: 8×2 = 16 weights + 2 biases = 18
  Total: 38

Reduction: 4.1×

--- Scale to 224×224 RGB ---
ANN first layer (150528 → 100):
  15,052,800 weights + 100 biases = 15,052,900
CNN first layer (64 filters, 3×3×3):
  3×3×3×64 = 1,728 weights + 64 biases = 1,792
  Reduction: 8400×
```

Where this builds from: ANN architecture (section 2, post 03) introduced fully connected layers. The convolution operation (post 04) is the mechanism that makes parameter sharing possible.

Where this leads: Pooling (next post) reduces spatial dimensions in the CNN, further reducing the parameter count before the FC layers. The full pipeline (post 09) with an RGB example shows all these components working together.


Honest Limitations

CNN's inductive bias — translation invariance — can hurt on non-spatial data. Applying a CNN to tabular data treats adjacent columns as spatially related, which they may not be. A feature at column 5 of a customer record is not the "neighbor" of column 4 in any meaningful sense. ANNs are better for tabular data.

Parameter sharing pays off only when translation invariance actually holds, that is, when the same feature can plausibly appear anywhere in the image. In passport photos, faces must be upright and centered, so position itself carries information and the assumption fails. If the task requires spatial precision ("where is the car's license plate in this image?"), translation invariance becomes a limitation.
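The localization problem is easiest to see with pooling taken to its extreme. The sketch below uses two made-up 4×4 feature maps and a hypothetical `global_max_pool` helper. It stands in for whatever pooling a real network would use and is meant only to show what information is lost.

```python
import numpy as np

def global_max_pool(feature_map):
    """Keep only the strongest response; its location is thrown away."""
    return feature_map.max()

# Identical activation, different location.
top_left = np.zeros((4, 4))
top_left[0, 0] = 1.0
bottom_right = np.zeros((4, 4))
bottom_right[3, 3] = 1.0

same_after_pooling = global_max_pool(top_left) == global_max_pool(bottom_right)

# Before pooling, the location is still recoverable.
where_a = np.unravel_index(np.argmax(top_left), top_left.shape)
where_b = np.unravel_index(np.argmax(bottom_right), bottom_right.shape)
print(same_after_pooling, where_a, where_b)
```

After pooling, a classifier cannot tell the two inputs apart. That is helpful for "is there a car?" and unhelpful for "where is the license plate?".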

CNNs require more architecture decisions than ANNs. Number of filters, filter size, stride, padding, pooling type — each is a hyperparameter. ANNs only require the number of layers and neurons per layer. The richer structure of CNNs makes them more powerful but also more complex to tune.


Test Your Understanding

  1. An ANN with architecture 64 → 32 → 16 → 2 is used to classify 8×8 grayscale images. Compute the total parameter count. Now design a CNN with: one 3×3 conv layer with 8 filters, followed by 2×2 max pooling, followed by a fully connected layer to 2 outputs. Compute the CNN's total parameter count and the reduction ratio.

  2. A CNN filter learns to detect diagonal edges (top-left to bottom-right). Because of parameter sharing, this filter will detect the same diagonal edge anywhere in the image. An ANN, by contrast, would need separate weights to detect the same feature at each position. For a 10×10 image, how many separate weights would an ANN first layer need to encode this one filter's functionality?

  3. You are building a model to classify tabular data with 20 features (age, income, credit score, etc.). A colleague suggests using a 1D CNN. What is the main objection to this approach? Under what conditions might a 1D CNN still be appropriate for tabular data?

  4. ResNet-50 has ~25 million parameters and achieves about 76% top-1 ImageNet accuracy. A vanilla ANN with the same number of parameters (say, 224×224×3 → hidden layers → 1000 outputs) would likely achieve much lower accuracy. Why? Name at least two reasons specific to the CNN architecture that explain this performance gap.

  5. A CNN's O(k² × channels) parameter count is independent of image size. But the computation (number of multiply-adds per forward pass) does scale with image size. For a 3×3 conv with 64 filters processing a 224×224 RGB image, compute the number of multiply-adds per filter at each position, and the total multiply-adds for the entire layer.
