CNN Example with RGB

Jul 3, 2026 · 9 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Every CNN example so far in this series has used a single-channel (grayscale) image. Real photos are RGB — three channels stacked together — and a filter that only knows how to slide over one 2D grid can't touch that directly. A 3×3 filter for an RGB image isn't three separate 3×3 filters, one per channel; it's a single 3×3×3 kernel that looks at all three channels at once and collapses them into one number per position. Understanding that collapse is the last piece needed before a full CNN pipeline makes sense end to end.

Anchor: a 4×4 RGB image (3 channels, values generated once and fixed for this post), used for a binary classification task.


RGB Input Tensor

Shape: 3×4×4 (channels × height × width, PyTorch convention). Each channel is its own 4×4 matrix:

```text
R channel:                    G channel:                    B channel:
0.4000 0.7020 0.3608 0.0549   0.3882 0.4039 0.5922 0.5098   0.7961 0.2235 0.0824 0.9882
0.4157 0.2784 0.7373 0.0784   0.5843 0.2039 0.0039 0.3412   0.9216 0.3451 0.1882 0.8549
0.4000 0.4745 0.8235 0.8392   0.9216 0.6157 0.1451 0.5059   0.2275 0.9961 0.6627 0.8588
0.2902 0.7922 0.3412 0.4549   0.7490 0.7333 0.0784 0.6275   0.7333 0.8118 0.0549 0.7412
```

Three independent grids, same spatial position across all three — pixel (0,0) has an R value, a G value, and a B value, together describing one point in the image.

Figure: 4×4 RGB image as 3 stacked channels (R, G, B), shape 3×4×4 (C, H, W). Same spatial grid, 3 values per pixel position.
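
To make the "three values per pixel" idea concrete, here is a minimal sketch that rebuilds the tensor from the three matrices printed above and reads pixel (0,0) across channels. The variable names (`R`, `G`, `B`, `X`, `X_hwc`) are illustrative choices, not part of the original post.

```python
import numpy as np

# Channel values copied from the matrices above.
R = np.array([[0.4000, 0.7020, 0.3608, 0.0549],
              [0.4157, 0.2784, 0.7373, 0.0784],
              [0.4000, 0.4745, 0.8235, 0.8392],
              [0.2902, 0.7922, 0.3412, 0.4549]])
G = np.array([[0.3882, 0.4039, 0.5922, 0.5098],
              [0.5843, 0.2039, 0.0039, 0.3412],
              [0.9216, 0.6157, 0.1451, 0.5059],
              [0.7490, 0.7333, 0.0784, 0.6275]])
B = np.array([[0.7961, 0.2235, 0.0824, 0.9882],
              [0.9216, 0.3451, 0.1882, 0.8549],
              [0.2275, 0.9961, 0.6627, 0.8588],
              [0.7333, 0.8118, 0.0549, 0.7412]])

# Stack along a new leading axis: (C, H, W) = (3, 4, 4).
X = np.stack([R, G, B], axis=0)
print("shape (C, H, W):", X.shape)

# One pixel = one value from each channel at the same (row, col).
pixel_00 = X[:, 0, 0]
print("pixel (0,0) as [R, G, B]:", pixel_00)

# Channels-last view (H, W, C): the same data, axes reordered.
X_hwc = X.transpose(1, 2, 0)
print("shape (H, W, C):", X_hwc.shape)
```

Indexing `X[:, 0, 0]` in the channels-first layout and `X_hwc[0, 0]` in the channels-last layout return the same three numbers; only the axis order differs.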

Multi-Channel Convolution

One filter for a 3-channel input is one 3×3×3 kernel: 27 weights, not 3 separate 9-weight filters. The operation: slide the kernel's R slice over the R channel, the G slice over the G channel, and the B slice over the B channel. At each position, multiply each 3×3 patch element-wise by its kernel slice, sum the 9 products within each channel, then add the three per-channel sums into a single scalar.

Position (0,0):

R_patch × K_R, summed: -0.1146
G_patch × K_G, summed: 0.0011
B_patch × K_B, summed: -0.1601

Output[0,0] = R_sum + G_sum + B_sum = -0.1146 + 0.0011 + (-0.1601) = -0.2736

(This filter uses no bias — bias = 0 for this example, so the sum of the three channel contributions is the final value.)
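
The position (0,0) arithmetic can be sketched in a few lines. This assumes the same seed and the same order of random calls as the post's Code section; changing either would produce different numbers. The names `patch`, `channel_sums`, and `output_00` are introduced here for illustration.

```python
import numpy as np

# Same seed and call order as the post's Code section (an assumption:
# changing either produces different numbers).
np.random.seed(42)
X = np.random.randint(0, 255, (3, 4, 4)).astype(float) / 255
K = np.random.randn(3, 3, 3) * 0.1

# Top-left 3x3 patch of every channel: shape (3, 3, 3).
patch = X[:, 0:3, 0:3]

# Element-wise products, summed over the 3x3 window of each channel.
channel_sums = (patch * K).sum(axis=(1, 2))   # [R_sum, G_sum, B_sum]

# Adding the three per-channel sums gives Output[0, 0] (bias = 0).
output_00 = channel_sums.sum()

for name, value in zip("RGB", channel_sums):
    print(f"{name}_sum = {value:.4f}")
print(f"Output[0,0] = {output_00:.4f}")
```

Summing per channel first and then across channels gives the same result as summing all 27 products at once; the split is only there to show each channel's contribution.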

All 4 output positions (4×4 input, 3×3 kernel, stride 1, no padding → 2×2 output):

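| Position | R_sum | G_sum | B_sum | Output |
|---|---|---|---|---|
| (0,0) | -0.1146 | 0.0011 | -0.1601 | -0.2736 |
| (0,1) | 0.0330 | -0.0535 | 0.0026 | -0.0179 |
| (1,0) | 0.1093 | 0.1516 | -0.0520 | 0.2089 |
| (1,1) | -0.0349 | -0.0837 | -0.0070 | -0.1256 |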

Feature map:

```text
-0.2736  -0.0179
 0.2089  -0.1256
```

One 3×3×3 filter, three input channels, one 2D feature map out: each filter collapses the input's channel dimension entirely, so a conv layer's output has one channel per filter, not one per input channel.
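
The per-channel columns and the final feature map can also be computed for all positions at once. The following is a sketch, not the post's implementation: it assumes NumPy 1.20 or newer (for `sliding_window_view`) and reuses the seed and call order of the Code section. `windows` and `per_channel` are names introduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed(42)
X = np.random.randint(0, 255, (3, 4, 4)).astype(float) / 255
K = np.random.randn(3, 3, 3) * 0.1

# All 3x3 windows over the spatial axes: shape (C, out_h, out_w, 3, 3).
windows = sliding_window_view(X, (3, 3), axis=(1, 2))

# Per-channel sums at every position: shape (C, out_h, out_w).
# per_channel[0] is the R_sum map, [1] the G_sum map, [2] the B_sum map.
per_channel = np.einsum("cijuv,cuv->cij", windows, K)

# Summing over the channel axis collapses it into one 2D feature map.
feature_map = per_channel.sum(axis=0)

print(windows.shape)       # (3, 2, 2, 3, 3)
print(per_channel.shape)   # (3, 2, 2)
print(feature_map.shape)   # (2, 2)
print(np.round(feature_map, 4))
```

Keeping `per_channel` as an intermediate makes the collapse explicit: the channel axis exists right up to the final `.sum(axis=0)` and is gone afterwards.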

Figure: a 3×3×3 filter on a 3-channel input produces one 2×2 feature map. At Output[0,0]: R × K_R = -0.1146, G × K_G = 0.0011, B × K_B = -0.1601, Σ = -0.2736. Repeated for all 4 positions, this gives the 2×2 feature map [-0.27, -0.02; 0.21, -0.13].

Multiple Filters with RGB

A real conv layer uses many filters, each a separate 3×3×3 kernel producing its own 2×2 feature map. With 8 filters:

parameters = 8 × (3×3×3) + 8 biases = 8 × 27 + 8 = 216 + 8 = 224

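Output shape after conv: 2×2×8 (written height × width × channels from here on; the same tensor is 8×2×2 in the channels-first PyTorch convention used for the input above). That is 8 stacked 2×2 feature maps, one per filter, each built the same way as above (3 channel slices → sum → one 2D map).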

Figure: 4×4×3 input → Conv(8 filters, each 3×3×3, 27 weights) → 2×2×8 output. Total params: 8×(3×3×3)+8 = 224.
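
A filter bank is just the single-filter operation repeated once per filter. The sketch below uses stand-in random data with an arbitrary seed (not the post's anchor values), and `conv_layer` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed for illustration
X = rng.random((3, 4, 4))                      # stand-in RGB input, (C, H, W)
filters = rng.normal(0, 0.1, (8, 3, 3, 3))     # (num_filters, C, kH, kW)
biases = np.zeros(8)                           # one bias per filter

def conv_layer(X, filters, biases):
    """Valid (no padding), stride-1 convolution with a bank of filters."""
    F, C, k, _ = filters.shape
    out_h, out_w = X.shape[1] - k + 1, X.shape[2] - k + 1
    out = np.zeros((F, out_h, out_w))
    for f in range(F):
        for i in range(out_h):
            for j in range(out_w):
                # Sum over all channels and the k x k window at once.
                out[f, i, j] = np.sum(X[:, i:i+k, j:j+k] * filters[f]) + biases[f]
    return out

out = conv_layer(X, filters, biases)
n_params = filters.size + biases.size

print(out.shape)   # (8, 2, 2): channels-first form of 2×2×8
print(n_params)    # 224
```

Each `out[f]` depends only on `filters[f]`, which is why the parameter count is a plain multiple of the per-filter count.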

Full Pipeline

Input (4×4×3) → Conv(8 filters, 3×3) → ReLU → MaxPool(2×2) → Flatten → FC → Sigmoid (binary output)

| Stage | Shape |
|---|---|
| Input | 4×4×3 |
| After Conv(8, 3×3) | 2×2×8 |
| After ReLU | 2×2×8 (unchanged) |
| After MaxPool(2×2) | 1×1×8 |
| After Flatten | 8 |
| After FC(8→1) + Sigmoid | 1 (probability) |

MaxPool with a 2×2 window on a 2×2 input collapses each feature map to its single maximum value — the pool window equals the entire feature map, so this is equivalent to taking one number per channel. Flatten then turns the 1×1×8 tensor into an 8-element vector, and a final FC + sigmoid layer (post 08's mechanics, output size 1 instead of 3) produces a single probability for binary classification.
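
The whole pipeline can be traced as a forward pass to confirm the shapes in the table. This is a minimal sketch with untrained, randomly initialized weights and an arbitrary seed, so the printed probability is meaningless; only the shapes matter. All variable names are assumptions made for this example, and arrays are channels-first, so 2×2×8 appears as (8, 2, 2).

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
X = rng.random((3, 4, 4))                      # (C, H, W) stand-in input
conv_w = rng.normal(0, 0.1, (8, 3, 3, 3))      # 8 filters, each 3x3x3
conv_b = np.zeros(8)
fc_w = rng.normal(0, 0.1, 8)                   # FC(8 -> 1) weights
fc_b = 0.0

# Conv: valid, stride 1 -> (8, 2, 2)
conv = np.zeros((8, 2, 2))
for f in range(8):
    for i in range(2):
        for j in range(2):
            conv[f, i, j] = np.sum(X[:, i:i+3, j:j+3] * conv_w[f]) + conv_b[f]

relu = np.maximum(conv, 0)                     # (8, 2, 2), shape unchanged
# The 2x2 pool window covers the whole 2x2 map, so pooling reduces to
# one max per channel. This shortcut is only valid for this toy size.
pooled = relu.max(axis=(1, 2), keepdims=True)  # (8, 1, 1)
flat = pooled.reshape(-1)                      # (8,)
logit = flat @ fc_w + fc_b                     # scalar
prob = 1.0 / (1.0 + np.exp(-logit))            # sigmoid -> value in (0, 1)

for name, t in [("conv", conv), ("relu", relu), ("pooled", pooled), ("flat", flat)]:
    print(f"{name:<6} {t.shape}")
print(f"prob   {prob:.4f}")
```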


Code

```python
import numpy as np

np.random.seed(42)
# 4×4 RGB image (3×4×4 in PyTorch convention).
# randint's upper bound is exclusive, so raw values fall in 0..254.
X = np.random.randint(0, 255, (3, 4, 4)).astype(float) / 255

# Single 3×3×3 filter
K = np.random.randn(3, 3, 3) * 0.1

# Multi-channel conv (stride=1, no padding → 2×2 output)
def conv_rgb(X, K):
    """Slide each kernel slice over its channel and accumulate into one 2D map."""
    C, H, W = X.shape
    k = K.shape[1]                      # kernel height/width (assumed square)
    out_h, out_w = H-k+1, W-k+1
    out = np.zeros((out_h, out_w))
    for c in range(C):
        for i in range(out_h):
            for j in range(out_w):
                # Sum of products for channel c, added to the running total
                out[i,j] += np.sum(X[c, i:i+k, j:j+k] * K[c])
    return out

feature_map = conv_rgb(X, K)
print("Input shape:", X.shape)
print("Filter shape:", K.shape)
print("Output feature map (2×2):")
print(np.round(feature_map, 4))
print("Output shape:", feature_map.shape)
```

```text
Input shape: (3, 4, 4)
Filter shape: (3, 3, 3)
Output feature map (2×2):
[[-0.2736 -0.0179]
 [ 0.2089 -0.1256]]
Output shape: (2, 2)
```

Hyperparameter Sensitivity: Filter Count

The number of filters is the key hyperparameter in a multi-channel conv layer — it controls both parameter count and how many distinct local patterns the layer can represent, while the per-position arithmetic (channel slices summed to one scalar) stays identical regardless of filter count.

```python
for num_filters in [1, 8, 16, 32, 64]:
    params = num_filters * (3 * 3 * 3) + num_filters
    print(f"filters={num_filters:>2} -> params={params:>5}, output shape=(2, 2, {num_filters})")
```

```text
filters= 1 -> params=   28, output shape=(2, 2, 1)
filters= 8 -> params=  224, output shape=(2, 2, 8)
filters=16 -> params=  448, output shape=(2, 2, 16)
filters=32 -> params=  896, output shape=(2, 2, 32)
filters=64 -> params= 1792, output shape=(2, 2, 64)
```

Parameter count scales linearly with filter count (28 per filter: 27 weights + 1 bias) — doubling filters exactly doubles parameters, since each filter is an independent 3×3×3 kernel with no shared weights across filters. At the low extreme (1 filter), the layer collapses the RGB input to a single 2×2 map — it can detect exactly one pattern (e.g., "vertical edge in this color combination"), which is why real first layers never use fewer than a handful of filters. At the high extreme (64 filters on a 4×4×3 anchor with 48 input values total), the layer has 1,792 parameters fitting 48 numbers — heavily overparameterized for this toy input, though not a problem on real 224×224×3 images where the input has far more values than the filters have weights.
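
The overparameterization comparison is simple arithmetic, sketched here in plain Python. `conv_params` and `input_values` are hypothetical helper names; the 224×224×3 size is the one mentioned above.

```python
def conv_params(num_filters, in_channels=3, k=3):
    """Weights plus one bias per filter for a k x k conv layer."""
    return num_filters * (k * k * in_channels) + num_filters

def input_values(h, w, c=3):
    """Number of scalar values in an h x w image with c channels."""
    return h * w * c

for h, w in [(4, 4), (224, 224)]:
    n_in = input_values(h, w)
    for num_filters in [1, 64]:
        n_par = conv_params(num_filters)
        print(f"input {h}x{w}x3 = {n_in:>6} values, "
              f"filters={num_filters:>2} -> params={n_par:>4}, "
              f"params/input = {n_par / n_in:.4f}")
```

The parameter count never depends on the image size, only on kernel size, input channels, and filter count, so the ratio shrinks quickly as the input grows.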


Where this builds from: RGB image representation (post 03) established that a color image is 3 stacked 2D grids. Single-channel convolution (post 04) established the sliding-window sum-of-products mechanic that this post extends across all 3 channels at once.

Where this leads: Real architectures — AlexNet, VGG, ResNet — start exactly here: an RGB input, a stack of multi-channel conv layers with increasing filter counts, pooling between stages, and a flatten+FC or GAP head (post 08) at the end. Batch normalization, typically inserted right after each conv layer in modern architectures, is the natural next addition once multi-channel convolution is understood.


Honest Limitations

One 3×3×3 filter has only 27 weights and a receptive field of 3×3 pixels — it cannot represent complex, large-scale patterns on its own. Detecting something like "a face" requires many conv layers stacked, each building on the previous layer's feature maps to grow the effective receptive field; a single conv layer only ever sees local pixel neighborhoods.
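
How quickly stacking grows the receptive field can be estimated with the standard recurrence for stride and kernel size. The sketch below is an illustration under simplifying assumptions (square kernels, no dilation); `receptive_field` is a hypothetical helper, not something from the post.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1            # one pixel sees itself; output step = 1 input pixel
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the view by (k - 1) output steps
        jump *= s              # striding spreads later outputs further apart
    return rf

# Stacked 3x3, stride-1 convs: 3, 5, 7, 11 pixels for 1, 2, 3, 5 layers.
for n in [1, 2, 3, 5]:
    print(n, "stacked 3x3 convs ->", receptive_field([(3, 1)] * n))

# Three stages of 3x3 conv followed by 2x2 pool (stride 2): 22 pixels.
print("3 x (conv + pool) ->", receptive_field([(3, 1), (2, 2)] * 3))
```

Convs alone add only 2 pixels per layer; pooling (or any stride greater than 1) is what makes the receptive field grow fast enough to cover large structures.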

A single filter detects exactly one pattern. With only 8 filters, this layer captures at most 8 distinct local patterns (edges, color blobs, simple textures at this depth). First layers in trained networks typically use 32–64 filters, and that count applies only to the early layers, with deeper layers usually using more. Capturing the diversity of patterns in natural images needs many filters across many layers, not more filters in one layer.

Channel ordering (RGB vs. BGR) differs between frameworks and libraries: OpenCV loads images as BGR by default, while PIL and most deep learning frameworks expect RGB. Feeding a BGR-ordered tensor into a model trained on RGB applies the weights learned for the R channel to the B data and vice versa. The shapes still match, so the model runs without erroring but performs worse than expected, which makes this a common, hard-to-spot bug.
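
The silent nature of this bug is easy to reproduce. The sketch below uses stand-in random data with an arbitrary seed; `K` stands in for a filter learned on RGB input, and reversing the channel axis simulates loading the same image in BGR order.

```python
import numpy as np

def conv_rgb(X, K):
    """Valid, stride-1 multi-channel convolution (single filter, no bias)."""
    C, H, W = X.shape
    k = K.shape[1]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[:, i:i+k, j:j+k] * K)
    return out

rng = np.random.default_rng(0)       # arbitrary seed
X_rgb = rng.random((3, 4, 4))        # stand-in image, channels in R, G, B order
K = rng.normal(0, 0.1, (3, 3, 3))    # stand-in for a filter learned on RGB

X_bgr = X_rgb[::-1]                  # same pixels, channels reversed to B, G, R

out_rgb = conv_rgb(X_rgb, K)
out_bgr = conv_rgb(X_bgr, K)

print(out_rgb.shape == out_bgr.shape)    # True: no shape error is raised
print(np.allclose(out_rgb, out_bgr))     # False: the activations differ

# Reversing the channel axis again restores the original output.
print(np.allclose(conv_rgb(X_bgr[::-1], K), out_rgb))   # True
```

Nothing in the tensor records which channel is which, so the only defense is to convert to a known order at load time and keep it consistent between training and inference.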


Test Your Understanding

  1. Why is a filter for an RGB image a single 3×3×3 kernel rather than three separate 3×3 filters — one applied to each channel? What would change about the output shape if three separate filters were used and never summed?

  2. Given the anchor's parameter count (224 for 8 filters), compute the parameter count for a second conv layer that takes the 2×2×8 output as input and applies 16 filters of size 3×3. (Careful: the input now has 8 channels, not 3.)

  3. In the position (0,0) computation, the R, G, and B sums were -0.1146, 0.0011, and -0.1601. If the B channel's kernel slice K_B were scaled by 10×, would the output feature map still meaningfully use information from the R and G channels? What does this imply about a filter's learned emphasis across channels?

  4. The full pipeline collapses a 4×4 input all the way to 1×1×8 after a single pool layer, because the feature map was already only 2×2. For a more realistic 224×224×3 input using the same conv+pool pattern, would one conv+pool stage be enough to reach a 1×1 spatial size? Roughly how many conv+pool stages would be needed, assuming each stage roughly halves spatial dimensions?

  5. A colleague loads training images with OpenCV (BGR order) but evaluates the trained model on images loaded with PIL (RGB order). The model's training accuracy was high but validation accuracy on the PIL-loaded images is much worse. Using what you know about multi-channel convolution, explain why this channel-order mismatch degrades performance instead of causing an outright shape error.
