All You Need to Know About Images
Before writing a single line of CNN code, you need to understand exactly what data a CNN receives. An image is not a picture — it is a tensor. Every pixel is a number. Every image is a 3D array. Every batch of images is a 4D array. Misunderstanding the shape or scale of this data is the source of a surprising number of bugs in production computer vision code.
Anchor: 4×4 grayscale image of a simplified digit.
Pixel values (0 = black, 255 = white):
0 0 255 255
0 0 0 255
0 0 255 0
0 255 0 0
Grayscale: One Channel
A grayscale image is a 2D matrix. Each entry is an integer between 0 (black) and 255 (white). The anchor is a 4×4 matrix with dtype uint8 (unsigned 8-bit integer, range 0–255).
Shape: (H, W) = (4, 4) — height × width.
When fed to a CNN, this becomes a 3D tensor by adding the channel dimension: (H, W, C) = (4, 4, 1) in TensorFlow convention, or (C, H, W) = (1, 4, 4) in PyTorch convention.
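Adding the channel dimension is a one-line indexing operation in NumPy. The following is a minimal sketch using the 4×4 anchor; the variable names `gray_hwc` and `gray_chw` are illustrative and not part of any framework API.

```python
import numpy as np

# The 4x4 grayscale anchor, stored as uint8.
gray = np.array([
    [0,   0, 255, 255],
    [0,   0,   0, 255],
    [0,   0, 255,   0],
    [0, 255,   0,   0],
], dtype=np.uint8)

# Channels-last (TensorFlow-style): append a trailing axis of length 1.
gray_hwc = gray[..., np.newaxis]
# Channels-first (PyTorch-style): prepend a leading axis of length 1.
gray_chw = gray[np.newaxis, ...]

print(gray.shape)      # (4, 4)
print(gray_hwc.shape)  # (4, 4, 1)
print(gray_chw.shape)  # (1, 4, 4)
```

Both results are views of the same 16 pixel values; only the shape metadata differs.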
RGB: Three Channels
A color image has 3 channels: Red, Green, Blue. Each channel is a full H×W matrix of pixel values.
For a 4×4 RGB image:
R channel (reds):
20 60 180 255
5 15 30 200
3 10 200 20
2 200 10 10

G channel (greens):
10 40 120 200
3 10 20 150
2 5 140 10
1 130 5 5

B channel (blues):
5 20 80 150
2 5 10 100
1 2 90 5
1 70 3 3

Shape: (H, W, C) = (4, 4, 3) in TensorFlow format, (C, H, W) = (3, 4, 4) in PyTorch format.
The full 4×4 RGB tensor contains 4 × 4 × 3 = 48 values — three 4×4 matrices stacked.
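To make the "three stacked matrices" picture concrete, here is a small sketch that builds the RGB tensor from the three channel matrices above with `np.stack`. The names `rgb_hwc` and `rgb_chw` are illustrative.

```python
import numpy as np

# The three 4x4 channel matrices from the example above.
R = np.array([[20, 60, 180, 255], [5, 15, 30, 200],
              [3, 10, 200, 20], [2, 200, 10, 10]], dtype=np.uint8)
G = np.array([[10, 40, 120, 200], [3, 10, 20, 150],
              [2, 5, 140, 10], [1, 130, 5, 5]], dtype=np.uint8)
B = np.array([[5, 20, 80, 150], [2, 5, 10, 100],
              [1, 2, 90, 5], [1, 70, 3, 3]], dtype=np.uint8)

# Stack along a new last axis for channels-last, or a new first axis for channels-first.
rgb_hwc = np.stack([R, G, B], axis=-1)
rgb_chw = np.stack([R, G, B], axis=0)

print(rgb_hwc.shape, rgb_chw.shape)  # (4, 4, 3) (3, 4, 4)
print(rgb_hwc.size)                  # 48

# The top-right pixel holds the same (R, G, B) triple in both layouts.
print(rgb_hwc[0, 3])     # [255 200 150]
print(rgb_chw[:, 0, 3])  # [255 200 150]
```

The only difference between the two layouts is which index selects the channel.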
Normalization
Raw pixel values (0–255) are poorly scaled for neural network training. Inputs this large produce large activations and large gradients, which can make training unstable.
Simple normalization: divide by 255 → values in [0, 1].
Pixel value 200 → 200/255 = 0.784
ImageNet normalization (per-channel mean subtraction and std scaling):
ImageNet statistics computed over 1.2M training images:
- R channel: mean = 0.485, std = 0.229
- G channel: mean = 0.456, std = 0.224
- B channel: mean = 0.406, std = 0.225
For a pixel with raw value 200 in the R channel:
Step 1: divide by 255 → 200/255 = 0.784
Step 2: subtract mean → 0.784 − 0.485 = 0.299
Step 3: divide by std → 0.299 / 0.229 = 1.306
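The normalized value is 1.306 (1.307 if the intermediate steps are not rounded). It is positive because this pixel is brighter than the ImageNet average for the R channel, and its size says the pixel sits about 1.3 standard deviations above that average.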
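Use ImageNet normalization when the pre-trained weights you load were trained with it, which is the case for many published ResNet, VGG, and EfficientNet checkpoints. Inputs that skip it look "wrong" to the model because every value is shifted and scaled away from what the weights expect. Check the preprocessing documented for your specific weights, since some releases use a different scheme.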
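The three steps above extend to a whole image through broadcasting. The sketch below assumes a channels-last (H, W, 3) uint8 RGB input; the function name `imagenet_normalize` and the constant names are hypothetical helpers for illustration, not a library API.

```python
import numpy as np

# Per-channel ImageNet statistics (R, G, B), defined on values scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def imagenet_normalize(img_hwc_uint8):
    """Scale a (H, W, 3) uint8 RGB image to [0, 1], then standardize each channel."""
    scaled = img_hwc_uint8.astype(np.float32) / 255.0
    # The (3,) statistics broadcast across the trailing channel axis.
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# Hypothetical one-pixel image with raw value 200 in every channel.
pixel = np.full((1, 1, 3), 200, dtype=np.uint8)
out = imagenet_normalize(pixel)

print(out.shape)               # (1, 1, 3)
print(np.round(out[0, 0], 2))  # R is about 1.31, matching the worked example
```

For a channels-first (3, H, W) array the statistics would need reshaping to (3, 1, 1) before broadcasting.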
Memory Layout and Axis Conventions
Two conventions — knowing which one your framework uses prevents shape errors:
| Framework | Axis order | Shape for 1 image | Shape for batch of 32 |
|---|---|---|---|
| TensorFlow / Keras | (H, W, C) | (4, 4, 3) | (32, 4, 4, 3) |
| PyTorch | (C, H, W) | (3, 4, 4) | (32, 3, 4, 4) |
The batch dimension N is always first: (N, ...).
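Converting between the two layouts is an axis permutation, not a reshape. Here is a minimal sketch on a synthetic batch; the array contents are arbitrary placeholders chosen so that every element is unique.

```python
import numpy as np

# Hypothetical batch: 32 RGB images of size 4x4, channels-last (N, H, W, C).
batch_nhwc = np.arange(32 * 4 * 4 * 3, dtype=np.float32).reshape(32, 4, 4, 3)

# Move the channel axis to position 1 to get (N, C, H, W).
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))
print(batch_nchw.shape)  # (32, 3, 4, 4)

# The same pixel and channel are reachable in both layouts.
n, h, w, c = 5, 2, 1, 0
print(batch_nhwc[n, h, w, c] == batch_nchw[n, c, h, w])  # True

# The inverse permutation restores the original layout.
back = np.transpose(batch_nchw, (0, 2, 3, 1))
print(np.array_equal(back, batch_nhwc))  # True
```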
Memory calculation example:
For a 224×224 RGB image (float32, 4 bytes per value):
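224 × 224 × 3 × 4 bytes = 602,112 bytes ≈ 602 KB (588 KiB) per image
Batch of 32 images: 32 × 602,112 bytes = 19,267,584 bytes ≈ 19.3 MB (18.4 MiB) per batch. This covers only the input tensor; during training, activations, weights, and gradients occupy additional GPU memory on top of it.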
Trace Table
| Operation | Formula | Values | Result |
|---|---|---|---|
| Pixel count (4×4×3) | H × W × C | 4 × 4 × 3 | 48 values |
| Simple normalize | pixel / 255 | 200 / 255 | 0.784 |
| ImageNet normalize (R) | (pixel/255 − μ) / σ | (0.784 − 0.485) / 0.229 | 1.306 |
| Memory (224×224 RGB float32) | H × W × C × 4 | 224 × 224 × 3 × 4 | 602,112 bytes ≈ 602 KB |
| Batch memory (32 images) | batch × img_bytes | 32 × 602,112 | 19.3 MB |
| PyTorch shape (batch of 8, 64×64 RGB) | (N, C, H, W) | (8, 3, 64, 64) | tensor shape |
Code
import numpy as np
# Anchor: 4×4 grayscale
gray = np.array([
[ 0, 0, 255, 255],
[ 0, 0, 0, 255],
[ 0, 0, 255, 0],
[ 0, 255, 0, 0],
], dtype=np.uint8)
print("Grayscale shape:", gray.shape) # (4, 4)
print("CNN input shape (TF):", (*gray.shape, 1)) # (4, 4, 1)
print("CNN input shape (PT):", (1, *gray.shape)) # (1, 4, 4)
# Normalize
gray_f = gray.astype(np.float32) / 255
print("\nNormalized values (sample):")
print(np.round(gray_f, 3))
# ImageNet normalization (on the R channel of the 4×4 RGB anchor)
R = np.array([[20, 60, 180, 255],[5, 15, 30, 200],[3, 10, 200, 20],[2, 200, 10, 10]], dtype=np.float32)
R_norm = (R / 255 - 0.485) / 0.229
print("\nR channel normalized (ImageNet stats):")
print(np.round(R_norm, 3))
# Memory computation for a standard image
H, W, C, bytes_per_val = 224, 224, 3, 4
mem_bytes = H * W * C * bytes_per_val
print(f"\nMemory per 224×224 RGB float32 image: {mem_bytes:,} bytes = {mem_bytes/1024:.1f} KiB")
print(f"Batch of 32: {32*mem_bytes/1024/1024:.1f} MiB")
Output:
Grayscale shape: (4, 4)
CNN input shape (TF): (4, 4, 1)
CNN input shape (PT): (1, 4, 4)

Normalized values (sample):
[[0. 0. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]]

R channel normalized (ImageNet stats):
[[-1.775 -1.09   0.965  2.249]
 [-2.032 -1.861 -1.604  1.307]
 [-2.067 -1.947  1.307 -1.775]
 [-2.084  1.307 -1.947 -1.947]]

Memory per 224×224 RGB float32 image: 602,112 bytes = 588.0 KiB
Batch of 32: 18.4 MiB
Pixel 255 in the R channel normalizes to 2.249: 255/255 = 1.0 is 0.515 above the ImageNet R mean of 0.485, which is more than two standard deviations. Pixel 20 normalizes to −1.775, so a very dark pixel sits well below the ImageNet average. Pixel 200 comes out as 1.307 rather than the hand-computed 1.306 because the code does not round intermediate values. The memory figures divide by 1024, so they are in binary units (KiB, MiB) and read slightly lower than the decimal 602 KB and 19.3 MB.
Related Concepts
Where this builds from: The CNN introduction (post 01) mentioned that a 224×224 RGB image has 150,528 inputs. Now you know why: 224 × 224 × 3 = 150,528 values.
Where this leads: The convolution operation (next post) operates on exactly these tensors — a 3D input tensor with shape (H, W, C) and a 3D filter with shape (k, k, C_in). Understanding the tensor layout is prerequisite to understanding multi-channel convolution (post 09).
Honest Limitations
uint8 → float32 conversion quadruples memory usage. Models typically operate in float32 (4 bytes per value), but images are stored as uint8 (1 byte per value). Training a model on 1M images requires both the stored dataset (uint8, 4× smaller) and the batch in GPU memory (float32). Mixed-precision training (float16) can halve the memory needed for tensors held in half precision, such as the input batch, at the cost of some numerical precision.
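The byte counts per dtype can be checked directly with NumPy's `nbytes` attribute. This is a small sketch on a blank 224×224 RGB image; the `sizes` dictionary is just a convenience for the printout.

```python
import numpy as np

# A blank 224x224 RGB image as it would be stored on disk or in a uint8 buffer.
img_u8 = np.zeros((224, 224, 3), dtype=np.uint8)

# Cast to each dtype and record how many bytes the array occupies.
sizes = {}
for dtype in (np.uint8, np.float16, np.float32):
    name = np.dtype(dtype).name
    sizes[name] = img_u8.astype(dtype).nbytes
    print(f"{name:>8}: {sizes[name]:,} bytes")

# uint8:   150,528 bytes
# float16: 301,056 bytes
# float32: 602,112 bytes
```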
Normalization statistics should match the training distribution. Using ImageNet statistics on medical images (X-rays, MRIs) is incorrect — the pixel distributions are fundamentally different. Always compute mean and std from your specific training set unless you're using pre-trained ImageNet weights.
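Computing per-channel statistics from your own data takes only a few lines. The sketch below uses random pixels as a stand-in for a real training set, so the resulting numbers are meaningless in themselves; the array name `train` and its shape are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a training set: 100 RGB images, 32x32, channels-last.
train = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)

# Scale to [0, 1] first so the statistics are comparable to the ImageNet ones.
scaled = train.astype(np.float32) / 255.0

# Reduce over batch, height, and width, leaving one value per channel.
mean = scaled.mean(axis=(0, 1, 2))
std = scaled.std(axis=(0, 1, 2))
print(mean.shape, std.shape)  # (3,) (3,)

# Apply the dataset's own statistics.
normalized = (scaled - mean) / std
```

After this step each channel of `normalized` has mean close to 0 and standard deviation close to 1 over the training set. The same `mean` and `std` should then be reused unchanged for validation and test images.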
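The (H, W, C) vs (C, H, W) convention is a persistent source of bugs. Passing a (3, 4, 4) tensor to a model that expects (4, 4, 3) usually raises a shape error, which is the easy case. The dangerous case is when the shape is valid under both readings: a 3×3 RGB image is (3, 3, 3) either way, and calling reshape instead of transpose yields the expected shape with scrambled pixels. In those cases nothing fails and the model silently produces wrong results. Always verify axis order when loading weights or moving between frameworks.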
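One common way to hit the silent failure is to change layouts with `reshape` where `transpose` is needed. The sketch below uses a synthetic channels-first array whose values make the scrambling easy to see.

```python
import numpy as np

# Channels-first array: channel 0 holds 0..15, channel 1 holds 16..31, channel 2 holds 32..47.
chw = np.arange(3 * 4 * 4).reshape(3, 4, 4)

# Correct: permute axes so each pixel keeps its own three channel values.
correct = np.transpose(chw, (1, 2, 0))
# Wrong: reshape reinterprets the flat buffer and mixes values across channels.
wrong = chw.reshape(4, 4, 3)

print(correct.shape == wrong.shape)    # True
print(np.array_equal(correct, wrong))  # False

print(correct[0, 0])  # [ 0 16 32]
print(wrong[0, 0])    # [0 1 2]
```

Both arrays pass a shape check, so a test that only compares shapes would not catch the bug. Comparing a known pixel before and after the conversion does.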
Test Your Understanding
- A grayscale MNIST image is 28×28. When fed to a CNN, what is the shape in (a) TensorFlow (H, W, C) format and (b) PyTorch (C, H, W) format? When batched into groups of 64, what is the full 4D tensor shape for each?
- A pixel has raw value 100 in the G channel. Apply ImageNet normalization (μ_G = 0.456, σ_G = 0.224). What is the normalized value? Would a pixel value of 116 normalize to approximately 0? Show your calculation.
- You are loading a model pre-trained on ImageNet and fine-tuning it on a medical imaging dataset where pixel values represent tissue density in Hounsfield units (range −1000 to +3000). Should you use ImageNet normalization? If not, what normalization would be more appropriate?
- A batch of 16 images, each 512×512 RGB, stored as float32. Compute the total GPU memory required in MB. If you switch to mixed-precision (float16), how much memory would you save?
- You convert images to grayscale before feeding them to a CNN for flower classification. A colleague argues that this loses important information since flower colors are distinctive features. How would you test whether color channels matter for this specific task? What experiment would you run?