All You Need to Know About Images

Jul 1, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

Before writing a single line of CNN code, you need to understand exactly what data a CNN receives. To the network, an image is not a picture; it is a tensor. Every pixel is a number. Every image, once it has a channel axis, is a 3D array. Every batch of images is a 4D array. Misunderstanding the shape or scale of this data is a common source of bugs in production computer vision code.

Anchor: 4×4 grayscale image of a simplified digit.

text
Pixel values (0 = black, 255 = white):
  0   0 255 255
  0   0   0 255
  0   0 255   0
  0 255   0   0

Grayscale: One Channel

A grayscale image is a 2D matrix. Each entry is an integer between 0 (black) and 255 (white). The anchor is a 4×4 matrix with dtype uint8 (unsigned 8-bit integer, range 0–255).

Shape: (H, W) = (4, 4) — height × width.

When fed to a CNN, this becomes a 3D tensor by adding the channel dimension: (H, W, C) = (4, 4, 1) in TensorFlow convention, or (C, H, W) = (1, 4, 4) in PyTorch convention.
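A minimal NumPy sketch of adding the channel axis to the anchor image. The variable names (`gray`, `gray_hwc`, `gray_chw`) are illustrative choices, not part of any framework API:

```python
import numpy as np

# The 4×4 anchor image as a uint8 matrix, shape (H, W)
gray = np.array([
    [0,   0, 255, 255],
    [0,   0,   0, 255],
    [0,   0, 255,   0],
    [0, 255,   0,   0],
], dtype=np.uint8)

# Channels-last (TensorFlow convention): append a trailing axis
gray_hwc = gray[..., np.newaxis]   # shape (4, 4, 1)

# Channels-first (PyTorch convention): prepend a leading axis
gray_chw = gray[np.newaxis, ...]   # shape (1, 4, 4)

print(gray_hwc.shape, gray_chw.shape)
```

Both results are views of the same 16 values; only the shape metadata changes.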


RGB: Three Channels

A color image has 3 channels: Red, Green, Blue. Each channel is a full H×W matrix of pixel values.

For a 4×4 RGB image:

text
R channel (reds):        G channel (greens):      B channel (blues):
 20  60 180 255           10  40 120 200           5   20  80 150
  5  15  30 200           3   10  20 150           2    5  10 100
  3  10 200  20           2    5 140  10           1    2  90   5
  2 200  10  10           1  130   5   5           1   70   3   3

Shape: (H, W, C) = (4, 4, 3) in TensorFlow format, (C, H, W) = (3, 4, 4) in PyTorch format.

The full 4×4 RGB tensor contains 4 × 4 × 3 = 48 values — three 4×4 matrices stacked.

Figure: RGB image as three channel matrices. The R, G, and B channels (each 4×4×1) are stacked into shape (4, 4, 3) in TF or (3, 4, 4) in PyTorch.
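The stacking can be sketched in NumPy using the three channel matrices from the example above. This assumes plain arrays named `R`, `G`, and `B`; a real pipeline would usually get the stacked array directly from an image loader:

```python
import numpy as np

R = np.array([[20, 60, 180, 255], [5, 15, 30, 200],
              [3, 10, 200, 20], [2, 200, 10, 10]], dtype=np.uint8)
G = np.array([[10, 40, 120, 200], [3, 10, 20, 150],
              [2, 5, 140, 10], [1, 130, 5, 5]], dtype=np.uint8)
B = np.array([[5, 20, 80, 150], [2, 5, 10, 100],
              [1, 2, 90, 5], [1, 70, 3, 3]], dtype=np.uint8)

# Channels-last: each pixel holds its (R, G, B) triple on the last axis
rgb_hwc = np.stack([R, G, B], axis=-1)   # shape (4, 4, 3)

# Channels-first: three full 4×4 planes, one per channel
rgb_chw = np.stack([R, G, B], axis=0)    # shape (3, 4, 4)

print(rgb_hwc.shape, rgb_chw.shape, rgb_hwc.size)
print(rgb_hwc[0, 3])   # top-right pixel: [255 200 150]
```

Indexing `rgb_hwc[row, col]` returns one pixel's three values, while `rgb_chw[c]` returns one whole channel plane.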

Normalization

Raw pixel values (0–255) are poorly scaled for neural network training. Large input values produce large activations and large gradients, which can make training unstable.

Simple normalization: divide by 255 → values in [0, 1].

Pixel value 200 → 200/255 = 0.784

ImageNet normalization (per-channel mean subtraction and std scaling):

ImageNet statistics, computed over 1.2M training images after scaling pixels to [0, 1]:

  • R channel: mean = 0.485, std = 0.229
  • G channel: mean = 0.456, std = 0.224
  • B channel: mean = 0.406, std = 0.225

For a pixel with raw value 200 in the R channel:

Step 1: divide by 255 → 200/255 = 0.7843

Step 2: subtract mean → 0.7843 − 0.485 = 0.2993

Step 3: divide by std → 0.2993 / 0.229 = 1.307

The normalized value is 1.307: this pixel is about 1.3 standard deviations brighter than the ImageNet R-channel mean.

Use ImageNet normalization when using pre-trained models (ResNet, VGG, EfficientNet). These models were trained with it; inputs without it look "wrong" to the model.
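A minimal sketch of per-channel ImageNet normalization in NumPy, assuming a channels-last uint8 image. The function name `imagenet_normalize` and the constant names are hypothetical; frameworks ship their own equivalents:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def imagenet_normalize(img_hwc_uint8):
    """Scale a channels-last uint8 RGB image to [0, 1], then standardize per channel."""
    scaled = img_hwc_uint8.astype(np.float32) / 255.0
    # Broadcasting: (H, W, 3) with (3,) applies each statistic to its own channel
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# A single pixel with raw value 200 in every channel
pixel = np.full((1, 1, 3), 200, dtype=np.uint8)
print(imagenet_normalize(pixel)[0, 0])   # roughly 1.307, 1.466, 1.681
```

For a channels-first (C, H, W) array, the statistics would need reshaping to (3, 1, 1) so that they broadcast along the channel axis instead of the width axis.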


Memory Layout and Axis Conventions

Two conventions — knowing which one your framework uses prevents shape errors:

| Framework | Axis order | Shape for 1 image | Shape for batch of 32 |
|---|---|---|---|
| TensorFlow / Keras | (H, W, C) | (4, 4, 3) | (32, 4, 4, 3) |
| PyTorch | (C, H, W) | (3, 4, 4) | (32, 3, 4, 4) |

The batch dimension N is always first: (N, ...).
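Converting between the two layouts is an axis permutation. A short NumPy sketch, assuming a batch held as a plain array (the names `batch_nhwc` and `batch_nchw` are illustrative):

```python
import numpy as np

# A batch of 32 channels-last images with distinct values: (N, H, W, C)
batch_nhwc = np.arange(32 * 4 * 4 * 3, dtype=np.float32).reshape(32, 4, 4, 3)

# Move the channel axis to position 1: (N, H, W, C) -> (N, C, H, W)
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))

# And back again: (N, C, H, W) -> (N, H, W, C)
round_trip = np.transpose(batch_nchw, (0, 2, 3, 1))

print(batch_nchw.shape)                          # (32, 3, 4, 4)
print(np.array_equal(round_trip, batch_nhwc))    # True
```

The batch axis stays at index 0 in both permutations; only the last three axes are reordered.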

Memory calculation example:

For a 224×224 RGB image (float32, 4 bytes per value):

224 × 224 × 3 × 4 bytes = 602,112 bytes ≈ 602 kB (588 KiB) per image

Batch of 32 images: 32 × 602,112 bytes = 19,267,584 bytes ≈ 19.3 MB (18.4 MiB) per batch. This covers only the input tensor; activations, weights, and gradients add to it during training.
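The same arithmetic as a small helper, shown in both decimal (kB, MB) and binary (KiB, MiB) units. `tensor_bytes` is a hypothetical name for this sketch, and it counts only the dense tensor itself:

```python
import numpy as np

def tensor_bytes(shape, dtype):
    """Bytes needed to store a dense tensor of the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

image = tensor_bytes((224, 224, 3), np.float32)
batch = tensor_bytes((32, 224, 224, 3), np.float32)

print(f"{image:,} bytes = {image / 1e3:.1f} kB = {image / 1024:.1f} KiB")
print(f"{batch:,} bytes = {batch / 1e6:.1f} MB = {batch / 1024**2:.1f} MiB")
```

The gap between 602 kB and 588 KiB is purely a unit convention: dividing by 1000 versus dividing by 1024.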


Trace Table

| Operation | Formula | Values | Result |
|---|---|---|---|
| Pixel count (4×4×3) | H × W × C | 4 × 4 × 3 | 48 values |
| Simple normalize | pixel / 255 | 200 / 255 | 0.784 |
| ImageNet normalize (R) | (pixel/255 − μ) / σ | (0.7843 − 0.485) / 0.229 | 1.307 |
| Memory (224×224 RGB float32) | H × W × C × 4 | 224 × 224 × 3 × 4 | 602,112 bytes ≈ 602 kB |
| Batch memory (32 images) | batch × img_bytes | 32 × 602,112 | 19,267,584 bytes ≈ 19.3 MB |
| PyTorch shape (batch of 8, 64×64 RGB) | (N, C, H, W) | (8, 3, 64, 64) | tensor shape |

Code

python
import numpy as np

# Anchor: 4×4 grayscale
gray = np.array([
    [  0,   0, 255, 255],
    [  0,   0,   0, 255],
    [  0,   0, 255,   0],
    [  0, 255,   0,   0],
], dtype=np.uint8)

print("Grayscale shape:", gray.shape)           # (4, 4)
print("CNN input shape (TF):", (*gray.shape, 1))  # (4, 4, 1)
print("CNN input shape (PT):", (1, *gray.shape))  # (1, 4, 4)

# Normalize
gray_f = gray.astype(np.float32) / 255
print("\nNormalized values (sample):")
print(np.round(gray_f, 3))

# ImageNet normalization (on the R channel of the 4×4 RGB anchor)
R = np.array([[20, 60, 180, 255],[5, 15, 30, 200],[3, 10, 200, 20],[2, 200, 10, 10]], dtype=np.float32)
R_norm = (R / 255 - 0.485) / 0.229
print("\nR channel normalized (ImageNet stats):")
print(np.round(R_norm, 3))

# Memory computation for a standard image
H, W, C, bytes_per_val = 224, 224, 3, 4
mem_bytes = H * W * C * bytes_per_val
print(f"\nMemory per 224×224 RGB float32 image: {mem_bytes:,} bytes = {mem_bytes/1024:.1f} KB")
print(f"Batch of 32: {32*mem_bytes/1024/1024:.1f} MB")
text
Grayscale shape: (4, 4)
CNN input shape (TF): (4, 4, 1)
CNN input shape (PT): (1, 4, 4)

Normalized values (sample):
[[0. 0. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]]

R channel normalized (ImageNet stats):
[[-1.775 -1.09   0.965  2.249]
 [-2.032 -1.861 -1.604  1.307]
 [-2.067 -1.947  1.307 -1.775]
 [-2.084  1.307 -1.947 -1.947]]

Memory per 224×224 RGB float32 image: 602,112 bytes = 588.0 KB
Batch of 32: 18.4 MB

Pixel 255 in the R channel normalizes to 2.249, the largest value this channel can take, because 255/255 = 1.0 sits well above the ImageNet R mean of 0.485. Pixel 200 normalizes to 1.307. Pixel 20 normalizes to −1.775: a very dark pixel is far below the ImageNet average. The memory lines divide by 1024, which is why they read 588.0 KB and 18.4 MB rather than the decimal 602 kB and 19.3 MB.


Where this builds from: The CNN introduction (post 01) mentioned that a 224×224 RGB image has 150,528 inputs. Now you know why: 224 × 224 × 3 = 150,528 values.

Where this leads: The convolution operation (next post) operates on exactly these tensors — a 3D input tensor with shape (H, W, C) and a 3D filter with shape (k, k, C_in). Understanding the tensor layout is prerequisite to understanding multi-channel convolution (post 09).


Honest Limitations

uint8 → float32 conversion quadruples memory usage. Models typically operate in float32 (4 bytes per value), but images are stored as uint8 (1 byte). Training a model on 1M images requires both the stored dataset (uint8, 4× smaller) and the batch in GPU memory (float32). Mixed-precision training (float16) can roughly halve the memory of the tensors it keeps in half precision, at the cost of some numerical precision.
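The per-dtype cost can be checked directly with NumPy's `nbytes` attribute. This is a sketch on a single blank 224×224 RGB array, not a measurement of any real training run:

```python
import numpy as np

# One 224×224 RGB image as decoded into memory
img_u8 = np.zeros((224, 224, 3), dtype=np.uint8)
img_f32 = img_u8.astype(np.float32)
img_f16 = img_u8.astype(np.float16)

for name, arr in [("uint8", img_u8), ("float32", img_f32), ("float16", img_f16)]:
    print(f"{name:8s} {arr.nbytes:>9,} bytes")

print(img_f32.nbytes // img_u8.nbytes)   # 4
```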

Normalization statistics should match the training distribution. Applying ImageNet statistics to medical images (X-rays, MRIs) when training from scratch is a poor fit because the pixel distributions are fundamentally different. Compute mean and std from your own training set unless you are using pre-trained ImageNet weights.
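Computing your own statistics could look like the sketch below. The `train` array is a random stand-in for a real training set (an assumption made only so the example runs), and it is small enough to fit in memory; a large dataset would need a streaming computation instead:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a training set: 100 channels-last RGB images, 32×32
train = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)

scaled = train.astype(np.float32) / 255.0

# Reduce over batch, height, and width; keep one statistic per channel
mean = scaled.mean(axis=(0, 1, 2))   # shape (3,)
std = scaled.std(axis=(0, 1, 2))     # shape (3,)

normalized = (scaled - mean) / std
print(mean.shape, std.shape)
print(normalized.mean(axis=(0, 1, 2)), normalized.std(axis=(0, 1, 2)))
```

After normalization each channel should have mean close to 0 and std close to 1 on the training set. The same `mean` and `std` are then reused unchanged for validation and test data.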

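The (H, W, C) vs (C, H, W) convention is a persistent source of bugs. Passing a (3, 4, 4) tensor where (4, 4, 3) is expected usually raises a shape error, which is the easy case. The dangerous cases are silent: converting layouts with reshape instead of transpose yields the expected shape with scrambled pixels, and an input whose dimensions coincide (for example 3×3×3) passes shape checks in either order. The model then runs and produces wrong results. Always verify axis order when loading weights or moving between frameworks.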
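One way a layout bug stays silent is using `reshape` where `transpose` was intended. The sketch below uses a small array with distinct values so the scrambling is visible; the variable names are illustrative:

```python
import numpy as np

# Channels-last image with distinct values so scrambling is visible
img_hwc = np.arange(4 * 4 * 3).reshape(4, 4, 3)

correct = np.transpose(img_hwc, (2, 0, 1))   # true (C, H, W) layout
wrong = img_hwc.reshape(3, 4, 4)             # same shape, scrambled pixels

print(correct.shape == wrong.shape)      # True
print(np.array_equal(correct, wrong))    # False

# First row of the first channel: every third value in the correct layout
print(correct[0, 0])   # [0 3 6 9]
print(wrong[0, 0])     # [0 1 2 3]
```

Because both arrays have shape (3, 4, 4), no shape check can tell them apart. Comparing a known pixel before and after conversion is a cheap safeguard.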


Test Your Understanding

  1. A grayscale MNIST image is 28×28. When fed to a CNN, what is the shape in (a) TensorFlow (H, W, C) format and (b) PyTorch (C, H, W) format? When batched into groups of 64, what is the full 4D tensor shape for each?

  2. A pixel has raw value 100 in the G channel. Apply ImageNet normalization (μ_G = 0.456, σ_G = 0.224). What is the normalized value? Would a pixel value of 116 normalize to approximately 0? Show your calculation.

  3. You are loading a model pre-trained on ImageNet and fine-tuning it on a medical imaging dataset where pixel values represent tissue density in Hounsfield units (range −1000 to +3000). Should you use ImageNet normalization? If not, what normalization would be more appropriate?

  4. A batch of 16 images, each 512×512 RGB, stored as float32. Compute the total GPU memory required in MB. If you switch to mixed-precision (float16), how much memory would you save?

  5. You convert images to grayscale before feeding them to a CNN for flower classification. A colleague argues that this loses important information since flower colors are distinctive features. How would you test whether color channels matter for this specific task? What experiment would you run?
