Quantization - Making Giant AI Models Fit in the Real World
From zero to practical, with hand math, Python, and honest tradeoffs.
1. What Is Quantization?
You have the number 3.14159265358979.
You only have space for two decimal places. So you write 3.14.
That's quantization: deliberately using less precision to save space, accepting a tiny error.
In AI: model weights stored as 32-bit floats get re-represented with fewer bits (16, 8, or 4). Less memory. Faster inference. Small accuracy cost.
Why it matters:
| Model | fp32 | int4 | Savings |
|---|---|---|---|
| Llama 3 8B | 32 GB | 4 GB | 8× |
| Llama 3 70B | 280 GB | 35 GB | 8× |
| GPT-3 175B | 700 GB | 87 GB | 8× |
A 70B model at fp32 needs 4× A100 80GB GPUs just to load.
At int4 (35 GB) it fits on a single A100, or spans a pair of 24 GB consumer cards.
2. How Numbers Live in Memory
Every number is stored as bits: 0s and 1s.
Integers
INT8 = 8 bits = 2⁸ = 256 possible values (−128 to +127)

┌──┬──┬──┬──┬──┬──┬──┬──┐
│S │     magnitude      │
└──┴──┴──┴──┴──┴──┴──┴──┘
 ↑
 sign bit

Gaps between values are always equal (uniform spacing):

−3  −2  −1   0   1   2   3
 |   |   |   |   |   |   |
 └── equal gaps everywhere ─┘

Floating Point
FP32 = 32 bits, split into 3 fields:
┌──┬──────────┬──────────────────────┐
│S │ Exponent │       Mantissa       │
│1b│  8 bits  │       23 bits        │
└──┴──────────┴──────────────────────┘
 ↑      ↑                ↑
+/−  which range     the digits

Value formula:

value = (−1)^S × 1.mantissa × 2^(exponent − 127)

The exponent picks a power-of-2 window.
The mantissa fills in the digits inside that window.
Gaps are small near zero, large far from zero:
Near 0:   ||||||||||||||||   ← dense
Near 1M:  |              |   ← sparse

Float formats at a glance
| Format | Bits | Exponent | Mantissa | ~Digits |
|---|---|---|---|---|
| fp32 | 32 | 8 | 23 | 7 |
| bf16 | 16 | 8 | 7 | 2 |
| fp16 | 16 | 5 | 10 | 3 |
| fp8 | 8 | 4–5 | 2–3 | 1 |
| int8 | 8 | – | – | integers |
| int4 | 4 | – | – | integers |
bf16 vs fp16: Both are 16-bit, different tradeoff.
bf16 keeps fp32's exponent range, so no overflow during training.
fp16 keeps more mantissa bits: more precise, but narrower range.
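You can see the three fp32 fields directly by unpacking a float's raw bits. A standard-library sketch (`fp32_fields` is just a throwaway helper name):

```python
import struct

def fp32_fields(x: float):
    """Split a float's 32-bit pattern into (sign, biased exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(fp32_fields(1.0))    # (0, 127, 0): +1.0 x 2^(127-127)
s, e, m = fp32_fields(-3.14)
print(s, e - 127)          # 1 1: negative sign, window [2, 4)
```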
3. What "Precision" Actually Means
Precision = how many distinct values you can represent.
More mantissa bits → finer grid → higher precision.
The formal measure: machine epsilon (ε), the gap between 1.0 and the next representable number.

fp32: ε ≈ 1.19 × 10⁻⁷  → ~7 decimal digits
fp16: ε ≈ 9.77 × 10⁻⁴  → ~3 decimal digits
bf16: ε ≈ 7.81 × 10⁻³  → ~2 decimal digits

fp16 is ~8,000× coarser than fp32 at the same magnitude.
What makes it "floating": relative precision stays roughly constant everywhere.
value       abs gap    rel gap
─────────   ────────   ────────
0.001       1.16e-10   ~1.2e-7
1.0         1.19e-07   ~1.2e-7
1,000,000   6.25e-02   ~6.3e-8

The absolute gap at 1,000,000 is 0.0625.
You cannot represent any value between 1000000.0 and 1000000.0625.
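NumPy makes this easy to verify; a small sketch:

```python
import numpy as np

x = np.float32(1_000_000.0)
gap = np.nextafter(x, np.float32(np.inf)) - x
print(gap)                        # 0.0625
print(x + np.float32(0.03) == x)  # True: 0.03 is under half a gap, so it vanishes
```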
4. Why AI Models Are a Memory Problem
Every weight is a number in memory.
Memory = num_parameters × bytes_per_weight

Bytes per weight:

fp32 → 4 bytes
fp16 → 2 bytes
int8 → 1 byte
int4 → 0.5 bytes

A 7B model:

fp32 : 7B × 4   = 28 GB
fp16 : 7B × 2   = 14 GB
int8 : 7B × 1   = 7 GB
int4 : 7B × 0.5 = 3.5 GB

That's just the weights. Total inference memory breakdown:
| Component | Typical Size | Notes |
|---|---|---|
| Model weights | 1–4 bytes/param | Depends on precision (fp16, int8, int4) |
| KV cache | 2–10× weights | Scales with context length; major memory hog |
| Activations | Variable | Depends on batch size and model size |
| CUDA overhead | ~1–2 GB | GPU runtime requirements |
Total ≈ weights + KV cache + activations + CUDA overhead
Training memory is far worse:
weights 4 bytes
gradients 4 bytes
Adam m (1st) 4 bytes
Adam v (2nd) 4 bytes
─────────────────────────
total          16 bytes / parameter

7B model training: 7B × 16 = 112 GB

This is why quantization went from optional to essential.
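The memory formula is simple enough to keep as a helper. A sketch (`model_memory_gb` is a hypothetical name; the training branch assumes the fp32 Adam layout above):

```python
def model_memory_gb(params_billion: float, bytes_per_weight: float = 2,
                    training: bool = False) -> float:
    """Weights-only inference memory, or the 16-byte/param fp32 Adam training footprint."""
    if training:
        return params_billion * 16   # weights + gradients + Adam m + Adam v, all fp32
    return params_billion * bytes_per_weight

print(model_memory_gb(7, 4))               # 28: 7B at fp32
print(model_memory_gb(7, 0.5))             # 3.5: 7B at int4
print(model_memory_gb(7, training=True))   # 112: 7B fp32 Adam training
```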
5. Quantization by Hand
Our example weights:
w = [0.32, −1.54, 0.87, −0.21, 1.23, −0.76, 0.05, −1.91]

A. Symmetric INT8

Symmetric = centered at zero. Range: [−127, +127]. No offset.

Step 1: Find the largest absolute value

abs_max = max(|0.32|, |−1.54|, |0.87|, ...)
        = 1.91

Step 2: Compute scale

scale = abs_max / 127
      = 1.91 / 127
      = 0.01504

Step 3: Quantize q = round(w / scale)
 0.32 / 0.01504 =   21.28 →   21
−1.54 / 0.01504 = −102.4  → −102
 0.87 / 0.01504 =   57.85 →   58
−0.21 / 0.01504 =  −13.96 →  −14
 1.23 / 0.01504 =   81.78 →   82
−0.76 / 0.01504 =  −50.53 →  −51
 0.05 / 0.01504 =    3.32 →    3
−1.91 / 0.01504 = −127.0  → −127

These integers are stored. 1 byte each instead of 4.
Step 4: Dequantize ŵ = q × scale

  21 × 0.01504 =  0.3158   original  0.32   Δ = +0.004
−102 × 0.01504 = −1.5341   original −1.54   Δ = −0.006
  58 × 0.01504 =  0.8723   original  0.87   Δ = −0.002
 −14 × 0.01504 = −0.2106   original −0.21   Δ = +0.001
  82 × 0.01504 =  1.2333   original  1.23   Δ = −0.003
 −51 × 0.01504 = −0.7670   original −0.76   Δ = +0.007
   3 × 0.01504 =  0.0451   original  0.05   Δ = +0.005
−127 × 0.01504 = −1.9101   original −1.91   Δ = 0.000

Max error: 0.007, barely noticeable.
B. Asymmetric INT4

Asymmetric = uses a zero-point to shift the grid.
Range: [0, 15], only 16 levels.
Step 1: Find min and max

min = −1.91
max = 1.23

Step 2: Compute scale

scale = (max − min) / (2⁴ − 1)
      = (1.23 − (−1.91)) / 15
      = 3.14 / 15
      = 0.2093

Step 3: Compute zero-point

zero_point = round(−min / scale)
           = round(1.91 / 0.2093)
           = round(9.12)
           = 9

Float 0.0 now maps to integer 9. The grid covers [−1.88, +1.26].
Step 4: Quantize q = clamp(round(w / scale) + zp, 0, 15)

 0.32: round( 1.53) + 9 =  2 + 9 = 11
−1.54: round(−7.36) + 9 = −7 + 9 =  2
 0.87: round( 4.16) + 9 =  4 + 9 = 13
−0.21: round(−1.00) + 9 = −1 + 9 =  8
 1.23: round( 5.88) + 9 =  6 + 9 = 15
−0.76: round(−3.63) + 9 = −4 + 9 =  5
 0.05: round( 0.24) + 9 =  0 + 9 =  9
−1.91: round(−9.12) + 9 = −9 + 9 =  0

Step 5: Dequantize ŵ = (q − zp) × scale
(11−9) × 0.2093 =  0.4187   original  0.32   Δ = −0.099
( 2−9) × 0.2093 = −1.4651   original −1.54   Δ = −0.075
(13−9) × 0.2093 =  0.8372   original  0.87   Δ = +0.033
( 8−9) × 0.2093 = −0.2093   original −0.21   Δ = −0.001
(15−9) × 0.2093 =  1.2558   original  1.23   Δ = −0.026
( 5−9) × 0.2093 = −0.8372   original −0.76   Δ = +0.077
( 9−9) × 0.2093 =  0.0000   original  0.05   Δ = +0.050
( 0−9) × 0.2093 = −1.8837   original −1.91   Δ = −0.026

Max error: 0.099, about 14× worse than INT8.
Why? INT8 has 256 levels, INT4 has 16. Each step is 14× wider.

INT8 step: 1.91 / 127 = 0.015  (fine)
INT4 step: 3.14 / 15  = 0.209  (coarse)

C. Per-Group INT4
Instead of one scale for all 8 values, compute one scale per group of 4.
Group 0: [0.32, −1.54, 0.87, −0.21] → scale = 0.161, zp = 10
Group 1: [1.23, −0.76, 0.05, −1.91] → scale = 0.209, zp = 9

Each group adapts its scale to its local range: fewer wasted levels.

Per-tensor: max error = 0.099
Per-group:  max error = 0.077  (1.3× better)

In real models with 128+ values per group, the improvement is much larger.
6. Quantization in Python
Symmetric INT8
import numpy as np
weights = np.array([0.32, -1.54, 0.87, -0.21,
1.23, -0.76, 0.05, -1.91])
# --- quantize ---
abs_max = np.max(np.abs(weights))
scale = abs_max / 127.0
q = np.round(weights / scale).astype(np.int8)
# scale = 0.015039
# q = [ 21 -102 58 -14 82 -51 3 -127]
# --- dequantize ---
recovered = q.astype(np.float32) * scale
errors = weights - recovered
print(f"max error : {np.max(np.abs(errors)):.6f}") # 0.007008
print(f"mean error: {np.mean(np.abs(errors)):.6f}") # 0.003514
print(f"memory : {weights.nbytes}B β {q.nbytes}B ({weights.nbytes//q.nbytes}Γ smaller)")
# memory : 64B β 8B (8Γ smaller)Asymmetric INT4
def quantize_int4(weights):
    w_min = weights.min()
    w_max = weights.max()
    scale = (w_max - w_min) / 15
    zp = int(np.round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zp, 0, 15)
    return q.astype(np.int8), scale, zp

def dequantize_int4(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

q4, s4, zp4 = quantize_int4(weights)
r4 = dequantize_int4(q4, s4, zp4)
# scale = 0.2093, zero_point = 9
# q = [11 2 13 8 15 5 9 0]
# max error = 0.098667

Per-Group INT4
def quantize_per_group(weights, group_size=4):
    n = len(weights)
    q_out = np.zeros(n, dtype=np.int8)
    dq_out = np.zeros(n, dtype=np.float32)
    for g in range(n // group_size):
        lo = g * group_size
        hi = lo + group_size
        chunk = weights[lo:hi]
        mn = chunk.min()
        mx = chunk.max()
        scale = (mx - mn) / 15
        zp = int(np.round(-mn / scale))
        q = np.clip(np.round(chunk / scale) + zp, 0, 15)
        dq = (q.astype(np.float32) - zp) * scale
        q_out[lo:hi] = q
        dq_out[lo:hi] = dq
    return q_out, dq_out

_, dq_g = quantize_per_group(weights, group_size=4)
print(f"per-tensor max error: {np.max(np.abs(weights - r4)):.4f}")   # 0.0987
print(f"per-group max error : {np.max(np.abs(weights - dq_g)):.4f}") # 0.0773

Machine Epsilon
for dtype, name in [(np.float32, 'fp32'), (np.float16, 'fp16')]:
    eps = np.finfo(dtype).eps
    nxt = np.nextafter(dtype(1.0), dtype(2.0))
    gap = nxt - dtype(1.0)
    print(f"{name}: ε = {eps:.2e}   gap after 1.0 = {gap:.2e}")
# fp32: ε = 1.19e-07   gap after 1.0 = 1.19e-07
# fp16: ε = 9.77e-04   gap after 1.0 = 9.77e-04

v = 1.23456789
print(f"fp32: {np.float32(v):.8f}   error: {abs(v - float(np.float32(v))):.2e}")
print(f"fp16: {np.float16(v):.8f}   error: {abs(v - float(np.float16(v))):.2e}")
# fp32: 1.23456788   error: 9.37e-09
# fp16: 1.23437500   error: 1.93e-04

7. Types of Quantization
By timing
Post-Training Quantization (PTQ): compress after training.

trained fp32 model
        ↓
~128 calibration samples
        ↓
measure activation ranges
        ↓
compute scale + zero-point
        ↓
quantized model

✅ Fast. No retraining needed.
⚠️ Some accuracy loss, especially below int8.
Quantization-Aware Training (QAT): simulate quantization during training.

training loop:
    forward pass
        ↓
    fake-quantize weights   (round then unround, simulating int4)
        ↓
    compute loss
        ↓
    backward pass   (gradients still flow)
        ↓
    update weights
        ↓ (repeat)
export truly quantized model

✅ Best accuracy at int4/int2.
⚠️ Requires full retraining.
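The fake-quantize step can be sketched in NumPy. In real QAT the backward pass uses a straight-through estimator (the gradient of `round` is treated as identity); this sketch shows only the forward rounding, assuming a symmetric int4 grid:

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Round onto the int grid, then map straight back to float;
    these are the weight values a QAT forward pass actually sees."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for a symmetric int4 grid
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale   # still float, but only ~2^bits distinct values

w = np.array([0.32, -1.54, 0.87, -0.21])
print(fake_quantize(w, bits=4))          # [ 0.22 -1.54  0.88 -0.22]
```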
By granularity
Per-tensor: one scale for the whole matrix.

┌──────────────────────┐
│ scale = 0.021        │
│  0.32  −1.54   0.87  │
│ −0.21   1.23  −0.76  │
└──────────────────────┘

Simple. Fastest. Least accurate.

Per-channel: one scale per row.

s₀ →  [  0.32  −1.54   0.87 ]
s₁ →  [ −0.21   1.23  −0.76 ]
s₂ →  [  0.05  −1.91   0.44 ]

Standard for int8. Noticeably better accuracy.

Per-group: one scale per N values within a row.

row: [0.32, −1.54, 0.87, −0.21 | 1.23, −0.76, 0.05, −1.91]
      └─── group 0 (s=0.161) ─┘ └─── group 1 (s=0.209) ──┘

Best accuracy. Used by GPTQ, AWQ, GGUF. Common sizes: 32, 64, 128.
By target
| Target | What's quantized | Main benefit |
|---|---|---|
| Weight-only | Parameters only | Memory savings |
| Weight + activation | Parameters + layer outputs | Full compute speedup |
Weights are easy: stable distributions, computed once.
Activations are harder: they change with every input and are prone to outliers.
8. How a Real Model Gets Quantized
End-to-end with GPTQ (most common PTQ for LLMs).
Step 1: Load in fp16

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype="float16",
    device_map="auto",
)
# VRAM: ~14 GB

Step 2: Calibration data
samples = [
    tokenizer(s["text"], return_tensors="pt",
              max_length=512, truncation=True)
    for s, _ in zip(dataset, range(128))
]

128 samples reveal how activations are distributed in real use.
Step 3: Layer-by-layer quantization

for each weight matrix:
    1. run calibration → observe input activations
    2. compute Hessian (which weights matter most)
    3. quantize column by column
    4. compensate remaining columns for the error introduced
    5. store: int4 weights + fp16 scales + int8 zero-points

Step 4 is the key insight: not naive rounding.
Step 4: Mixed precision

Not every layer quantizes equally well:

Layer            Precision   Reason
────────────────────────────────────────────
Embeddings       fp16        critical, small
First attention  int8        seen on every token
Middle layers    int4        bulk of parameters
LM head          fp16        picks next token directly

This is why GGUF filenames encode the strategy:
Q4_K_M.gguf
  Q4 = 4-bit weights
  K  = k-quants (smarter grouping)
  M  = medium (some layers at higher precision)

Other variants:
  Q8_0   = 8-bit, best quality
  Q4_K_S = more int4, smaller file
  Q2_K   = 2-bit, smallest, worst quality

Step 5: Save and run
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

config = BaseQuantizeConfig(bits=4, group_size=128)
model_q = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-3-8B", config)
model_q.quantize(samples)
model_q.save_quantized("llama-3-8b-gptq-int4")
# Saved: ~4.5 GB (was 14 GB)

9. Memory Requirements
Model weights
Memory (GB) = params (B) × bytes per weight

        fp32   fp16   int8   int4
        ────   ────   ────   ────
 1B:       4      2      1    0.5
 7B:      28     14      7    3.5
13B:      52     26     13    6.5
70B:     280    140     70     35

KV cache
KV cache =
      2 (key + value)
    × num_layers
    × num_kv_heads
    × head_dim
    × context_length
    × bytes_per_element

Llama 3 8B, ctx=8192, fp16:
2 × 32 × 8 × 128 × 8192 × 2 ≈ 1 GB per sequence

A 7B int4 model (3.5 GB weights) at 8K context can still need ~6–12 GB total, depending on attention layout (GQA vs full MHA), batch size, and runtime overhead.
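The formula translates directly to code. A sketch (`kv_cache_gb` is a throwaway helper; the Llama 3 8B GQA layout of 32 layers, 8 KV heads, head_dim 128 is assumed, and batch size multiplies the result):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int,
                bytes_per_el: int = 2, batch: int = 1) -> float:
    """2 tensors (K and V) per layer, one entry per head, dim, and position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_el * batch / 1e9

# Llama 3 8B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16
print(kv_cache_gb(32, 8, 128, 8192))            # ~1.07 GB per sequence
print(kv_cache_gb(32, 8, 128, 8192, batch=8))   # batching multiplies it
```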
Training memory
weights 4 bytes
gradients 4 bytes
Adam m 4 bytes
Adam v 4 bytes
─────────────────────
total       16 bytes / parameter

 7B:  7B × 16 = 112 GB
13B: 13B × 16 = 208 GB

GPU guide
GPU             VRAM     Max model (int4)
──────────────────────────────────────────
RTX 3080        10 GB    7B  (tight)
RTX 3090/4090   24 GB    13B (comfortable)
A100 40GB       40 GB    30B (comfortable)
A100 80GB       80 GB    70B (comfortable)
2× A100 80GB    160 GB   70B in fp16

10. Tradeoffs
Accuracy vs bits
Bits   Compression   Accuracy loss
──────────────────────────────────────────────
fp16   2×            < 0.1%   (imperceptible)
int8   4×            < 1%     (rarely noticeable)
int4   8×            1–5%     (noticeable on hard tasks)
int2   16×           10–30%   (often unusable)

Loss is worse on: reasoning tasks, long contexts, smaller models.
Speed: it's bandwidth, not compute
LLM inference is memory-bandwidth bound: the GPU spends most of its time reading weights, not multiplying them.
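A back-of-the-envelope roofline makes this concrete: each generated token must stream every weight byte through memory once, so bandwidth sets a hard ceiling (the 1000 GB/s figure below is a round, hypothetical number for a high-end GPU):

```python
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    """Decoding ceiling: every token streams all weights through memory once."""
    return bandwidth_gb_s / model_gb

print(max_tokens_per_sec(model_gb=14, bandwidth_gb_s=1000))  # fp16 8B: ~71 tok/s ceiling
print(max_tokens_per_sec(model_gb=4, bandwidth_gb_s=1000))   # int4 8B: 250 tok/s ceiling
```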
int4 = 8× fewer bytes to read
     → 2–4× real speedup (after overhead)

For int8 compute speedup you need compatible tensor cores:
Supported:     NVIDIA A100, RTX 30xx/40xx, Apple M-series
Not supported: older GPUs (int8 may actually be slower)

Granularity vs overhead
| Method | Accuracy | Memory overhead |
|---|---|---|
| Per-tensor | lowest | 2 values total |
| Per-channel | better | 2 per row |
| Per-group g=128 | good | ~3β5% |
| Per-group g=32 | best | ~12% |
The outlier problem
In transformers, ~0.1% of activation values are 100× larger than the rest.
normal:  [0.10, 0.18,  0.15, 0.20, ...]
outlier: [0.10, 0.18, 127.4, 0.20, ...]
                  ↓
forces scale = 127.4 / 127 = 1.003

now: 0.10 → round(0.10 / 1.003) = 0
     0.18 → round(0.18 / 1.003) = 0
     0.20 → round(0.20 / 1.003) = 0

99.9% of values collapse to zero. Precision destroyed.

Solutions:
| Method | Approach |
|---|---|
| LLM.int8() | outlier channels stay fp16, rest in int8 |
| SmoothQuant | move outlier magnitude from activations to weights |
| AWQ | protect weights linked to outlier channels |
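The collapse, and the LLM.int8()-style fix of keeping outliers in float, can be reproduced in a few lines (the 6.0 threshold follows the paper; the tiny array is made up):

```python
import numpy as np

acts = np.array([0.10, 0.18, 0.15, 0.20, 127.4])

# One scale for everything: the outlier sets it, everything else rounds to 0
scale = np.max(np.abs(acts)) / 127
q = np.round(acts / scale)
print(q)                      # [  0.   0.   0.   0. 127.]

# LLM.int8()-style split: outliers stay float, the rest get their own scale
mask = np.abs(acts) > 6.0     # 6.0 is the threshold used in the paper
small = acts[~mask]
s2 = np.max(np.abs(small)) / 127
print(np.round(small / s2))   # small values now spread across the int8 range
```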
11. Tools You'll Actually Use
bitsandbytes: easiest start
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, better than plain int4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 after dequant
    bnb_4bit_use_double_quant=True,         # quantize the scales too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb,
    device_map="auto",
)

llama.cpp / Ollama: local inference
ollama run llama3   # picks best quant for your hardware automatically

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,   # offload all layers to GPU
)
out = llm("What is quantization?", max_tokens=200)

AWQ: best accuracy at int4
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4},
)
model.save_quantized("llama-3-8b-awq")
model = AutoAWQForCausalLM.from_quantized("llama-3-8b-awq", fuse_layers=True)

Choose your tool
| Goal | Tool | Format |
|---|---|---|
| Quickest start | bitsandbytes | in-memory |
| Best int4 accuracy | AWQ or GPTQ | .safetensors |
| Local / offline | llama.cpp, Ollama | .gguf |
| Production NVIDIA | TensorRT-LLM | TRT engine |
| Mobile / edge | TFLite (QAT) | .tflite |
12. Summary
The core idea
Map a continuous float range onto a discrete integer grid.
Store the index. Recover with a scale factor.
Worst-case error = half a grid step. Fewer bits = wider steps = more error.
The two formulas

Quantize:
q = clamp( round(w / scale) + zero_point, min_int, max_int )

Dequantize:
ŵ = (q − zero_point) × scale

Symmetric (for weights):
zero_point = 0
scale = abs_max / max_int

Asymmetric (for activations):
scale = (max − min) / (2^bits − 1)
zero_point = round(−min / scale)

Three decisions every quantization involves
1. Bit width

int4 ←──────────────────────→ fp32
small / fast / lossy    large / slow / lossless

2. Granularity

per-tensor ←────────────────→ per-group (g=32)
simple                  accurate

3. Timing

PTQ ←───────────────────────→ QAT
no retraining      full retraining

Quick use-case guide
| Situation | Recommendation |
|---|---|
| Run 7B on a laptop | int4 GGUF via Ollama |
| Run 70B on 1β2 GPUs | int4 GPTQ or AWQ |
| Fine-tune a quantized model | QLoRA (LoRA on int4 base) |
| Production NVIDIA serving | int8 TensorRT |
| Train from scratch, low VRAM | bf16 + gradient checkpointing |
Further Reading
- GPTQ paper: accurate post-training int4
- AWQ paper: activation-aware weight quantization
- SmoothQuant: handling activation outliers
- LLM.int8(): bitsandbytes 8-bit method
- llama.cpp: GGUF reference implementation
- HF Quantization Guide: practical overview
All calculations verified by hand and in NumPy.