
Quantization - Making Giant AI Models Fit in the Real World

Mar 28, 2026 • 13 min read • By Mohammed Vasim
Tags: LLM · Quantization · Math

From zero to practical: hand math, Python, and honest tradeoffs.

1. What Is Quantization?

You have the number 3.14159265358979.

You only have space for two decimal places. So you write 3.14.

That's quantization: deliberately using less precision to save space, accepting a tiny error.

In AI: model weights stored as 32-bit floats get re-represented with fewer bits (16, 8, or 4). Less memory. Faster inference. Small accuracy cost.

Why it matters:

| Model | fp32 | int4 | Savings |
|---|---|---|---|
| Llama 3 8B | 32 GB | 4 GB | 8× |
| Llama 3 70B | 280 GB | 35 GB | 8× |
| GPT-3 175B | 700 GB | 87 GB | 8× |

A 70B model at fp32 needs 4× A100 GPUs just to load.
At int4, it fits on a single RTX 4090.
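The table's numbers fall out of a single multiplication. A quick sanity check, treating 1 GB as 10⁹ bytes and ignoring runtime overhead:

```python
# Rough weight-only memory: parameters × bytes per weight.
def weight_memory_gb(params_billion, bits):
    return params_billion * bits / 8

for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70), ("GPT-3 175B", 175)]:
    fp32 = weight_memory_gb(params, 32)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: fp32 {fp32:.0f} GB, int4 {int4:.1f} GB, {fp32 / int4:.0f}x smaller")
```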


2. How Numbers Live in Memory

Every number is stored as bits: 0s and 1s.

Integers

INT8 = 8 bits = 2⁸ = 256 possible values (–128 to +127)

```markdown
┌─┬─┬─┬─┬─┬─┬─┬─┐
│S│  magnitude  │
└─┴─┴─┴─┴─┴─┴─┴─┘
 ↑
sign bit
```

(In practice int8 uses two's complement; the sign-bit picture is a simplification.)

Gaps between values are always equal: uniform spacing.

```markdown
–3  –2  –1   0   1   2   3
 |   |   |   |   |   |   |
←── equal gaps everywhere ──→
```

Floating Point

FP32 = 32 bits, split into 3 fields:

```markdown
┌──┬──────────┬──────────────────────┐
│S │ Exponent │       Mantissa       │
│1b│  8 bits  │       23 bits        │
└──┴──────────┴──────────────────────┘
 ↑      ↑                ↑
+/–   which range    the digits
```

Value formula:

```markdown
value = (–1)^S × 1.mantissa × 2^(exponent – 127)
```

The exponent picks a power-of-2 window.
The mantissa fills in the digits inside that window.

Gaps are small near zero, large far from zero:

```markdown
Near 0:    ||||||||||||||||   → dense
Near 1M:   |        |         → sparse
```
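You can see the three fields in a real number by pulling its bits apart with Python's `struct` module. This is a sketch that handles only normal numbers (no zeros, denormals, infs, or NaNs):

```python
import struct

def fp32_fields(x):
    # Reinterpret the float's 32 bits as an unsigned int, then slice the fields.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF          # 23 bits, with an implicit leading 1
    # Reassemble using the value formula above (normal numbers only):
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

s, e, m, v = fp32_fields(3.14)
print(s, e, m, v)   # exponent 128 means the [2, 4) window; value ≈ 3.14
```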

Float formats at a glance

| Format | Bits | Exponent | Mantissa | ~Digits |
|---|---|---|---|---|
| fp32 | 32 | 8 | 23 | 7 |
| bf16 | 16 | 8 | 7 | 2 |
| fp16 | 16 | 5 | 10 | 3 |
| fp8 | 8 | 4–5 | 2–3 | 1 |
| int8 | 8 | — | — | integers |
| int4 | 4 | — | — | integers |

bf16 vs fp16: both are 16-bit, with different tradeoffs.
bf16 keeps fp32's exponent range → no overflow during training.
fp16 keeps more mantissa bits → more precise, but narrower range.


3. What "Precision" Actually Means

Precision = how many distinct values you can represent.

More mantissa bits → finer grid → higher precision.

The formal measure: machine epsilon (ε), the gap between 1.0 and the next representable number.

```markdown
fp32:  ε ≈ 1.19 × 10⁻⁷   →  ~7 decimal digits
fp16:  ε ≈ 9.77 × 10⁻⁴   →  ~3 decimal digits
bf16:  ε ≈ 7.81 × 10⁻³   →  ~2 decimal digits
```

fp16 is ~8,000× coarser than fp32 at the same magnitude.

What makes it "floating": relative precision stays roughly constant everywhere.

```markdown
value          abs gap       rel gap
──────────     ─────────     ────────
0.001          1.16e-10      ~1.2e-7
1.0            1.19e-07      ~1.2e-7
1,000,000      6.25e-02      ~6.3e-8
```

The absolute gap at 1,000,000 is 0.0625.
You cannot represent any value between 1000000.0 and 1000000.0625.
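NumPy can confirm that 0.0625 gap directly:

```python
import numpy as np

x = np.float32(1_000_000.0)
print(np.spacing(x))                 # 0.0625: the gap to the next float32

# Anything smaller than half a gap is absorbed entirely:
print(x + np.float32(0.03) == x)     # True
print(x + np.float32(0.0625) - x)    # lands exactly one step up: 0.0625
```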


4. Why AI Models Are a Memory Problem

Every weight is a number in memory.

```markdown
Memory = num_parameters × bytes_per_weight
```

Bytes per weight:

```markdown
fp32  →  4 bytes
fp16  →  2 bytes
int8  →  1 byte
int4  →  0.5 bytes
```

A 7B model:

```markdown
fp32 :  7B × 4   =  28 GB
fp16 :  7B × 2   =  14 GB
int8 :  7B × 1   =   7 GB
int4 :  7B × 0.5 =  3.5 GB
```

That's just the weights. Total inference memory breakdown:

| Component | Typical size | Notes |
|---|---|---|
| Model weights | 1–4 bytes/param | Depends on precision (fp16, int8, int4) |
| KV cache | grows with context × batch | Can rival the weights at long contexts |
| Activations | variable | Depends on batch size and model size |
| CUDA overhead | ~1–2 GB | GPU runtime requirements |

Total ≈ weights + KV cache + activations + CUDA overhead

Training memory is far worse:

```markdown
weights          4 bytes
gradients        4 bytes
Adam m (1st)     4 bytes
Adam v (2nd)     4 bytes
─────────────────────────
total           16 bytes / parameter

7B model training:  7B × 16 = 112 GB
```

This is why quantization went from optional to essential.


5. Quantization by Hand

Our example weights:

```markdown
w = [0.32, –1.54, 0.87, –0.21, 1.23, –0.76, 0.05, –1.91]
```

A — Symmetric INT8

Symmetric = centered at zero. Range: [–127, +127]. No offset.

Step 1 — Find the largest absolute value

```markdown
abs_max = max(|0.32|, |–1.54|, |0.87|, ...)
        = 1.91
```

Step 2 — Compute scale

```markdown
scale = abs_max / 127
      = 1.91 / 127
      = 0.01504
```

Step 3 — Quantize q = round(w / scale)

```markdown
 0.32 / 0.01504 =   21.28  →   21
–1.54 / 0.01504 = –102.40  → –102
 0.87 / 0.01504 =   57.85  →   58
–0.21 / 0.01504 =  –13.96  →  –14
 1.23 / 0.01504 =   81.78  →   82
–0.76 / 0.01504 =  –50.53  →  –51
 0.05 / 0.01504 =    3.32  →    3
–1.91 / 0.01504 = –127.00  → –127
```

These integers are stored: 1 byte each instead of 4.

Step 4 — Dequantize w̃ = q × scale  (Δ = original – recovered)

```markdown
  21 × 0.01504 =  0.3158    original  0.32   Δ = +0.004
–102 × 0.01504 = –1.5341    original –1.54   Δ = –0.006
  58 × 0.01504 =  0.8723    original  0.87   Δ = –0.002
 –14 × 0.01504 = –0.2106    original –0.21   Δ = +0.001
  82 × 0.01504 =  1.2333    original  1.23   Δ = –0.003
 –51 × 0.01504 = –0.7670    original –0.76   Δ = +0.007
   3 × 0.01504 =  0.0451    original  0.05   Δ = +0.005
–127 × 0.01504 = –1.9101    original –1.91   Δ =  0.000
```

Max error: 0.007, barely noticeable.


B — Asymmetric INT4

Asymmetric = uses a zero-point to shift the grid.
Range: [0, 15], only 16 levels.

Step 1 — Find min and max

```markdown
min = –1.91
max =  1.23
```

Step 2 — Compute scale

```markdown
scale = (max – min) / (2⁴ – 1)
      = (1.23 – (–1.91)) / 15
      = 3.14 / 15
      = 0.2093
```

Step 3 — Compute zero-point

```markdown
zero_point = round(–min / scale)
           = round(1.91 / 0.2093)
           = round(9.12)
           = 9
```

Float 0.0 now maps to integer 9. The grid covers [–1.88, +1.26].

Step 4 — Quantize q = clamp(round(w / scale) + zp, 0, 15)

```markdown
 0.32:  round( 1.53) + 9 =  2 + 9 = 11
–1.54:  round(–7.36) + 9 = –7 + 9 =  2
 0.87:  round( 4.16) + 9 =  4 + 9 = 13
–0.21:  round(–1.00) + 9 = –1 + 9 =  8
 1.23:  round( 5.88) + 9 =  6 + 9 = 15
–0.76:  round(–3.63) + 9 = –4 + 9 =  5
 0.05:  round( 0.24) + 9 =  0 + 9 =  9
–1.91:  round(–9.12) + 9 = –9 + 9 =  0
```

Step 5 — Dequantize w̃ = (q – zp) × scale  (Δ = original – recovered)

```markdown
(11–9) × 0.2093 =  0.4187    original  0.32   Δ = –0.099
( 2–9) × 0.2093 = –1.4651    original –1.54   Δ = –0.075
(13–9) × 0.2093 =  0.8372    original  0.87   Δ = +0.033
( 8–9) × 0.2093 = –0.2093    original –0.21   Δ = –0.001
(15–9) × 0.2093 =  1.2558    original  1.23   Δ = –0.026
( 5–9) × 0.2093 = –0.8372    original –0.76   Δ = +0.077
( 9–9) × 0.2093 =  0.0000    original  0.05   Δ = +0.050
( 0–9) × 0.2093 = –1.8837    original –1.91   Δ = –0.026
```

Max error: 0.099, about 14× worse than INT8.

Why? INT8 has 256 levels, INT4 has 16. Each step is ~14× wider:

```markdown
INT8 step:  1.91 / 127 = 0.015   (fine)
INT4 step:  3.14 / 15  = 0.209   (coarse)
```

C — Per-Group INT4

Instead of one scale for all 8 values, compute one scale per group of 4.

```markdown
Group 0: [0.32, –1.54,  0.87, –0.21]  →  scale = 0.161,  zp = 10
Group 1: [1.23, –0.76,  0.05, –1.91]  →  scale = 0.209,  zp = 9
```

Each group adapts its scale to its local range: fewer wasted levels.

```markdown
Per-tensor:  max error = 0.099
Per-group:   max error = 0.077   (1.3× better)
```

In real models with 128+ values per group, the improvement is much larger.


6. Quantization in Python

Symmetric INT8

```python
import numpy as np

weights = np.array([0.32, -1.54, 0.87, -0.21,
                    1.23, -0.76, 0.05, -1.91])

# --- quantize ---
abs_max = np.max(np.abs(weights))
scale   = abs_max / 127.0
q       = np.round(weights / scale).astype(np.int8)

# scale = 0.015039
# q     = [  21 -102   58  -14   82  -51    3 -127]

# --- dequantize ---
recovered = q.astype(np.float32) * scale
errors    = weights - recovered

print(f"max error : {np.max(np.abs(errors)):.6f}")   # 0.007008
print(f"mean error: {np.mean(np.abs(errors)):.6f}")  # 0.003514
print(f"memory    : {weights.nbytes}B → {q.nbytes}B  ({weights.nbytes // q.nbytes}× smaller)")
# memory    : 64B → 8B  (8× smaller)
```

Asymmetric INT4

```python
def quantize_int4(weights):
    w_min = weights.min()
    w_max = weights.max()
    scale = (w_max - w_min) / 15
    zp    = int(np.round(-w_min / scale))
    q     = np.clip(np.round(weights / scale) + zp, 0, 15)
    return q.astype(np.int8), scale, zp

def dequantize_int4(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

q4, s4, zp4 = quantize_int4(weights)
r4          = dequantize_int4(q4, s4, zp4)

# scale = 0.2093,  zero_point = 9
# q     = [11  2 13  8 15  5  9  0]
# max error = 0.098667
```

Per-Group INT4

```python
def quantize_per_group(weights, group_size=4):
    n      = len(weights)
    q_out  = np.zeros(n, dtype=np.int8)
    dq_out = np.zeros(n, dtype=np.float32)

    for g in range(n // group_size):
        lo    = g * group_size
        hi    = lo + group_size
        chunk = weights[lo:hi]

        mn    = chunk.min()
        mx    = chunk.max()
        scale = (mx - mn) / 15
        zp    = int(np.round(-mn / scale))

        q  = np.clip(np.round(chunk / scale) + zp, 0, 15)
        dq = (q.astype(np.float32) - zp) * scale

        q_out[lo:hi]  = q.astype(np.int8)
        dq_out[lo:hi] = dq

    return q_out, dq_out

_, dq_g = quantize_per_group(weights, group_size=4)

print(f"per-tensor max error: {np.max(np.abs(weights - r4)):.4f}")    # 0.0987
print(f"per-group  max error: {np.max(np.abs(weights - dq_g)):.4f}")  # 0.0773
```

Machine Epsilon

```python
for dtype, name in [(np.float32, 'fp32'), (np.float16, 'fp16')]:
    eps = np.finfo(dtype).eps
    nxt = np.nextafter(dtype(1.0), dtype(2.0))
    gap = nxt - dtype(1.0)
    print(f"{name}:  ε = {eps:.2e}   gap after 1.0 = {gap:.2e}")

# fp32:  ε = 1.19e-07   gap after 1.0 = 1.19e-07
# fp16:  ε = 9.77e-04   gap after 1.0 = 9.77e-04

v = 1.23456789
print(f"fp32: {np.float32(v):.8f}   error: {abs(v - float(np.float32(v))):.2e}")
print(f"fp16: {np.float16(v):.8f}   error: {abs(v - float(np.float16(v))):.2e}")
# fp32: 1.23456788   error: 9.37e-09
# fp16: 1.23437500   error: 1.93e-04
```

7. Types of Quantization

By timing

Post-Training Quantization (PTQ) — compress after training.

```markdown
trained fp32 model
        ↓
  ~128 calibration samples
        ↓
  measure activation ranges
        ↓
  compute scale + zero-point
        ↓
  quantized model
```

✅ Fast. No retraining needed.
⚠️ Some accuracy loss, especially below int8.


Quantization-Aware Training (QAT) — simulate quantization during training.

```markdown
training loop:
  forward pass
        ↓
  fake-quantize weights   ← round then unround (simulate int4)
        ↓
  compute loss
        ↓
  backward pass           ← gradients still flow
        ↓
  update weights
        ↓  (repeat)
  export truly quantized model
```

✅ Best accuracy at int4/int2.
⚠️ Requires full retraining.
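The fake-quantize step is just round-then-unround. A minimal numpy sketch of the forward transform only; the straight-through gradient trick is left to the training framework:

```python
import numpy as np

def fake_quantize(w, bits=4):
    # Quantize then immediately dequantize: the tensor stays float,
    # but only values on the integer grid survive (simulating int4).
    qmax  = 2 ** (bits - 1) - 1              # symmetric range, e.g. ±7 for 4-bit
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

w = np.array([0.32, -1.54, 0.87, -0.21])
print(fake_quantize(w))   # values snapped to the 4-bit grid
```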


By granularity

Per-tensor — one scale for the whole matrix.

```markdown
┌──────────────────────┐
│  scale = 0.021       │
│  0.32  –1.54   0.87  │
│ –0.21   1.23  –0.76  │
└──────────────────────┘
```

Simple. Fastest. Least accurate.


Per-channel — one scale per row.

```markdown
s₀ → │  0.32  –1.54   0.87  │
s₁ → │ –0.21   1.23  –0.76  │
s₂ → │  0.05  –1.91   0.44  │
```

Standard for int8. Noticeably better accuracy.


Per-group — one scale per N values within a row.

```markdown
row: [0.32, –1.54, 0.87, –0.21 | 1.23, –0.76, 0.05, –1.91]
      ─── group 0 (s=0.161) ───   ─── group 1 (s=0.209) ───
```

Best accuracy. Used by GPTQ, AWQ, GGUF. Common group sizes: 32, 64, 128.


By target

| Target | What's quantized | Main benefit |
|---|---|---|
| Weight-only | Parameters only | Memory savings |
| Weight + activation | Parameters + layer outputs | Full compute speedup |

Weights are easy: stable distributions, quantized once.
Activations are harder: they change with every input and are prone to outliers.


8. How a Real Model Gets Quantized

End-to-end with GPTQ (most common PTQ for LLMs).

Step 1 — Load in fp16

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype="float16",
    device_map="auto",
)
# VRAM: ~14 GB
```

Step 2 — Calibration data

```python
samples = [
    tokenizer(s["text"], return_tensors="pt",
              max_length=512, truncation=True)
    for s, _ in zip(dataset, range(128))
]
```

128 samples reveal how activations are distributed in real use.

Step 3 — Layer-by-layer quantization

```markdown
for each weight matrix:
  1. run calibration → observe input activations
  2. compute Hessian  (which weights matter most)
  3. quantize column by column
  4. compensate remaining columns for the error introduced
  5. store: int4 weights + fp16 scales + int8 zero-points
```

Step 4 is the key insight: error compensation, not naive rounding.

Step 4 — Mixed precision

Not every layer quantizes equally well:

```markdown
Layer             Precision   Reason
────────────────────────────────────────
Embeddings        fp16        critical, small
First attention   int8        seen on every token
Middle layers     int4        bulk of parameters
LM head           fp16        picks the next token directly
```

This is why GGUF filenames encode the strategy:

```markdown
Q4_K_M.gguf

Q4  = 4-bit weights
K   = k-quants (smarter grouping)
M   = medium (some layers at higher precision)

Other variants:
  Q8_0    = 8-bit, best quality
  Q4_K_S  = more int4, smaller file
  Q2_K    = 2-bit, smallest, worst quality
```

Step 5 — Save and run

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

config  = BaseQuantizeConfig(bits=4, group_size=128)
model_q = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-3-8B", config)
model_q.quantize(samples)
model_q.save_quantized("llama-3-8b-gptq-int4")
# Saved: ~4.5 GB  (was 14 GB)
```

9. Memory Requirements

Model weights

```markdown
Memory (GB) = params (B) × bytes per weight

           fp32   fp16   int8   int4
           ────   ────   ────   ────
1B:         4      2      1     0.5
7B:        28     14      7     3.5
13B:       52     26     13     6.5
70B:      280    140     70     35
```

KV cache

```markdown
KV cache =
    2                   (key + value)
  × num_layers
  × num_kv_heads
  × head_dim
  × context_length
  × bytes_per_element

Llama 3 8B, ctx=8192, fp16:
  2 × 32 × 8 × 128 × 8192 × 2 ≈ 1 GB
```

A 7B int4 model (3.5 GB weights) at 8K context needs roughly 7 GB total once the KV cache, activations, and CUDA overhead are added.
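The formula above as a helper, using GiB and Llama 3 8B's published shape (32 layers, 8 KV heads, head_dim 128):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, per KV head, per position
    total_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_elem
    return total_bytes / 2**30

# Llama 3 8B at 8K context, fp16:
print(f"{kv_cache_gb(32, 8, 128, 8192):.1f} GB")   # 1.0 GB
```

Doubling the context doubles the cache, which is why long-context serving is dominated by KV memory, not weights.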

Training memory

```markdown
weights      4 bytes
gradients    4 bytes
Adam m       4 bytes
Adam v       4 bytes
─────────────────────
total       16 bytes / parameter

7B:   7B × 16 = 112 GB
13B: 13B × 16 = 208 GB
```

GPU guide

```markdown
GPU                VRAM    Max model (int4)
──────────────────────────────────────────
RTX 3080          10 GB   7B  (tight)
RTX 3090/4090     24 GB   13B (comfortable)
A100 40GB         40 GB   30B (comfortable)
A100 80GB         80 GB   70B (comfortable)
2× A100 80GB     160 GB   70B in fp16
```

10. Tradeoffs

Accuracy vs bits

```markdown
Bits   Compression   Accuracy loss
─────────────────────────────────────
fp16      2×         < 0.1%  (imperceptible)
int8      4×         < 1%    (rarely noticeable)
int4      8×         1–5%    (noticeable on hard tasks)
int2     16×         10–30%  (often unusable)
```

Loss is worse on reasoning tasks, long contexts, and smaller models.
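The raw rounding error behind that table is easy to see empirically. A sketch on synthetic Gaussian weights; the percentages above are task-level accuracy, this only shows how fast the weight-space error grows as bits shrink:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)    # weight-like values

for bits in (8, 4, 2):
    qmax  = 2 ** (bits - 1) - 1           # symmetric grid: ±127, ±7, ±1
    scale = np.abs(w).max() / qmax
    err   = np.abs(w - np.round(w / scale) * scale).max()
    print(f"int{bits}: step = {scale:.5f}, max error = {err:.5f}")
```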


Speed: it's bandwidth, not compute

LLM inference is memory-bandwidth bound: the GPU spends most of its time reading weights, not multiplying them.

```markdown
int4 = 8× fewer bytes to read
     → 2–4× real speedup (after overhead)
```

For int8 compute speedup you need compatible tensor cores:

```markdown
Supported:     NVIDIA A100, RTX 30xx/40xx, Apple M-series
Not supported: older GPUs  (int8 may actually be slower)
```

Granularity vs overhead

| Method | Accuracy | Memory overhead |
|---|---|---|
| Per-tensor | lowest | 2 values total |
| Per-channel | better | 2 per row |
| Per-group g=128 | good | ~3–5% |
| Per-group g=32 | best | ~12% |

The outlier problem

In transformers, ~0.1% of activation values can be 100× larger than the rest.

```markdown
normal:  [0.10, 0.18, 0.15, 0.20, ...]
outlier: [0.10, 0.18, 127.4, 0.20, ...]
                       ↑
           forces scale = 127.4 / 127 = 1.003

now:  0.10 → round(0.10 / 1.003) = 0
      0.18 → round(0.18 / 1.003) = 0
      0.20 → round(0.20 / 1.003) = 0
```

99.9% of values collapse to zero. Precision destroyed.
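A minimal numpy sketch reproduces the collapse (symmetric int8 round-trip, with the illustrative values from the diagram):

```python
import numpy as np

def int8_roundtrip(x):
    # Symmetric int8 quantize + dequantize in one step.
    scale = np.abs(x).max() / 127
    return np.round(x / scale) * scale

normal  = np.array([0.10, 0.18, 0.15, 0.20])
outlier = np.array([0.10, 0.18, 127.4, 0.20])

print(int8_roundtrip(normal))    # all four values survive
print(int8_roundtrip(outlier))   # every small value collapses to 0.0
```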

Solutions:

| Method | Approach |
|---|---|
| LLM.int8() | outlier channels stay fp16, rest in int8 |
| SmoothQuant | move outlier magnitude from activations to weights |
| AWQ | protect weights linked to outlier channels |

11. Tools You'll Actually Use

bitsandbytes — easiest start

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, better than plain int4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 after dequant
    bnb_4bit_use_double_quant=True,         # quantize the scales too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb,
    device_map="auto",
)
```

llama.cpp / Ollama — local inference

```bash
ollama run llama3   # picks the best quant for your hardware automatically
```

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,   # offload all layers to GPU
)
out = llm("What is quantization?", max_tokens=200)
```

AWQ — best accuracy at int4

```python
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4},
)
model.save_quantized("llama-3-8b-awq")

model = AutoAWQForCausalLM.from_quantized("llama-3-8b-awq", fuse_layers=True)
```

Choose your tool

| Goal | Tool | Format |
|---|---|---|
| Quickest start | bitsandbytes | in-memory |
| Best int4 accuracy | AWQ or GPTQ | .safetensors |
| Local / offline | llama.cpp, Ollama | .gguf |
| Production NVIDIA | TensorRT-LLM | TRT engine |
| Mobile / edge | TFLite (QAT) | .tflite |

12. Summary

The core idea

Map a continuous float range onto a discrete integer grid.
Store the index. Recover with a scale factor.
Error = half a grid step. Fewer bits = wider steps = more error.

The two formulas

Quantize:

```markdown
q = clamp( round(w / scale) + zero_point,  min_int,  max_int )
```

Dequantize:

```markdown
w̃ = (q – zero_point) × scale
```

Symmetric (for weights):

```markdown
zero_point = 0
scale      = abs_max / max_int
```

Asymmetric (for activations):

```markdown
scale      = (max – min) / (2^bits – 1)
zero_point = round(–min / scale)
```

Three decisions every quantization involves

```markdown
1. Bit width

   int4 ←─────────────────────→ fp32
   small / fast / lossy          large / slow / lossless

2. Granularity

   per-tensor ←───────────────→ per-group (g=32)
   simple                        accurate

3. Timing

   PTQ ←──────────────────────→ QAT
   no retraining                 full retraining
```

Quick use-case guide

| Situation | Recommendation |
|---|---|
| Run 7B on a laptop | int4 GGUF via Ollama |
| Run 70B on 1–2 GPUs | int4 GPTQ or AWQ |
| Fine-tune a quantized model | QLoRA (LoRA on an int4 base) |
| Production NVIDIA serving | int8 TensorRT |
| Train from scratch, low VRAM | bf16 + gradient checkpointing |


All calculations verified by hand and in NumPy.
