Quantization - Making Giant AI Models Fit in the Real World
From zero to practical, with hand math, Python, and honest tradeoffs.
1. What Is Quantization?
You have the number 3.14159265358979.
You only have space for two decimal places. So you write 3.14.
That's quantization: deliberately using less precision to save space, accepting a tiny error.
In AI: model weights stored as 32-bit floats get re-represented with fewer bits (16, 8, or 4). Less memory. Faster inference. Small accuracy cost.
Why it matters:
| Model | fp32 | int4 | Savings |
|---|---|---|---|
| Llama 3 8B | 32 GB | 4 GB | 8× |
| Llama 3 70B | 280 GB | 35 GB | 8× |
| GPT-3 175B | 700 GB | 87 GB | 8× |
A 70B model at fp32 needs 4× A100 80GB GPUs just to load.
At int4 (35 GB) it fits on a single A100, or spans a pair of 24 GB consumer cards.
2. How Numbers Live in Memory
Every number is stored as bits: 0s and 1s.
Integers
INT8 = 8 bits = 2⁸ = 256 possible values (−128 to +127)

┌──┬──┬──┬──┬──┬──┬──┬──┐
│S │     magnitude      │
└──┴──┴──┴──┴──┴──┴──┴──┘
 ↑
 sign bit

Gaps between values are always equal (uniform spacing):

−3  −2  −1   0   1   2   3
 |   |   |   |   |   |   |
 └── equal gaps everywhere ─┘

Floating Point
FP32 = 32 bits, split into 3 fields:
┌──┬──────────┬──────────────────────┐
│S │ Exponent │       Mantissa       │
│1b│  8 bits  │       23 bits        │
└──┴──────────┴──────────────────────┘
 ↑      ↑                ↑
+/−  which range     the digits

Value formula:

value = (−1)^S × 1.mantissa × 2^(exponent − 127)

The exponent picks a power-of-2 window.
The mantissa fills in the digits inside that window.
Gaps are small near zero, large far from zero:
Near 0:   ||||||||||||||||   ← dense
Near 1M:  |              |   ← sparse

Float formats at a glance
| Format | Bits | Exponent | Mantissa | ~Digits |
|---|---|---|---|---|
| fp32 | 32 | 8 | 23 | 7 |
| bf16 | 16 | 8 | 7 | 2 |
| fp16 | 16 | 5 | 10 | 3 |
| fp8 | 8 | 4–5 | 2–3 | 1 |
| int8 | 8 | – | – | integers |
| int4 | 4 | – | – | integers |
bf16 vs fp16: Both are 16-bit, different tradeoff.
bf16 keeps fp32's exponent range, so no overflow during training.
fp16 keeps more mantissa bits: more precise, but narrower range.
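You can see the three fp32 fields directly by unpacking a float's raw bits. A standard-library sketch (`fp32_fields` is just a throwaway helper name):

```python
import struct

def fp32_fields(x: float):
    """Split a float's 32-bit pattern into (sign, biased exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(fp32_fields(1.0))    # (0, 127, 0): +1.0 x 2^(127-127)
s, e, m = fp32_fields(-3.14)
print(s, e - 127)          # 1 1: negative sign, window [2, 4)
```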
3. What "Precision" Actually Means
Precision = how many distinct values you can represent.
More mantissa bits → finer grid → higher precision.
The formal measure: machine epsilon (ε), the gap between 1.0 and the next representable number.

fp32: ε ≈ 1.19 × 10⁻⁷  → ~7 decimal digits
fp16: ε ≈ 9.77 × 10⁻⁴  → ~3 decimal digits
bf16: ε ≈ 7.81 × 10⁻³  → ~2 decimal digits

fp16 is ~8,000× coarser than fp32 at the same magnitude.
What makes it "floating": relative precision stays roughly constant everywhere.
value       abs gap    rel gap
─────────   ────────   ────────
0.001       1.16e-10   ~1.2e-7
1.0         1.19e-07   ~1.2e-7
1,000,000   6.25e-02   ~6.3e-8

The absolute gap at 1,000,000 is 0.0625.
You cannot represent any value between 1000000.0 and 1000000.0625.
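NumPy makes this easy to verify; a small sketch:

```python
import numpy as np

x = np.float32(1_000_000.0)
gap = np.nextafter(x, np.float32(np.inf)) - x
print(gap)                        # 0.0625
print(x + np.float32(0.03) == x)  # True: 0.03 is under half a gap, so it vanishes
```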
4. Why AI Models Are a Memory Problem
Every weight is a number in memory.
Memory = num_parameters × bytes_per_weight

Bytes per weight:

fp32 → 4 bytes
fp16 → 2 bytes
int8 → 1 byte
int4 → 0.5 bytes

A 7B model:

fp32 : 7B × 4   = 28 GB
fp16 : 7B × 2   = 14 GB
int8 : 7B × 1   = 7 GB
int4 : 7B × 0.5 = 3.5 GB

That's just the weights. Total inference memory breakdown:
| Component | Typical Size | Notes |
|---|---|---|
| Model weights | 1–4 bytes/param | Depends on precision (fp16, int8, int4) |
| KV cache | 2–10× weights | Scales with context length; major memory hog |
| Activations | Variable | Depends on batch size and model size |
| CUDA overhead | ~1–2 GB | GPU runtime requirements |
Total ≈ weights + KV cache + activations + CUDA overhead
Training memory is far worse:
weights 4 bytes
gradients 4 bytes
Adam m (1st) 4 bytes
Adam v (2nd) 4 bytes
─────────────────────────
total          16 bytes / parameter

7B model training: 7B × 16 = 112 GB

This is why quantization went from optional to essential.
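The memory formula is simple enough to keep as a helper. A sketch (`model_memory_gb` is a hypothetical name; the training branch assumes the fp32 Adam layout above):

```python
def model_memory_gb(params_billion: float, bytes_per_weight: float = 2,
                    training: bool = False) -> float:
    """Weights-only inference memory, or the 16-byte/param fp32 Adam training footprint."""
    if training:
        return params_billion * 16   # weights + gradients + Adam m + Adam v, all fp32
    return params_billion * bytes_per_weight

print(model_memory_gb(7, 4))               # 28: 7B at fp32
print(model_memory_gb(7, 0.5))             # 3.5: 7B at int4
print(model_memory_gb(7, training=True))   # 112: 7B fp32 Adam training
```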
5. Quantization by Hand
Our example weights:
w = [0.32, −1.54, 0.87, −0.21, 1.23, −0.76, 0.05, −1.91]

A. Symmetric INT8

Symmetric = centered at zero. Range: [−127, +127]. No offset.

Step 1: Find the largest absolute value

abs_max = max(|0.32|, |−1.54|, |0.87|, ...)
        = 1.91

Step 2: Compute scale

scale = abs_max / 127
      = 1.91 / 127
      = 0.01504

Step 3: Quantize q = round(w / scale)
 0.32 / 0.01504 =   21.28 →   21
−1.54 / 0.01504 = −102.4  → −102
 0.87 / 0.01504 =   57.85 →   58
−0.21 / 0.01504 =  −13.96 →  −14
 1.23 / 0.01504 =   81.78 →   82
−0.76 / 0.01504 =  −50.53 →  −51
 0.05 / 0.01504 =    3.32 →    3
−1.91 / 0.01504 = −127.0  → −127

These integers are stored. 1 byte each instead of 4.
Step 4: Dequantize ŵ = q × scale

  21 × 0.01504 =  0.3158   original  0.32   Δ = +0.004
−102 × 0.01504 = −1.5341   original −1.54   Δ = −0.006
  58 × 0.01504 =  0.8723   original  0.87   Δ = −0.002
 −14 × 0.01504 = −0.2106   original −0.21   Δ = +0.001
  82 × 0.01504 =  1.2333   original  1.23   Δ = −0.003
 −51 × 0.01504 = −0.7670   original −0.76   Δ = +0.007
   3 × 0.01504 =  0.0451   original  0.05   Δ = +0.005
−127 × 0.01504 = −1.9101   original −1.91   Δ = 0.000

Max error: 0.007, barely noticeable.
B. Asymmetric INT4

Asymmetric = uses a zero-point to shift the grid.
Range: [0, 15], only 16 levels.
Step 1: Find min and max

min = −1.91
max = 1.23

Step 2: Compute scale

scale = (max − min) / (2⁴ − 1)
      = (1.23 − (−1.91)) / 15
      = 3.14 / 15
      = 0.2093

Step 3: Compute zero-point

zero_point = round(−min / scale)
           = round(1.91 / 0.2093)
           = round(9.12)
           = 9

Float 0.0 now maps to integer 9. The grid covers [−1.88, +1.26].
Step 4: Quantize q = clamp(round(w / scale) + zp, 0, 15)

 0.32: round( 1.53) + 9 =  2 + 9 = 11
−1.54: round(−7.36) + 9 = −7 + 9 =  2
 0.87: round( 4.16) + 9 =  4 + 9 = 13
−0.21: round(−1.00) + 9 = −1 + 9 =  8
 1.23: round( 5.88) + 9 =  6 + 9 = 15
−0.76: round(−3.63) + 9 = −4 + 9 =  5
 0.05: round( 0.24) + 9 =  0 + 9 =  9
−1.91: round(−9.12) + 9 = −9 + 9 =  0

Step 5: Dequantize ŵ = (q − zp) × scale
(11−9) × 0.2093 =  0.4187   original  0.32   Δ = −0.099
( 2−9) × 0.2093 = −1.4651   original −1.54   Δ = −0.075
(13−9) × 0.2093 =  0.8372   original  0.87   Δ = +0.033
( 8−9) × 0.2093 = −0.2093   original −0.21   Δ = −0.001
(15−9) × 0.2093 =  1.2558   original  1.23   Δ = −0.026
( 5−9) × 0.2093 = −0.8372   original −0.76   Δ = +0.077
( 9−9) × 0.2093 =  0.0000   original  0.05   Δ = +0.050
( 0−9) × 0.2093 = −1.8837   original −1.91   Δ = −0.026

Max error: 0.099, about 14× worse than INT8.
Why? INT8 has 256 levels, INT4 has 16. Each step is 14× wider.

INT8 step: 1.91 / 127 = 0.015  (fine)
INT4 step: 3.14 / 15  = 0.209  (coarse)

C. Per-Group INT4
Instead of one scale for all 8 values, compute one scale per group of 4.
Group 0: [0.32, −1.54, 0.87, −0.21] → scale = 0.161, zp = 10
Group 1: [1.23, −0.76, 0.05, −1.91] → scale = 0.209, zp = 9

Each group adapts its scale to its local range: fewer wasted levels.

Per-tensor: max error = 0.099
Per-group:  max error = 0.077  (1.3× better)

In real models with 128+ values per group, the improvement is much larger.
6. Quantization in Python
Symmetric INT8
import numpy as np
weights = np.array([0.32, -1.54, 0.87, -0.21,
1.23, -0.76, 0.05, -1.91])
# --- quantize ---
abs_max = np.max(np.abs(weights))
scale = abs_max / 127.0
q = np.round(weights / scale).astype(np.int8)
# scale = 0.015039
# q = [ 21 -102 58 -14 82 -51 3 -127]
# --- dequantize ---
recovered = q.astype(np.float32) * scale
errors = weights - recovered
print(f"max error : {np.max(np.abs(errors)):.6f}") # 0.007008
print(f"mean error: {np.mean(np.abs(errors)):.6f}") # 0.003514
print(f"memory : {weights.nbytes}B β {q.nbytes}B ({weights.nbytes//q.nbytes}Γ smaller)")
# memory : 64B β 8B (8Γ smaller)Asymmetric INT4
def quantize_int4(weights):
    w_min = weights.min()
    w_max = weights.max()
    scale = (w_max - w_min) / 15
    zp = int(np.round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zp, 0, 15)
    return q.astype(np.int8), scale, zp

def dequantize_int4(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

q4, s4, zp4 = quantize_int4(weights)
r4 = dequantize_int4(q4, s4, zp4)
# scale = 0.2093, zero_point = 9
# q = [11 2 13 8 15 5 9 0]
# max error = 0.098667

Per-Group INT4
def quantize_per_group(weights, group_size=4):
    n = len(weights)
    q_out = np.zeros(n, dtype=np.int8)
    dq_out = np.zeros(n, dtype=np.float32)
    for g in range(n // group_size):
        lo = g * group_size
        hi = lo + group_size
        chunk = weights[lo:hi]
        mn = chunk.min()
        mx = chunk.max()
        scale = (mx - mn) / 15
        zp = int(np.round(-mn / scale))
        q = np.clip(np.round(chunk / scale) + zp, 0, 15)
        dq = (q.astype(np.float32) - zp) * scale
        q_out[lo:hi] = q
        dq_out[lo:hi] = dq
    return q_out, dq_out

_, dq_g = quantize_per_group(weights, group_size=4)
print(f"per-tensor max error: {np.max(np.abs(weights - r4)):.4f}")   # 0.0987
print(f"per-group max error : {np.max(np.abs(weights - dq_g)):.4f}") # 0.0773

Machine Epsilon
for dtype, name in [(np.float32, 'fp32'), (np.float16, 'fp16')]:
    eps = np.finfo(dtype).eps
    nxt = np.nextafter(dtype(1.0), dtype(2.0))
    gap = nxt - dtype(1.0)
    print(f"{name}: ε = {eps:.2e}   gap after 1.0 = {gap:.2e}")
# fp32: ε = 1.19e-07   gap after 1.0 = 1.19e-07
# fp16: ε = 9.77e-04   gap after 1.0 = 9.77e-04

v = 1.23456789
print(f"fp32: {np.float32(v):.8f}   error: {abs(v - float(np.float32(v))):.2e}")
print(f"fp16: {np.float16(v):.8f}   error: {abs(v - float(np.float16(v))):.2e}")
# fp32: 1.23456788   error: 9.37e-09
# fp16: 1.23437500   error: 1.93e-04

7. Types of Quantization
By timing
Post-Training Quantization (PTQ): compress after training.

trained fp32 model
        ↓
~128 calibration samples
        ↓
measure activation ranges
        ↓
compute scale + zero-point
        ↓
quantized model

✅ Fast. No retraining needed.
⚠️ Some accuracy loss, especially below int8.
Quantization-Aware Training (QAT): simulate quantization during training.

training loop:
    forward pass
        ↓
    fake-quantize weights   (round then unround, simulating int4)
        ↓
    compute loss
        ↓
    backward pass   (gradients still flow)
        ↓
    update weights
        ↓ (repeat)
export truly quantized model

✅ Best accuracy at int4/int2.
⚠️ Requires full retraining.
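The fake-quantize step can be sketched in NumPy. In real QAT the backward pass uses a straight-through estimator (the gradient of `round` is treated as identity); this sketch shows only the forward rounding, assuming a symmetric int4 grid:

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Round onto the int grid, then map straight back to float;
    these are the weight values a QAT forward pass actually sees."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for a symmetric int4 grid
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale   # still float, but only ~2^bits distinct values

w = np.array([0.32, -1.54, 0.87, -0.21])
print(fake_quantize(w, bits=4))          # [ 0.22 -1.54  0.88 -0.22]
```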
By granularity
Per-tensor: one scale for the whole matrix.

┌──────────────────────┐
│ scale = 0.021        │
│  0.32  −1.54   0.87  │
│ −0.21   1.23  −0.76  │
└──────────────────────┘

Simple. Fastest. Least accurate.

Per-channel: one scale per row.

s₀ →  [  0.32  −1.54   0.87 ]
s₁ →  [ −0.21   1.23  −0.76 ]
s₂ →  [  0.05  −1.91   0.44 ]

Standard for int8. Noticeably better accuracy.

Per-group: one scale per N values within a row.

row: [0.32, −1.54, 0.87, −0.21 | 1.23, −0.76, 0.05, −1.91]
      └─── group 0 (s=0.161) ─┘ └─── group 1 (s=0.209) ──┘

Best accuracy. Used by GPTQ, AWQ, GGUF. Common sizes: 32, 64, 128.
By target
| Target | What's quantized | Main benefit |
|---|---|---|
| Weight-only | Parameters only | Memory savings |
| Weight + activation | Parameters + layer outputs | Full compute speedup |
Weights are easy: stable distributions, computed once.
Activations are harder: they change with every input and are prone to outliers.
8. How a Real Model Gets Quantized
End-to-end with GPTQ (most common PTQ for LLMs).
Step 1: Load in fp16

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype="float16",
    device_map="auto",
)
# VRAM: ~14 GB

Step 2: Calibration data
samples = [
    tokenizer(s["text"], return_tensors="pt",
              max_length=512, truncation=True)
    for s, _ in zip(dataset, range(128))
]

128 samples reveal how activations are distributed in real use.
Step 3: Layer-by-layer quantization

for each weight matrix:
    1. run calibration → observe input activations
    2. compute Hessian (which weights matter most)
    3. quantize column by column
    4. compensate remaining columns for the error introduced
    5. store: int4 weights + fp16 scales + int8 zero-points

Step 4 is the key insight: not naive rounding.
Step 4: Mixed precision

Not every layer quantizes equally well:

Layer            Precision   Reason
────────────────────────────────────────────
Embeddings       fp16        critical, small
First attention  int8        seen on every token
Middle layers    int4        bulk of parameters
LM head          fp16        picks next token directly

This is why GGUF filenames encode the strategy:
Q4_K_M.gguf
  Q4 = 4-bit weights
  K  = k-quants (smarter grouping)
  M  = medium (some layers at higher precision)

Other variants:
  Q8_0   = 8-bit, best quality
  Q4_K_S = more int4, smaller file
  Q2_K   = 2-bit, smallest, worst quality

Step 5: Save and run
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

config = BaseQuantizeConfig(bits=4, group_size=128)
model_q = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-3-8B", config)
model_q.quantize(samples)
model_q.save_quantized("llama-3-8b-gptq-int4")
# Saved: ~4.5 GB (was 14 GB)

9. Memory Requirements
Model weights
Memory (GB) = params (B) × bytes per weight

        fp32   fp16   int8   int4
        ────   ────   ────   ────
 1B:       4      2      1    0.5
 7B:      28     14      7    3.5
13B:      52     26     13    6.5
70B:     280    140     70     35

KV cache
KV cache =
      2 (key + value)
    × num_layers
    × num_kv_heads
    × head_dim
    × context_length
    × bytes_per_element

Llama 3 8B, ctx=8192, fp16:
2 × 32 × 8 × 128 × 8192 × 2 ≈ 1 GB per sequence

A 7B int4 model (3.5 GB weights) at 8K context can still need ~6–12 GB total, depending on attention layout (GQA vs full MHA), batch size, and runtime overhead.
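The formula translates directly to code. A sketch (`kv_cache_gb` is a throwaway helper; the Llama 3 8B GQA layout of 32 layers, 8 KV heads, head_dim 128 is assumed, and batch size multiplies the result):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int,
                bytes_per_el: int = 2, batch: int = 1) -> float:
    """2 tensors (K and V) per layer, one entry per head, dim, and position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_el * batch / 1e9

# Llama 3 8B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16
print(kv_cache_gb(32, 8, 128, 8192))            # ~1.07 GB per sequence
print(kv_cache_gb(32, 8, 128, 8192, batch=8))   # batching multiplies it
```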
Training memory
weights 4 bytes
gradients 4 bytes
Adam m 4 bytes
Adam v 4 bytes
─────────────────────
total       16 bytes / parameter

 7B:  7B × 16 = 112 GB
13B: 13B × 16 = 208 GB

GPU guide
GPU             VRAM     Max model (int4)
──────────────────────────────────────────
RTX 3080        10 GB    7B  (tight)
RTX 3090/4090   24 GB    13B (comfortable)
A100 40GB       40 GB    30B (comfortable)
A100 80GB       80 GB    70B (comfortable)
2× A100 80GB    160 GB   70B in fp16

10. Tradeoffs
Accuracy vs bits
Bits   Compression   Accuracy loss
──────────────────────────────────────────────
fp16   2×            < 0.1%   (imperceptible)
int8   4×            < 1%     (rarely noticeable)
int4   8×            1–5%     (noticeable on hard tasks)
int2   16×           10–30%   (often unusable)

Loss is worse on: reasoning tasks, long contexts, smaller models.
Speed: it's bandwidth, not compute
LLM inference is memory-bandwidth bound: the GPU spends most of its time reading weights, not multiplying them.
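A back-of-the-envelope roofline makes this concrete: each generated token must stream every weight byte through memory once, so bandwidth sets a hard ceiling (the 1000 GB/s figure below is a round, hypothetical number for a high-end GPU):

```python
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    """Decoding ceiling: every token streams all weights through memory once."""
    return bandwidth_gb_s / model_gb

print(max_tokens_per_sec(model_gb=14, bandwidth_gb_s=1000))  # fp16 8B: ~71 tok/s ceiling
print(max_tokens_per_sec(model_gb=4, bandwidth_gb_s=1000))   # int4 8B: 250 tok/s ceiling
```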
int4 = 8× fewer bytes to read
     → 2–4× real speedup (after overhead)

For int8 compute speedup you need compatible tensor cores:
Supported:     NVIDIA A100, RTX 30xx/40xx, Apple M-series
Not supported: older GPUs (int8 may actually be slower)

Granularity vs overhead
| Method | Accuracy | Memory overhead |
|---|---|---|
| Per-tensor | lowest | 2 values total |
| Per-channel | better | 2 per row |
| Per-group g=128 | good | ~3β5% |
| Per-group g=32 | best | ~12% |
The outlier problem
In transformers, ~0.1% of activation values are 100× larger than the rest.
normal:  [0.10, 0.18,  0.15, 0.20, ...]
outlier: [0.10, 0.18, 127.4, 0.20, ...]
                  ↓
forces scale = 127.4 / 127 = 1.003

now: 0.10 → round(0.10 / 1.003) = 0
     0.18 → round(0.18 / 1.003) = 0
     0.20 → round(0.20 / 1.003) = 0

99.9% of values collapse to zero. Precision destroyed.

Solutions:
| Method | Approach |
|---|---|
| LLM.int8() | outlier channels stay fp16, rest in int8 |
| SmoothQuant | move outlier magnitude from activations to weights |
| AWQ | protect weights linked to outlier channels |
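The collapse, and the LLM.int8()-style fix of keeping outliers in float, can be reproduced in a few lines (the 6.0 threshold follows the paper; the tiny array is made up):

```python
import numpy as np

acts = np.array([0.10, 0.18, 0.15, 0.20, 127.4])

# One scale for everything: the outlier sets it, everything else rounds to 0
scale = np.max(np.abs(acts)) / 127
q = np.round(acts / scale)
print(q)                      # [  0.   0.   0.   0. 127.]

# LLM.int8()-style split: outliers stay float, the rest get their own scale
mask = np.abs(acts) > 6.0     # 6.0 is the threshold used in the paper
small = acts[~mask]
s2 = np.max(np.abs(small)) / 127
print(np.round(small / s2))   # small values now spread across the int8 range
```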
11. Tools You'll Actually Use
bitsandbytes: easiest start
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, better than plain int4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 after dequant
    bnb_4bit_use_double_quant=True,         # quantize the scales too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb,
    device_map="auto",
)

llama.cpp / Ollama: local inference
ollama run llama3   # picks best quant for your hardware automatically

from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,   # offload all layers to GPU
)
out = llm("What is quantization?", max_tokens=200)

AWQ: best accuracy at int4
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4},
)
model.save_quantized("llama-3-8b-awq")
model = AutoAWQForCausalLM.from_quantized("llama-3-8b-awq", fuse_layers=True)

Choose your tool
| Goal | Tool | Format |
|---|---|---|
| Quickest start | bitsandbytes | in-memory |
| Best int4 accuracy | AWQ or GPTQ | .safetensors |
| Local / offline | llama.cpp, Ollama | .gguf |
| Production NVIDIA | TensorRT-LLM | TRT engine |
| Mobile / edge | TFLite (QAT) | .tflite |
12. Summary
The core idea
Map a continuous float range onto a discrete integer grid.
Store the index. Recover with a scale factor.
Worst-case error = half a grid step. Fewer bits = wider steps = more error.
The two formulas

Quantize:
q = clamp( round(w / scale) + zero_point, min_int, max_int )

Dequantize:
ŵ = (q − zero_point) × scale

Symmetric (for weights):
zero_point = 0
scale = abs_max / max_int

Asymmetric (for activations):
scale = (max − min) / (2^bits − 1)
zero_point = round(−min / scale)

Three decisions every quantization involves
1. Bit width

int4 ←──────────────────────→ fp32
small / fast / lossy    large / slow / lossless

2. Granularity

per-tensor ←────────────────→ per-group (g=32)
simple                  accurate

3. Timing

PTQ ←───────────────────────→ QAT
no retraining      full retraining

Quick use-case guide
| Situation | Recommendation |
|---|---|
| Run 7B on a laptop | int4 GGUF via Ollama |
| Run 70B on 1β2 GPUs | int4 GPTQ or AWQ |
| Fine-tune a quantized model | QLoRA (LoRA on int4 base) |
| Production NVIDIA serving | int8 TensorRT |
| Train from scratch, low VRAM | bf16 + gradient checkpointing |
Further Reading
- GPTQ paper: accurate post-training int4
- AWQ paper: activation-aware weight quantization
- SmoothQuant: handling activation outliers
- LLM.int8(): bitsandbytes 8-bit method
- llama.cpp: GGUF reference implementation
- HF Quantization Guide: practical overview
All calculations verified by hand and in NumPy.