Python LLM Quantization: Making Models Smaller
Quantization converts neural network weights from 32-bit floating-point (4 bytes per value) to 8-bit (1 byte) or 4-bit (0.5 bytes) integers, reducing memory by 75–90% with minimal quality loss. A 7B parameter model that normally requires 14 GB of VRAM shrinks to 3.5 GB; a 70B model drops from 140 GB to 36 GB.
This tutorial covers quantization mechanics, the two main approaches (BitsAndBytes for training-friendly, GPTQ/AWQ for inference-only), and how to apply quantization in Python with code examples. By the end, you'll know when quantization is right for your use case.
Understanding Quantization
Quantization maps the continuous range of floating-point values to discrete integer buckets. In 4-bit quantization:
Float32: -0.4251 → 8-bit bucket: -15 (on scale -128 to 127)
Float32: +0.1872 → 8-bit bucket: +6
Float32: -0.0031 → 8-bit bucket: 0
Modern quantization preserves the model's ability to distinguish between these values using scale factors and zero-points (offsets). A 7B model's 7 billion weights compress from 28 GB (7B × 4 bytes) to 3.5 GB (7B × 0.5 bytes per 4-bit value), a 8 reduction.
The quality loss is small because neural networks are over-parameterized; dropping low-significance bits has minimal impact on outputs. Benchmarks show 4-bit quantized Llama-2-7B within 0–2% of the original on MMLU, HellaSwag, and other standard tasks (Hugging Face Leaderboard, 2026).
Two Quantization Methods: BitsAndBytes vs. GPTQ
| Aspect | BitsAndBytes (QLoRA) | GPTQ |
|---|---|---|
| Inference speed | 10–15% slower than FP16 | Same as FP16 (optimized) |
| VRAM for training | Very low (4-bit models trainable on single GPU) | Not trainable (inference-only) |
| Accuracy | ~1% drop at 4-bit | ~2–3% drop at 4-bit |
| Load time | Fast (weights already 4-bit) | Slower (requires calibration) |
| Best for | Fine-tuning on consumer hardware | Production inference |
BitsAndBytes (QLoRA): Loads a model in 4-bit on the fly, keeping activations in full precision. Slightly slower inference but enables low-rank adaptation (LoRA) fine-tuning.
GPTQ: Pre-quantizes weights offline using a calibration dataset, storing them in GPTQ format. Faster inference but can't fine-tune without full re-quantization.
BitsAndBytes 4-Bit Quantization
Install BitsAndBytes:
pip install bitsandbytes
Load a 7B model in 4-bit:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True, # Quantize the quantization scale
bnb_4bit_quant_type="nf4" # Normal float 4-bit
)
# Load model with 4-bit config
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Now run inference (only ~3.5 GB VRAM used)
prompt = "What is Python?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The device_map="auto" parameter automatically distributes the model across available GPUs/CPU. Key parameters:
bnb_4bit_compute_dtype=torch.float16— Use FP16 for intermediate computations (balance of speed and precision)bnb_4bit_use_double_quant=True— Quantize the quantization scales themselves (saves additional 0.4 bits per value)bnb_4bit_quant_type="nf4"— Normal float 4-bit, slightly better quality than regular int4
Memory usage for Mistral-7B: ~3.5 GB (vs. 14 GB unquantized, 5.2 GB in 8-bit). On a consumer GPU like RTX 4070 (12 GB), this leaves room for a batch of 4–8 examples.
8-Bit Quantization
8-bit quantization is lossless for most models; use it if you have memory to spare but want zero quality loss:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
load_in_8bit=True,
device_map="auto"
)
# Uses ~7 GB VRAM (vs. 14 GB unquantized)
This is roughly equivalent to BitsAndBytesConfig with load_in_8bit=True. It's faster and simpler than 4-bit but uses ~2 more VRAM.
GPTQ Pre-Quantized Models
Many models on Hugging Face Hub are pre-quantized in GPTQ format. These are ready to load without on-the-fly quantization overhead:
pip install auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load a pre-GPTQ model
model_name = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Run inference (very fast, already quantized)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
GPTQ models are faster but have slightly lower quality (2–3% vs. 1% for BitsAndBytes). Use GPTQ for production inference; use BitsAndBytes if you plan to fine-tune.
Benchmarking: Quality vs. Speed Trade-off
Here's a representative benchmark on Mistral-7B on HellaSwag (higher is better):
FP16 (full precision): 83.2%
8-bit quantization: 83.1% (0.1% drop)
4-bit (nf4 + double quant): 82.8% (0.4% drop)
4-bit (standard): 82.1% (1.1% drop)
For most applications, users don't notice the 0.4–1% quality difference. Inference speed (tokens/second) improves slightly due to smaller memory bandwidth, but quantization is not primarily for speed; it's for fitting larger models into available VRAM.
Fine-Tuning with QLoRA
To fine-tune a quantized model, use the peft (Parameter Efficient Fine-Tuning) library with LoRA:
pip install peft
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
import torch
# Load model in 4-bit
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"mistral-community/Mistral-7B-Instruct-v0.3",
quantization_config=bnb_config,
device_map="auto"
)
# Add LoRA adapters (low-rank updates)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank of LoRA update matrices
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"] # Which layers to fine-tune
)
model = get_peft_model(model, peft_config)
# Now fine-tune on your data (train() call omitted for brevity)
# LoRA adapters only add ~0.2% extra VRAM; base model stays frozen in 4-bit
This approach uses ~4 GB for the 4-bit model plus a small amount for LoRA gradients, making it possible to fine-tune a 7B model on a single consumer GPU (RTX 4070, 12 GB).
Common Pitfalls
Pitfall 1: Using wrong compute dtype
# Wrong: int4 compute (leads to numerical instability)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.int8)
# Right: Use float16 or float32 for compute
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
Pitfall 2: Loading quantized model on CPU Quantized models are designed for GPU. Loading on CPU bypasses the speedup (defeats the purpose).
Pitfall 3: Not setting device_map
Always use device_map="auto" to avoid manual device placement errors.
Key Takeaways
- 4-bit quantization reduces model size by 75% (14 GB → 3.5 GB) with 0.4–1% quality loss.
- BitsAndBytes is best for training; GPTQ is best for inference-only workloads.
- Use 8-bit for zero quality loss; use 4-bit when VRAM is critical.
- QLoRA enables fine-tuning quantized models on consumer hardware.
- Pre-GPTQ models on Hugging Face avoid on-the-fly quantization overhead.
Frequently Asked Questions
Is quantized model output identical to full precision?
No, but nearly so. Quality difference is imperceptible for most tasks (0.4–1% on benchmarks). The trade-off is worth it for 4 reduction in VRAM.
Can I quantize a model to 2-bit or 3-bit?
Theoretically yes, but quality drops significantly (5–10%). Tools like GGML (used by Ollama) support 2-bit and 3-bit, but they're rarely better than 4-bit in practice.
How long does on-the-fly quantization take (BitsAndBytes)?
The first inference is slower due to quantization overhead (~10–15% speed penalty). Subsequent inferences run at normal speed. For production, pre-quantize models offline.
Can I mix quantization with multi-GPU?
Yes, device_map="auto" distributes quantized models across GPUs. On two RTX 4090s (48 GB total), you can run a 70B model in 4-bit.
What's the difference between nf4 and int4 quantization?
nf4 (normal float 4-bit) preserves the distribution of activations better than standard int4, resulting in ~0.3% higher quality. Use nf4 if available.
Further Reading
- BitsAndBytes GitHub — Official implementation and details.
- QLoRA Paper — Original research on 4-bit fine-tuning.
- GPTQ Paper — Post-training quantization method.
- Hugging Face Quantization Guide — Complete API reference.