Getting Started with Hugging Face Transformers
The Hugging Face transformers library is the de facto standard Python library for loading and running open-source language models. It abstracts away much of the complexity of tokenization, model architecture, and inference, so the same few lines of code can load a 7B or a 70B parameter model and generate text, provided your hardware has enough memory for the model you choose.
This tutorial walks you through installation, loading your first model, understanding the difference between pipelines and low-level APIs, and debugging common issues. By the end, you'll be able to load any model from the Hugging Face Hub and customize inference for your application.
Installation and Setup
Install the transformers library with PyTorch or TensorFlow as the backend:
# For GPU (CUDA 12.1+)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers
# For CPU only
pip install torch
pip install transformers
# For Apple Silicon (Metal acceleration)
pip install torch  # The default macOS wheel includes the MPS (Metal) backend; select it with device="mps"
pip install transformers
Verify the installation:
import torch
import transformers
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
On a system with a GPU, the output shows your GPU name (e.g., NVIDIA RTX 4090). On CPU-only systems, GPU available: False is expected.
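The availability check above extends naturally to a small helper that returns a device string for later use. This is a minimal sketch rather than part of the transformers API: `pick_device` is a hypothetical name, and falling back to "cpu" when torch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, report the CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend (Apple Silicon) only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string can be passed to `model.to(device)` in the examples that follow.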
The Simplest Way: Pipelines
The pipeline API is the fastest way to run inference. It handles tokenization, model loading, and post-processing automatically:
from transformers import pipeline
# Load a text-generation pipeline
generator = pipeline("text-generation", model="mistral-community/Mistral-7B-Instruct-v0.3")
# Run inference
prompt = "Explain quantum computing in simple terms."
result = generator(prompt, max_new_tokens=150, do_sample=True, temperature=0.7)  # temperature only takes effect when sampling
print(result[0]["generated_text"])
The pipeline function automatically:
- Downloads the model (1–14 GB depending on size) on first run
- Loads it into memory (pass device=0 or device_map="auto" to place it on a GPU explicitly; older versions default to CPU)
- Tokenizes your input text
- Runs the model forward pass
- Decodes the output tokens back to text
Available pipeline tasks:
- "text-generation": Generate text (causal LMs)
- "text2text-generation": Seq2Seq models (summarization, translation)
- "question-answering": Extract answers from context
- "fill-mask": Fill in [MASK] tokens (BERT-style)
- "feature-extraction": Get embedding vectors
Low-Level Control: AutoTokenizer and AutoModel
Pipelines are convenient but limit customization. For fine-grained control over inference, use AutoTokenizer and AutoModelForCausalLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Prepare input
prompt = "In Python, how do I read a file?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generate output
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=100,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
# Decode
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)
Key parameters:
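- torch_dtype=torch.float16: Use 16-bit precision (faster, and about half the VRAM of float32)
- max_length: Maximum total sequence length in tokens (input + output); use max_new_tokens to cap only the generated part
- temperature: Randomness (values near 0 are close to deterministic, values above 1 are more random); only applies when do_sample=True
- top_p: Nucleus sampling (sample only from the smallest set of tokens whose cumulative probability reaches 0.95)
- do_sample=True: Use sampling instead of greedy decoding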
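To make temperature and top_p concrete, the sketch below reimplements both ideas in plain Python on a toy four-token vocabulary. It illustrates the concepts only and is not how transformers implements them internally; the function names and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
nucleus = top_p_filter(softmax_with_temperature(logits), top_p=0.9)

print(f"Top-token probability at T=0.5: {sharp[0]:.3f}")
print(f"Top-token probability at T=1.5: {flat[0]:.3f}")
print(f"Token indices kept by top_p=0.9: {sorted(nucleus)}")
```

With these toy logits, the lowest-scoring token falls outside the nucleus and can never be sampled, while a lower temperature concentrates probability on the top token.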
Loading Different Model Types
Hugging Face hosts models for different tasks. Pick the right class:
# Causal LM (generate text left-to-right, e.g., GPT style)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistral-community/Mistral-7B-Instruct-v0.3")
# Masked LM (fill masks, e.g., BERT)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-Sequence (summarization, translation)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
# Question-Answering
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
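When the task is only known at runtime, the class choices above can be expressed as a lookup table. The following is a sketch under stated assumptions: `AUTO_CLASS_FOR_TASK` and `auto_class_name` are hypothetical helpers, not part of transformers, and the table covers only the four classes shown in this section.

```python
# Maps a task name to the name of the matching Auto class from this section.
AUTO_CLASS_FOR_TASK = {
    "text-generation": "AutoModelForCausalLM",
    "fill-mask": "AutoModelForMaskedLM",
    "summarization": "AutoModelForSeq2SeqLM",
    "translation": "AutoModelForSeq2SeqLM",
    "question-answering": "AutoModelForQuestionAnswering",
}


def auto_class_name(task: str) -> str:
    """Return the Auto class name for a task, or raise ValueError if unknown."""
    try:
        return AUTO_CLASS_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(AUTO_CLASS_FOR_TASK))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None


print(auto_class_name("fill-mask"))
# With transformers installed, the class itself can be resolved by name:
#   import transformers
#   model_cls = getattr(transformers, auto_class_name("fill-mask"))
```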
Controlling Memory Usage
Models can consume 14–160+ GB of VRAM depending on size. Three strategies reduce memory:
1. Use smaller models (roughly 3B–13B parameters instead of 70B):
# Use Phi-3-mini, a compact 3.8B-parameter model
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", torch_dtype=torch.float16)
# Weights take ~8 GB in float16 (vs. ~140 GB for a 70B model)
2. Load in 8-bit or 4-bit precision:
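# Requires: pip install bitsandbytes accelerate
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model = AutoModelForCausalLM.from_pretrained(
    "mistral-community/Mistral-7B-Instruct-v0.3",
    # Passing load_in_8bit=True directly is deprecated; use a quantization config instead
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # or load_in_4bit=True
    device_map="auto"
)
# 7B model weights now use ~7 GB instead of ~14 GB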
3. Use Flash Attention for speed and lower memory:
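model = AutoModelForCausalLM.from_pretrained(
    "mistral-community/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,  # FlashAttention-2 supports only fp16/bf16
    attn_implementation="flash_attention_2"  # Requires the flash-attn package and a supported GPU
)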
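The memory figures quoted in this section follow from a back-of-the-envelope rule: parameter count multiplied by bytes per parameter. The sketch below encodes that rule. `weight_memory_gb` is a hypothetical helper, and it estimates weights only; activations and the KV cache add further memory that it does not model.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for dtype in ("float32", "float16", "int8", "int4"):
        print(f"{name} in {dtype}: ~{weight_memory_gb(params, dtype):.1f} GB")
```

This reproduces the numbers above: a 7B model needs about 14 GB in float16 and about 7 GB in 8-bit, while a 70B model needs about 140 GB in float16.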
Batching for Higher Throughput
To process multiple prompts at once, batch them:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
prompts = [
"What is Python?",
"Explain quantum computing.",
"How do neural networks work?"
]
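# Set padding token for batching; decoder-only models should be padded on the left
# so that generation continues directly from the last real token of each prompt
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"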
# Tokenize all at once
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
# Generate for all prompts in parallel
output_ids = model.generate(**inputs, max_length=100)
# Decode all
responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}")
    print(f"A: {response}\n")
On a GPU with spare capacity, batching three prompts can be close to 3× faster than running them one at a time; the actual gain depends on hardware and sequence lengths.
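Batching requires every sequence in a batch to have the same length, which is what padding=True arranges. The toy sketch below shows, with plain lists of made-up token IDs, what a padded batch and its attention mask look like. `pad_batch` is a hypothetical helper, not the tokenizer's implementation.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        pad, ones = [pad_id] * gap, [1] * len(seq)
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append([0] * gap + ones)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(ones + [0] * gap)
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths.
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, the last position of every row is a real token, so a decoder-only model generates from actual prompt text rather than from padding.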
Debugging and Troubleshooting
Issue: Out-of-Memory (OOM) Error
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
Solution: Reduce max_length, load the model in 8-bit or 4-bit precision (see Controlling Memory Usage), or switch to a smaller model.
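When the error appears during batched generation, a smaller batch is another thing to try. The sketch below retries with half the batch size whenever a RuntimeError mentioning "out of memory" is raised. It is an illustration under assumptions: `run_with_backoff` and `run_batch` are hypothetical names, and `run_batch` stands in for whatever function wraps your model.generate call.

```python
def run_with_backoff(run_batch, prompts, batch_size):
    """Process prompts in chunks, halving the chunk size after each OOM error."""
    results, start = [], 0
    while start < len(prompts):
        chunk = prompts[start:start + batch_size]
        try:
            results.extend(run_batch(chunk))
            start += len(chunk)
        except RuntimeError as err:
            # CUDA OOM errors are RuntimeErrors whose message mentions "out of memory".
            if "out of memory" not in str(err).lower() or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
            # With PyTorch, calling torch.cuda.empty_cache() here can help free memory.
    return results


def fake_run_batch(chunk):
    """Stand-in for real generation: pretends batches above 2 do not fit."""
    if len(chunk) > 2:
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
    return [text.upper() for text in chunk]


outputs = run_with_backoff(fake_run_batch, ["a", "b", "c", "d", "e"], batch_size=4)
print(outputs)
```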
Issue: Model doesn't exist or download fails
# Check if model exists (model_info lives in huggingface_hub, not transformers)
from huggingface_hub import model_info
info = model_info("mistral-community/Mistral-7B-Instruct-v0.3")
print(info)
Issue: Slow inference on CPU
CPU inference of 7B models is inherently slow, often only a few tokens per second with unquantized PyTorch weights. Use a smaller model (around 3B) or invest in a GPU.
Key Takeaways
- pipeline() is the fastest way to get started; AutoTokenizer and the AutoModel classes offer fine-grained control.
- On a GPU, specify torch_dtype=torch.float16 to halve VRAM use relative to float32, usually with negligible quality loss.
- Load models on GPU with .to(device) for a 10–50× speed-up over CPU.
- Batch multiple prompts to increase throughput; with three prompts the gain can approach 3×.
- Quantization (8-bit or 4-bit) reduces memory by 50–75% with minimal quality loss.
Frequently Asked Questions
How do I specify which GPU to use?
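Set the CUDA_VISIBLE_DEVICES environment variable before loading the model (ideally before importing torch), or move the model to a specific device with model.to("cuda:1"). Note that device_map="auto" in the from_pretrained() call does not pick a single GPU; it spreads the model across all visible GPUs.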
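The environment-variable approach can be sketched as follows. The GPU index "1" is an arbitrary example value; the important detail is that the variable is set before PyTorch initializes CUDA.

```python
import os

# Must happen before torch initializes CUDA; safest is before `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1 (example value)

# Inside this process, the single visible GPU is then addressed as "cuda:0".
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"Visible GPU indices: {visible}")
```

The same effect is available from the shell by prefixing the command, for example CUDA_VISIBLE_DEVICES=1 python script.py.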
Can I use multiple GPUs?
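Yes. device_map="auto" shards a model's layers across all visible GPUs, which lets you load models too large for a single card. torch.nn.DataParallel instead replicates the whole model on every GPU, so it only helps when the model already fits on one card. For production, use vLLM or TGI, which handle multi-GPU serving for you.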
What models should I start with?
For learning: Phi-3-mini (3.8B), which runs on CPU and allows fast iteration. For production: Mistral-7B, which balances quality and speed. For maximum quality: Llama-2-70B or Qwen-72B, which require high-end GPUs or a cluster.
How do I save a model locally to avoid repeated downloads?
model.save_pretrained("/local/path/to/model")
tokenizer.save_pretrained("/local/path/to/model")
# Later, load from disk
model = AutoModelForCausalLM.from_pretrained("/local/path/to/model")
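A small helper can decide at runtime whether to load from the local copy or fall back to the Hub ID. This is a sketch, assuming the directory was written by save_pretrained (which stores a config.json alongside the weights); `resolve_model_source` is a hypothetical name.

```python
from pathlib import Path


def resolve_model_source(local_dir: str, hub_id: str) -> str:
    """Return local_dir if it looks like a saved model, otherwise the Hub ID."""
    path = Path(local_dir)
    # save_pretrained() writes config.json, so its presence marks a saved model.
    if (path / "config.json").is_file():
        return str(path)
    return hub_id


source = resolve_model_source(
    "/local/path/to/model", "mistral-community/Mistral-7B-Instruct-v0.3"
)
print(f"Loading from: {source}")
# The result can be passed to AutoModelForCausalLM.from_pretrained(source).
```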
Does transformers support inference on CPU efficiently?
Not optimally. PyTorch already uses multiple CPU threads by default, and you can tune the count with torch.set_num_threads(8), but large models remain slow. For faster CPU inference, consider ONNX Runtime, which can be 2–4× faster depending on the model and hardware.
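Thread tuning can be wrapped in a few lines. The sketch below uses torch.set_num_threads and torch.get_num_threads, which are real PyTorch calls; `configure_cpu_threads` is a hypothetical wrapper, and returning the requested count when torch is missing is an assumption made so the snippet runs anywhere.

```python
import os


def configure_cpu_threads(requested=None) -> int:
    """Set PyTorch's intra-op thread count and return the value in effect."""
    count = requested or os.cpu_count() or 1
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to configure.
        return count
    torch.set_num_threads(count)
    return torch.get_num_threads()


threads = configure_cpu_threads(2)
print(f"Using {threads} CPU threads")
```

The best value is usually found by measuring; more threads than physical cores rarely helps.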
Further Reading
- Hugging Face Transformers Documentation — Official API reference.
- Model Hub — 500,000+ models to download.
- Flash Attention 2 Paper — Faster attention mechanism.
- Quantization via BitsAndBytes — 4-bit and 8-bit loading.