Python RAG: What Is It and Why Use It?

Retrieval-Augmented Generation (RAG) is an architecture pattern that retrieves relevant external documents or knowledge in response to a user query, then passes that context to a large language model (LLM) to generate an accurate, grounded answer. Unlike a base LLM that relies solely on its training data, RAG systems dynamically fetch real-time, domain-specific, or proprietary information from a knowledge base, reducing hallucinations and improving factual accuracy. RAG is the practical solution to building knowledge-aware LLM applications without fine-tuning or expensive model training.

I began working with RAG systems in early 2024 when building a customer support assistant for a healthcare platform with thousands of internal policies and guidelines. The base LLM knew general support principles but failed on policy-specific questions; RAG solved this by retrieving the exact policy document at query time, enabling the LLM to answer correctly. This real-world need — connecting LLMs to proprietary knowledge — is why RAG has become the dominant pattern in production LLM systems by 2026.

Why RAG Instead of Fine-Tuning or Prompt Stuffing?

The most common alternatives to RAG are fine-tuning and naive prompt stuffing (pasting all knowledge into the system prompt; not to be confused with prompt injection, which is a security attack). Fine-tuning requires expensive retraining and is inflexible when knowledge changes. Prompt stuffing is simple but hits context-window limits and degrades LLM performance with irrelevant information. RAG retrieves only the most relevant passages, which keeps prompts short and reduces cost and latency.

According to a 2025 State of LLMs report by Databricks, 72% of production LLM applications use retrieval-augmented generation, compared to 18% using fine-tuning alone. RAG's dominance stems from its flexibility: you update the knowledge base without retraining the model, retrieve at inference time for real-time information, and scale to millions of documents using vector search. For a knowledge-intensive task like technical support, document analysis, or research assistance, RAG is not optional — it is the standard.

The RAG Pipeline at a Glance

A RAG system has four core stages:

  1. Chunking: Split documents (PDFs, web pages, databases) into retrievable units of 200–500 tokens called chunks.
  2. Embedding: Convert text to dense vectors (e.g., 384 or 1536 dimensions) using a pretrained embedding model.
  3. Storage: Index vectors in a vector database (Pinecone, Weaviate, Milvus) for fast similarity search.
  4. Retrieval and Generation: At query time, embed the user query, retrieve the top-k most similar chunks, and pass them to an LLM as context before generating the final answer.
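
The chunking stage can be sketched as a sliding window with overlap. This is a minimal illustration rather than the author's implementation: it approximates tokens with whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer), and `chunk_text`, `chunk_size`, and `overlap` are names invented for this example.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words.

    Assumption: words stand in for tokens to keep the sketch dependency-free.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Usage: a 700-word document split into 300-word windows with 50 words of overlap
sample = " ".join(f"w{i}" for i in range(700))
pieces = chunk_text(sample, chunk_size=300, overlap=50)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of storing some text twice.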

Real-World RAG Use Cases

Customer Support and FAQ: A company ingests a knowledge base of articles, policies, and FAQs. When a customer asks a question, RAG retrieves relevant articles and the LLM synthesizes them into a personalized answer. Response time: under 2 seconds; cost per query: typically $0.01–$0.05 depending on model and retrieved chunk count.

Medical and Legal Document Analysis: Healthcare organizations and law firms store case histories, regulations, and precedent documents in a vector database. Clinicians and lawyers query the system to find relevant cases or regulations quickly, with citations to source documents. This can reduce research time substantially, by an estimated 60–80%.

Internal Knowledge Management: Enterprises with thousands of internal wikis, policies, and runbooks use RAG to build searchable, question-answerable interfaces. Instead of keyword search that returns documents, RAG answers questions directly with inline citations.

Research and Literature Review: Researchers ingest papers (via PDF) into a RAG system. They can ask questions like "What are the most cited methods for anomaly detection?" and the system returns a synthesis of relevant papers with citations.
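
Several of the use cases above depend on answers that cite their sources. One way to get there is to number the retrieved chunks in the prompt and ask the model to refer to those numbers. The sketch below only builds the prompt string; the function name `build_cited_prompt`, the dictionary keys `source` and `text`, and the sample file names are hypothetical, and whether the model actually cites faithfully still has to be checked in evaluation.

```python
def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks as numbered sources the model can cite.

    Each chunk is a dict with the (hypothetical) keys "source" and "text".
    """
    lines = [
        "Answer the question using only the sources below.",
        "Cite sources inline as [1], [2], and so on.",
        "If the sources do not contain the answer, say so.",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Usage with two made-up chunks
retrieved = [
    {"source": "refund-policy.md", "text": "Refunds are issued within 14 days of purchase."},
    {"source": "shipping-faq.md", "text": "Standard shipping takes 3 to 5 business days."},
]
print(build_cited_prompt("How long do refunds take?", retrieved))
```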

Code Example: A Minimal RAG Pipeline

Here is a complete, working Python example that demonstrates the RAG concept end-to-end using the OpenAI API and a simple in-memory vector store:

import numpy as np
from openai import OpenAI

# Initialize the OpenAI client. With no arguments it reads the API key
# from the OPENAI_API_KEY environment variable, so no key is hard-coded.
client = OpenAI()

# Sample knowledge base: three short documents
documents = [
    "Python is a high-level programming language known for readability and simplicity.",
    "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
    "Retrieval-augmented generation combines language models with external knowledge sources.",
]

# Step 1: Embed documents (using OpenAI's embedding model)
def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Embed all documents (one API call per document; fine for a demo)
doc_embeddings = [embed_text(doc) for doc in documents]

# Step 2: At query time, embed the user query
query = "What is retrieval-augmented generation?"
query_embedding = embed_text(query)

# Step 3: Find the most similar document using cosine similarity
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_vec, b_vec = np.array(a), np.array(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))

similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
top_doc_idx = int(np.argmax(similarities))  # index of the best-matching document
retrieved_doc = documents[top_doc_idx]

# Step 4: Pass the retrieved document to the LLM with the original query
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": f"Answer the user query based only on this context:\n{retrieved_doc}",
        },
        {
            "role": "user",
            "content": query,
        },
    ],
    max_tokens=200,
)

print("Query:", query)
print("Retrieved context:", retrieved_doc)
print("LLM answer:", response.choices[0].message.content)

When you run this code, the LLM receives the most semantically similar document and can answer the query accurately. In a production system, you would replace the in-memory list with a real vector database and retrieve multiple chunks rather than one.
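
Retrieving several chunks instead of one mostly means ranking every stored vector and keeping the best k. The sketch below shows that step in isolation with plain Python and toy 3-dimensional vectors in place of real embeddings; `top_k` is an illustrative name, and with NumPy the same ranking would typically be a matrix product followed by a sort.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = ((i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for real embeddings
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

for idx, score in top_k(query_vec, doc_vecs, k=2):
    print(idx, round(score, 3))
```

The returned indices can be used to look up the chunk texts, which are then concatenated into the system prompt in place of the single `retrieved_doc`.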

RAG vs. Base LLM: A Comparison

| Aspect | Base LLM | RAG System |
| --- | --- | --- |
| Knowledge source | Training data only | Real-time retrieval + training data |
| Hallucination risk | High for new/niche domains | Low (context-grounded) |
| Update latency | Requires retraining (months) | Instant (update knowledge base) |
| Context window efficiency | Fixed limit | Optimized per query |
| Factual accuracy | 60–80% on domain-specific questions | 85–95% with quality retrieval |
| Cost per query | Model-dependent (e.g., $0.001–$0.02) | Retrieval + model cost ($0.01–$0.05) |
| Implementation complexity | Simple | Moderate (requires infrastructure) |

Key Takeaways

  • RAG is the standard pattern for building production LLM applications that answer questions grounded in real knowledge.
  • RAG systems combine chunking, embedding, vector search, and LLM generation — four stages that work together to retrieve and synthesize information.
  • RAG is superior to fine-tuning for knowledge-intensive tasks because it avoids retraining and handles real-time updates efficiently.
  • Common use cases include customer support, document analysis, knowledge management, and research synthesis.
  • A minimal RAG pipeline requires an embedding model, a vector store (or similarity function), and an LLM — all available as APIs in 2026.

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

Fine-tuning modifies the LLM's weights to encode knowledge, requiring expensive retraining every time knowledge changes. RAG keeps the LLM frozen and retrieves knowledge at query time, so you update the knowledge base instantly without retraining. RAG is faster, cheaper, and more flexible for knowledge that changes frequently.

Can I combine RAG with fine-tuning?

Yes. You can fine-tune an LLM on a general task (e.g., "respond helpfully to support questions") and then augment it with RAG to ground the fine-tuned model in your specific knowledge base. This hybrid approach is common in enterprises and can yield better accuracy than either approach alone.

How much does a RAG system cost to run?

Costs depend on query volume, knowledge base size, and LLM choice. Typical costs: embedding ($0.00002–$0.0002 per 1K tokens), vector storage ($0.05–$1 per million vectors per month), and LLM generation ($0.001–$0.02 per 1K output tokens). A high-volume system processing 1 million queries per month might cost $500–$5,000 with a small model and short contexts; at the $0.01–$0.05 per query cited earlier for customer support, the same volume would cost $10,000–$50,000.
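
A back-of-the-envelope estimate makes these numbers easier to reason about. The helper below is a rough sketch: `monthly_cost` and all of its parameters are hypothetical, the prices passed in are placeholders rather than quotes from any provider, and vector storage and one-time document embedding are left out.

```python
def monthly_cost(
    queries: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    embed_price_per_1k: float,
    query_tokens: int = 50,
) -> float:
    """Estimate monthly spend in dollars for generation plus query embedding."""
    generation = queries * (
        input_tokens_per_query / 1000 * input_price_per_1k
        + output_tokens_per_query / 1000 * output_price_per_1k
    )
    embedding = queries * query_tokens / 1000 * embed_price_per_1k
    return generation + embedding


# Placeholder inputs: 1M queries, 2,000 prompt tokens (query + retrieved chunks),
# 200 answer tokens, and illustrative per-1K-token prices.
estimate = monthly_cost(
    queries=1_000_000,
    input_tokens_per_query=2000,
    output_tokens_per_query=200,
    input_price_per_1k=0.001,
    output_price_per_1k=0.002,
    embed_price_per_1k=0.00002,
)
print(f"${estimate:,.0f} per month")
```

With these placeholder prices the estimate comes to about $2,400 per month, inside the $500–$5,000 range quoted above. Almost all of it is generation cost driven by the prompt length, which is why the number of retrieved chunks matters so much.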

Do I need a vector database, or can I use simpler methods?

For production systems, a vector database (Pinecone, Weaviate, Milvus) is recommended because it handles indexing, scaling, and search efficiently. For prototypes or small knowledge bases (under 100,000 documents), you can use a library like FAISS or PostgreSQL with the pgvector extension. As your system grows, dedicated vector databases are worth the investment.
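
For a sense of what the simplest option looks like, here is a brute-force in-memory store. It is a toy under stated assumptions: `toy_embed` hashes words into buckets purely so the example runs without an API key and is not a meaningful embedding model, and `InMemoryStore` with its `upsert` and `search` methods is a hypothetical interface, not the API of FAISS, pgvector, or any vector database.

```python
import math
import zlib


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hash each word into one of `dim` buckets, then L2-normalize.

    Stand-in for a real embedding model; it captures word overlap only.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryStore:
    """Brute-force vector store: every query scans every stored vector."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], str]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Re-embedding on write is what makes knowledge updates immediate.
        self._items[doc_id] = (toy_embed(text), text)

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        q = toy_embed(query)
        scored = [
            # Dot product of unit vectors equals cosine similarity.
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, (vec, _) in self._items.items()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = InMemoryStore()
store.upsert("refunds", "refunds are issued within 14 days")
store.upsert("shipping", "standard shipping takes five business days")
print(store.search("refunds are issued within 14 days", k=1))
```

A linear scan like this is fine for a few thousand chunks. Dedicated indexes exist because the scan cost grows with every vector added.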

What embedding model should I use?

For most use cases, OpenAI's text-embedding-3-small (1536 dimensions by default) offers a good balance of speed and quality at low cost. For specialized domains (legal, medical), consider evaluating larger or domain-tuned models such as voyage-large-2, or open-source alternatives like all-MiniLM-L6-v2 (384 dimensions). We cover embedding selection in depth in the next article.
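
Embedding dimension also drives storage and search cost, which is worth checking before committing to a model. The arithmetic below uses the two dimensions mentioned in the pipeline overview (384 and 1536) and assumes uncompressed float32 vectors with no metadata or index overhead; `index_size_bytes` is an illustrative helper.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat index of float32 vectors, ignoring metadata and index overhead."""
    return num_vectors * dim * bytes_per_value


for dim in (384, 1536):
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{dim} dims: {size_gb:.2f} GB per million vectors")
```

Under these assumptions a million 384-dimensional vectors take about 1.54 GB and a million 1536-dimensional vectors about 6.14 GB, so the larger model costs four times the memory before any quality difference is considered.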

Further Reading