Building RAG Pipelines: Retrieval-Augmented Generation in Python
RAG pipelines are the backbone of intelligent document Q&A systems. While the basics (retrieve → augment → generate) are simple, production systems require careful attention to retrieval quality, ranking, and context composition. This article covers advanced RAG patterns: multi-hop retrieval, reranking, query optimization, and evaluation.
I deployed a RAG system that looked great in demos but failed on real questions because it retrieved irrelevant documents. After I added multi-query expansion and reranking, accuracy rose from 62% to 89%. The LLM was fine; retrieval was the bottleneck.
Advanced Retrieval Strategies
Multi-Query Expansion: Generate multiple paraphrases of the user's query, then retrieve from all and deduplicate:
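from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# vector_store is assumed to be an existing vector store built from your documents
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# The prompt must be a template with a {question} variable; the retriever
# treats each line of the model's reply as one query
query_prompt = PromptTemplate.from_template(
    """Generate 3 alternative versions of the user's question
to retrieve relevant documents. Focus on synonyms and rephrasing.
Return one version per line.
Question: {question}"""
)
# Expand query into multiple perspectives
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=model,
    prompt=query_prompt,
)
docs = multi_retriever.invoke("What's the best pattern for async code?")
# Internally generates variants such as "How should I write asynchronous Python?"
# and "What are async patterns?", then returns the unique documents across them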
Multi-query improves recall by searching from multiple angles, which helps most when users phrase questions differently from the wording in your indexed documents.
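The "retrieve from all and deduplicate" step can be sketched without any framework. This is a minimal illustration, not LangChain's internal implementation: `toy_search` is a hypothetical stand-in for a vector-store lookup, and the three-sentence corpus is made up.

```python
from typing import Callable, Iterable

def retrieve_union(
    queries: Iterable[str],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Run every query variant and keep each document once, in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:  # deduplicate on exact text
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical toy corpus and search: a document matches if it shares a word with the query.
CORPUS = [
    "asyncio runs coroutines on an event loop",
    "use async def and await for asynchronous functions",
    "threads share memory within a process",
]

def toy_search(query: str) -> list[str]:
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]

variants = ["async patterns", "asyncio event loop", "await asynchronous functions"]
merged_docs = retrieve_union(variants, toy_search)
```

The third variant matches a document that the first variant already found, so it appears only once in `merged_docs`. In a real pipeline you would likely deduplicate on a document ID rather than on raw text.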
Hybrid Search (Dense + Sparse): Combine semantic similarity (embeddings) with keyword matching (BM25):
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Sparse retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
# Dense retriever (semantic)
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
# Combine them
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7],  # BM25 contributes 30%, embeddings 70%
)
docs = ensemble_retriever.invoke("Python asyncio")
# Returns best matches from both sparse and dense, weighted and deduplicated
Hybrid search reduces over-reliance on embeddings. If your docs contain unique terminology such as product names, error codes, or identifiers, exact keyword matching catches what embeddings miss.
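One common way to merge a sparse and a dense ranking is weighted reciprocal rank fusion (RRF). The sketch below shows the idea in plain Python. The document IDs are invented, and the constant `c=60` is a conventional default rather than a value taken from this article. Check your retriever library's documentation for the exact formula it applies.

```python
from collections import defaultdict

def weighted_rrf(
    rankings: list[list[str]],
    weights: list[float],
    c: int = 60,
) -> list[str]:
    """Fuse ranked lists: each doc scores sum(weight / (c + rank)) over the lists containing it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a keyword retriever and an embedding retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
```

Because fusion works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be put on a common scale. A document that appears in both lists, like `doc_b` here, accumulates score from each.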
Reranking: Improving Retrieval Quality
Retrieved documents aren't ranked optimally. Rerankers re-score results by relevance:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # needs the langchain-cohere package and COHERE_API_KEY
# Retrieve a wide candidate set first
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
# Rerank with Cohere
reranker = CohereRerank(
    model="rerank-english-v2.0",
    top_n=3,  # Return top 3 after reranking
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
docs = compression_retriever.invoke("How do I use async/await?")
# Returns 3 documents, reranked by relevance
Reranking increases precision: you keep the documents a relevance model scores highest for the query, not just the nearest embeddings. Cost: one extra reranker call per query.
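The pattern itself, "retrieve wide, then keep the few best after re-scoring", is independent of the reranking model. Here is a minimal sketch in which a deliberately naive word-overlap function stands in for the reranker. `overlap_score` and the candidate texts are hypothetical; a production system would call a cross-encoder or a hosted reranking API instead.

```python
from typing import Callable

def overlap_score(query: str, text: str) -> float:
    """Hypothetical relevance score: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def rerank(
    query: str,
    candidates: list[str],
    top_n: int = 3,
    score: Callable[[str, str], float] = overlap_score,
) -> list[str]:
    """Re-score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ranked[:top_n]

# Candidates in the order a first-stage retriever might have returned them
candidates = [
    "threads and processes in python",
    "how to use async and await in python",
    "await expressions suspend a coroutine",
]
top = rerank("use async await", candidates, top_n=2)
```

The first-stage order put an irrelevant document first. After re-scoring, it drops out of the top two.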
Query Optimization: Preprocessing User Input
Improve retrieval by preprocessing queries:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Step 1: Expand acronyms and clarify vague terms
clarify_prompt = ChatPromptTemplate.from_template("""
Expand any acronyms and clarify vague terms in this query:
Query: {question}
Expanded query (keep it concise):""")
clarify_chain = clarify_prompt | model | StrOutputParser()
# Step 2: Generate multiple query variations
expand_prompt = ChatPromptTemplate.from_template("""
Generate 3 concise, distinct phrasings of this question:
Question: {question}
Phrasings (one per line):""")
expand_chain = expand_prompt | model | StrOutputParser()
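# Use in retrieval (retriever is assumed to be defined as in the earlier examples)
question = "How to use async await?"
clarified = clarify_chain.invoke({"question": question})
expanded = expand_chain.invoke({"question": clarified})
# Retrieve with each expanded query, one phrasing per line
queries = [line.strip() for line in expanded.splitlines() if line.strip()]
docs = [doc for query in queries for doc in retriever.invoke(query)]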
This two-step process cleans input before retrieval, improving document matching.
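The expansion step returns one string, and models often add numbering or bullets even when asked for one phrasing per line. A small parser makes the output safe to feed into retrieval. This is a sketch under that assumption. `parse_phrasings` is a hypothetical helper, and the sample output is made up.

```python
import re

def parse_phrasings(raw: str, original: str) -> list[str]:
    """Split model output into unique queries, stripping list markers.

    The original question is always kept as the first query.
    """
    queries = [original]
    for line in raw.splitlines():
        # Remove leading bullets ("-", "*", "•") or numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned.lower() not in {q.lower() for q in queries}:
            queries.append(cleaned)
    return queries

# Made-up model output with mixed list markers, a blank line, and a duplicate
raw_output = """1. How do I use async and await in Python?
2) What is the syntax for async/await?

- How do I use async and await in Python?"""

queries = parse_phrasings(raw_output, "How to use async await?")
```

Keeping the original question in the list is a design choice: if the rewrites drift from the user's intent, the unmodified query still contributes results.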
Multi-Hop Retrieval: Iterative Refinement
For complex questions requiring multiple retrieval steps:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
# Build the follow-up prompt once instead of on every hop
followup_prompt = ChatPromptTemplate.from_template("""
Based on the context and question, do you need more information?
Question: {question}
Context so far: {context}
If you need more info, generate a follow-up question.
Otherwise respond with DONE.
""")
def multi_hop_retrieve(question, retriever, model, max_hops=3):
    """Iteratively retrieve documents, asking follow-up questions."""
    followup_chain = followup_prompt | model | StrOutputParser()
    context_parts = []
    query = question
    for hop in range(max_hops):
        # Retrieve documents for the current query
        docs = retriever.invoke(query)
        context_parts.extend(doc.page_content for doc in docs)
        # Ask the model whether the original question needs more information
        response = followup_chain.invoke({
            "question": question,
            "context": "\n".join(context_parts),
        })
        if "DONE" in response:
            break
        query = response.strip()  # Use the follow-up as the next query
    return "\n".join(context_parts)
context = multi_hop_retrieve(
    "What are the best practices for testing async code?",
    retriever,
    model,
)
Multi-hop retrieval handles questions that can't be answered from a single document. The model decides when to stop searching.
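To see the control flow without an LLM or a vector store, the loop can be exercised with stubs. Everything below is hypothetical scaffolding: `KB` is a two-entry lookup table playing the retriever, and `toy_next_question` plays the model that decides whether another hop is needed.

```python
from typing import Callable, Optional

def multi_hop(
    question: str,
    search: Callable[[str], list[str]],
    next_question: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Collect unique passages across hops until next_question returns None."""
    passages: list[str] = []
    query = question
    for _ in range(max_hops):
        for passage in search(query):
            if passage not in passages:  # avoid repeating context across hops
                passages.append(passage)
        follow_up = next_question(question, passages)
        if follow_up is None:  # the "model" says it has enough information
            break
        query = follow_up
    return passages

# Stub retriever: exact-match lookup in a tiny made-up knowledge base
KB = {
    "who wrote library x": ["Library X was written by Ada."],
    "where does ada work": ["Ada works at Example Corp."],
}

def toy_search(query: str) -> list[str]:
    return KB.get(query.lower(), [])

# Stub model: asks one follow-up until the employer is in the context
def toy_next_question(question: str, passages: list[str]) -> Optional[str]:
    if not any("works at" in p for p in passages):
        return "where does Ada work"
    return None

passages = multi_hop("Who wrote library X", toy_search, toy_next_question)
```

The `max_hops` cap matters in practice: a model that never signals completion would otherwise loop, and every hop adds both latency and an LLM call.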
Context Compression and Formatting
After retrieval, format context for the model:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
def format_context(documents, max_tokens=1000):
    """Compress context to fit token budget."""
    # Simple approach: concatenate until token budget
    text = ""
    for doc in documents:
        candidate = text + doc.page_content + "\n\n"
        # Rough estimate: 1 token ~= 4 chars
        if len(candidate) / 4 < max_tokens:
            text = candidate
        else:
            break
    return text
# retrieved_docs, user_question, and model are assumed to come from the earlier steps
context = format_context(retrieved_docs, max_tokens=2000)
prompt = ChatPromptTemplate.from_template("""
Answer the question using this context:
{context}
Question: {question}
Answer:""")
result = (prompt | model | StrOutputParser()).invoke({
    "context": context,
    "question": user_question,
})
Formatting controls context size and prevents token overrun.
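A variation worth considering: instead of stopping at the first chunk that does not fit, skip it and keep trying smaller ones, and label each chunk so the model can refer to its sources. The sketch below uses the same rough 4-characters-per-token estimate. `pack_context` is a hypothetical helper, and a real system would count tokens with the model's tokenizer.

```python
def pack_context(chunks: list[str], max_tokens: int, chars_per_token: int = 4) -> str:
    """Pack labeled chunks into a character budget, skipping any that do not fit."""
    budget = max_tokens * chars_per_token
    parts: list[str] = []
    used = 0
    for index, chunk in enumerate(chunks, start=1):
        block = f"[{index}] {chunk}"
        cost = len(block) + 2  # reserve two characters for the blank-line separator
        if used + cost > budget:
            continue  # too large for the remaining budget; try the next chunk
        parts.append(block)
        used += cost
    return "\n\n".join(parts)

# Budget of 20 tokens ~= 80 characters; the middle chunk is too large and is skipped
chunks = ["a" * 30, "b" * 100, "c" * 20]
packed = pack_context(chunks, max_tokens=20)
```

Skipping preserves more of the budget for lower-ranked but shorter chunks. The trade-off is that the context may omit a highly ranked document, so this fits best when chunks are already sorted by a reranker and sizes vary widely.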
RAG Evaluation Metrics
Measure retrieval and generation quality:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
# Evaluate retrieval: does it return relevant documents?
def evaluate_retrieval(question, retrieved_docs, ground_truth_docs):
    """Return precision and recall of retrieval."""
    retrieved_ids = {doc.metadata.get("id") for doc in retrieved_docs}
    truth_ids = {doc.metadata.get("id") for doc in ground_truth_docs}
    hits = len(retrieved_ids & truth_ids)
    # Guard against empty sets to avoid division by zero
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(truth_ids) if truth_ids else 0.0
    return {"precision": precision, "recall": recall}
# Evaluate generation: is the answer correct and supported?
evaluation_prompt = ChatPromptTemplate.from_template("""
Does this answer correctly address the question using the provided context?
Question: {question}
Context: {context}
Answer: {answer}
Rate as CORRECT, PARTIAL, or INCORRECT.""")
# Run eval (question, context, generated_answer, and model come from the pipeline under test)
eval_result = (evaluation_prompt | model | StrOutputParser()).invoke({
    "question": question,
    "context": context,
    "answer": generated_answer,
})
Use these metrics to iterate on RAG design. If retrieval precision is low, improve your embeddings or add reranking; if recall is low, try multi-query expansion or hybrid search.
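One practical detail when scoring generation with an LLM judge: the reply is free text, and the label INCORRECT contains the substring CORRECT, so a naive `"CORRECT" in reply` check misclassifies failures. A whole-word match avoids that. The sketch below is a hypothetical helper with made-up judge replies.

```python
import re
from collections import Counter

def parse_verdict(response: str) -> str:
    """Return the first whole-word verdict in the response, or UNKNOWN if none is found."""
    match = re.search(r"\b(INCORRECT|PARTIAL|CORRECT)\b", response.upper())
    return match.group(1) if match else "UNKNOWN"

# Made-up judge replies in the formats a model might produce
responses = [
    "CORRECT",
    "Rating: INCORRECT because the context never mentions this.",
    "partial - misses the second step",
    "no rating given",
]
tally = Counter(parse_verdict(r) for r in responses)
```

Tracking UNKNOWN separately is deliberate: a rising count of unparseable replies usually means the judge prompt needs tightening, and folding them into INCORRECT would hide that.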
RAG Pipeline Pattern Comparison
| Pattern | Complexity | Recall | Precision | Cost (relative to simple retrieval) |
|---|---|---|---|---|
| Simple retrieval | Very low | Medium | Low | 1x |
| Multi-query | Low | High | Medium | 3–5x |
| Hybrid search | Low | Very high | Medium | 2x |
| Reranking | Medium | High | High | 3–10x |
| Multi-hop | High | Very high | Very high | 5–20x |
Key Takeaways
- Multi-query expansion retrieves from multiple angles to improve recall
- Hybrid search combines semantic and keyword matching for robustness
- Reranking re-scores results to improve precision and relevance
- Query optimization preprocesses user input before retrieval
- Multi-hop retrieval iteratively searches, asking follow-up questions
- Evaluate both retrieval (precision/recall) and generation (correctness)
- Choose patterns based on question complexity and accuracy requirements
Frequently Asked Questions
How do I know if my retrieval is good enough?
Measure precision and recall on a test set. Aim for 80%+ precision (most results are relevant) and 60%+ recall (you're finding the necessary documents). If both are low, adjust chunk size, overlap, or switch to hybrid search.
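As a rough sketch of that check, the snippet below averages per-question precision and recall over a test set and compares the result with the 80% and 60% targets above. The two test cases and their document IDs are invented for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall for one question, with empty sets scoring 0.0."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_metrics(cases: list[tuple[set[str], set[str]]]) -> tuple[float, float]:
    """Average precision and recall across (retrieved, relevant) pairs."""
    pairs = [precision_recall(retrieved, relevant) for retrieved, relevant in cases]
    count = len(pairs)
    avg_precision = sum(p for p, _ in pairs) / count
    avg_recall = sum(r for _, r in pairs) / count
    return avg_precision, avg_recall

# Hypothetical test set: (retrieved IDs, relevant IDs) per question
cases = [
    ({"d1", "d2"}, {"d1", "d2", "d3"}),  # precision 1.0, recall 2/3
    ({"d4", "d5"}, {"d4"}),              # precision 0.5, recall 1.0
]
avg_precision, avg_recall = average_metrics(cases)
meets_targets = avg_precision >= 0.80 and avg_recall >= 0.60
```

Here average precision is 0.75, so the set misses the precision target even though recall clears its bar, which points toward reranking rather than recall-oriented fixes.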
Should I use reranking?
Yes, if accuracy matters. Reranking adds latency (100–500 ms per query) but significantly improves precision. Consider disabling it for latency-sensitive applications.
What's the difference between multi-query and multi-hop?
Multi-query generates variations of the same question and retrieves from all. Multi-hop retrieves, then asks a follow-up question based on the first result. Multi-hop iterates; multi-query is parallel.
How do I handle questions that don't need retrieval?
Add a classifier before retrieval: if the question is factual and general, answer directly. Only retrieve for domain-specific questions. This saves latency.
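The routing idea can be sketched with a keyword heuristic standing in for the classifier. This is only an illustration: `DOMAIN_TERMS` is a made-up vocabulary, and a production router would more likely use a small model or an LLM prompt to decide.

```python
from typing import Callable

# Hypothetical domain vocabulary; replace with terms from your own corpus
DOMAIN_TERMS = {"asyncio", "langchain", "retriever", "embedding"}

def needs_retrieval(question: str, domain_terms: set[str] = DOMAIN_TERMS) -> bool:
    """Return True when the question mentions any domain-specific term."""
    words = {word.strip("?.,!").lower() for word in question.split()}
    return bool(words & domain_terms)

def route(
    question: str,
    retrieve_and_answer: Callable[[str], str],
    answer_directly: Callable[[str], str],
) -> str:
    """Send domain questions through retrieval and everything else straight to the model."""
    handler = retrieve_and_answer if needs_retrieval(question) else answer_directly
    return handler(question)

# Stub handlers that only report which path was taken
path_a = route("How do I configure a LangChain retriever?", lambda q: "rag", lambda q: "direct")
path_b = route("What is the capital of France?", lambda q: "rag", lambda q: "direct")
```

When the classifier is unsure, defaulting to retrieval is usually the safer failure mode: an unnecessary lookup costs latency, while a skipped lookup can cost correctness.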
Can I use RAG with structured data (databases, APIs)?
Yes, but retrieval is different. Create documents from database records, embed them, and index. Or query the database directly based on the user's question (use tools/agents for this).
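For the first option, turning database records into documents, a minimal sketch with the standard-library `sqlite3` module might look like this. The `products` table, its columns, and the dictionary layout are all hypothetical, and the sketch assumes the first selected column is the record ID. The resulting text is what you would pass to your embedding and indexing step.

```python
import sqlite3

def rows_to_documents(conn: sqlite3.Connection, query: str, source: str) -> list[dict]:
    """Render each row as 'column: value' text plus metadata for later filtering."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        text = "; ".join(f"{column}: {value}" for column, value in zip(columns, row))
        # Assumes the first selected column is the record's ID
        documents.append({"text": text, "metadata": {"source": source, "id": row[0]}})
    return documents

# In-memory database with a made-up table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Widget", 9.5), (2, "Gadget", 19.0)],
)
docs = rows_to_documents(
    conn,
    "SELECT id, name, price FROM products ORDER BY id",
    source="products",
)
conn.close()
```

Storing the row ID in metadata lets you fetch the live record after retrieval, which matters when values such as prices change faster than you re-embed.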