Advanced LLM Patterns: RAG and Agents
Retrieval-Augmented Generation (RAG) and agent patterns are two of the most widely used ways to extend LLM applications. RAG augments language models with external knowledge by retrieving relevant documents and injecting them into prompts, enabling models to answer questions about proprietary data or facts beyond their training cutoff. Agents are systems where the LLM decides which tool to call next, iteratively reasoning and taking actions until the task is solved. Together, RAG and agents turn LLMs from passive text generators into active systems that can search, reason, and take action on real-world problems.
Retrieval-Augmented Generation (RAG): The Concept
RAG addresses a core LLM limitation: the model only knows what was in its training data, so it tends to hallucinate when asked about recent, niche, or proprietary facts. Instead of relying solely on the model's knowledge, RAG retrieves relevant documents from an external source (database, search engine, knowledge base) and includes them in the prompt. The model then generates answers grounded in the retrieved facts:
from openai import OpenAI
import re

client = OpenAI()

# Simulated knowledge base (in practice, this is a vector database)
KNOWLEDGE_BASE = {
    "Python list": "A Python list is an ordered, mutable collection of items. Lists are created with square brackets: my_list = [1, 2, 3]. Items are accessed by index: my_list[0] returns the first item.",
    "Python tuple": "A Python tuple is an ordered, immutable collection of items. Tuples are created with parentheses: my_tuple = (1, 2, 3). Tuples cannot be modified after creation.",
    "Python dictionary": "A Python dictionary is a collection of key-value pairs that preserves insertion order (Python 3.7+). Dictionaries are created with curly braces: my_dict = {'key': 'value'}. Items are accessed by key: my_dict['key']."
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_context(query, top_k=2):
    """Simple keyword-based retrieval (in practice, use vector similarity)."""
    query_words = tokenize(query)
    results = []
    for title, content in KNOWLEDGE_BASE.items():
        # Very naive: count query words that appear as whole words in the title.
        # Whole-word matching avoids false hits such as "is" inside "list".
        matches = len(query_words & tokenize(title))
        if matches > 0:
            results.append((title, content, matches))
    # Return top-k results
    results.sort(key=lambda x: x[2], reverse=True)
    return results[:top_k]

def rag_query(user_question):
    """Answer a question using retrieved context."""
    # Step 1: Retrieve relevant documents
    retrieved_docs = retrieve_context(user_question)
    # Step 2: Augment prompt with retrieved context
    context_text = "\n\n".join(
        f"[{title}]\n{content}" for title, content, _ in retrieved_docs
    )
    augmented_prompt = f"""Use the following context to answer the question.
If the context does not contain the answer, say so.

Context:
{context_text}

Question: {user_question}"""
    # Step 3: Generate answer using the model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful Python tutor. Answer based on the provided context."},
            {"role": "user", "content": augmented_prompt}
        ]
    )
    return response.choices[0].message.content

# Usage
answer = rag_query("What is a Python list?")
print(answer)
# The answer should now be grounded in the retrieved context
print("\n---\n")
answer2 = rag_query("Difference between list and tuple?")
print(answer2)
RAG is powerful because it shifts the LLM's role from a knowledge source to a reasoner. The model focuses on synthesizing retrieved information rather than recalling facts. This reduces hallucination and allows answering questions about domain-specific or proprietary data.
Vector Databases and Semantic Search
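Keyword-based retrieval is fragile: it misses synonyms and paraphrases. Semantic search addresses this by embedding documents and queries as vectors, then ranking documents by vector similarity. Frameworks such as LangChain wrap these steps, but the mechanics fit in a few lines of plain Python: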
from openai import OpenAI
import numpy as np

client = OpenAI()

# In practice, use a vector database (Pinecone, Weaviate, Milvus, or Chroma)
class SimpleVectorStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs):
        """Embed and store documents."""
        for doc in docs:
            # Get embedding from OpenAI (or use a local model for speed)
            response = client.embeddings.create(
                input=doc["content"],
                model="text-embedding-3-small"
            )
            embedding = response.data[0].embedding
            self.documents.append(doc)
            self.embeddings.append(embedding)

    def search(self, query, top_k=2):
        """Retrieve top-k documents similar to query."""
        # Embed the query with the same model used for the documents
        response = client.embeddings.create(
            input=query,
            model="text-embedding-3-small"
        )
        query_embedding = response.data[0].embedding
        # Compute similarity (cosine) to all documents
        similarities = []
        for i, doc_embedding in enumerate(self.embeddings):
            # Cosine similarity: dot product divided by the product of the norms
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append((i, similarity))
        # Return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [self.documents[i] for i, _ in similarities[:top_k]]

# Usage
store = SimpleVectorStore()
documents = [
    {"title": "Python Lists", "content": "Lists are ordered, mutable collections..."},
    {"title": "Python Tuples", "content": "Tuples are ordered, immutable sequences..."},
    {"title": "Python Dictionaries", "content": "Dicts are key-value pairs..."}
]
store.add_documents(documents)
results = store.search("ordered immutable collection", top_k=1)
print(f"Best match: {results[0]['title']}")
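Vector databases handle storage, indexing, and similarity search efficiently; some can also call an embedding model for you, but embedding is otherwise a separate step in your pipeline. For production, use managed services like Pinecone or self-hosted solutions like Chroma (lightweight) or Milvus (scalable).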
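The per-document loop in the store above can also be written as a single matrix operation, which is closer to what a real index does before it adds approximate-search structures. The following is a minimal sketch, assuming embeddings fit in memory as a NumPy array; the function name top_k_cosine and the 3-dimensional toy vectors are made up for illustration.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=2):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    docs = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Scale every row and the query to unit length, so a dot product
    # between them equals their cosine similarity.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    # argsort is ascending, so reverse it and keep the first k positions
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()

# Toy 3-dimensional "embeddings" (illustrative values, not real model output)
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
indices, scores = top_k_cosine([1.0, 0.1, 0.0], doc_vectors, k=2)
print(indices)  # [0, 2]
```

Normalizing once at indexing time and storing the unit vectors would avoid repeating that work on every query.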
Agent Loop: Reasoning and Tool Calling
Agents combine language models with tools in a loop: the model decides which tool to call, you execute it, and the model uses the result to decide next steps:
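from openai import OpenAI
import ast
import json
import operator

client = OpenAI()

# Arithmetic operators the calculator is allowed to evaluate
ALLOWED_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(node):
    """Evaluate a parsed arithmetic expression, rejecting anything else."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPERATORS:
        return ALLOWED_OPERATORS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_OPERATORS:
        return ALLOWED_OPERATORS[type(node.op)](safe_eval(node.operand))
    raise ValueError("Unsupported expression")

def calculate(expression):
    """Evaluate an arithmetic expression without calling eval() on model output."""
    try:
        tree = ast.parse(expression, mode="eval")
        return json.dumps({"result": safe_eval(tree.body)})
    except Exception as e:
        return json.dumps({"error": str(e)})

def get_weather(location):
    """Simulated weather API."""
    weather_db = {
        "New York": "Sunny, 72°F",
        "London": "Cloudy, 55°F"
    }
    return json.dumps(weather_db.get(location, "Unknown location"))

def agent_loop(user_task, max_iterations=10):
    """Run an agentic reasoning loop."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Evaluate a mathematical expression",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {"type": "string", "description": "Arithmetic expression using +, -, *, / and parentheses, e.g. '0.25 * 200'"}
                    },
                    "required": ["expression"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Use available tools to answer questions. Think step-by-step."},
        {"role": "user", "content": user_task}
    ]
    for iteration in range(max_iterations):
        # Step 1: Ask the model what to do
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools
        )
        message = response.choices[0].message
        # Check if the model is done (no tool calls means a final answer)
        if not message.tool_calls:
            print(f"Agent finished (iteration {iteration + 1}):")
            print(message.content)
            break
        # Step 2: Execute tool calls
        # The assistant message with the tool calls must precede the tool results
        messages.append(message)
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            print(f"Calling {func_name}({func_args})")
            if func_name == "calculate":
                result = calculate(func_args["expression"])
            elif func_name == "get_weather":
                result = get_weather(func_args["location"])
            else:
                result = json.dumps({"error": "Unknown function"})
            # Each result is linked to its request through tool_call_id
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })
    else:
        # The loop ran out of iterations without a final answer
        print(f"Stopped after {max_iterations} iterations without a final answer.")

# Usage
agent_loop("What is 25% of 200, and what is the weather in New York?")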
The agent loop continues until the model responds without requesting a tool, or until max_iterations is reached, which guards against runaway loops. This pattern scales to complex, multi-step workflows: data analysis, research, planning, problem-solving.
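As the number of tools grows, an if/elif chain inside the loop becomes hard to maintain. One common alternative is a registry that maps tool names to Python callables. The sketch below assumes tools accept keyword arguments and return JSON-serializable values; TOOL_REGISTRY, dispatch_tool_call, and add_numbers are hypothetical names introduced for illustration.

```python
import json

def add_numbers(a, b):
    """Hypothetical tool: add two numbers."""
    return {"result": a + b}

def get_weather(location):
    """Simulated weather lookup, mirroring the example above."""
    weather_db = {"New York": "Sunny, 72°F", "London": "Cloudy, 55°F"}
    return {"weather": weather_db.get(location, "Unknown location")}

# Map tool names (as declared to the model) to Python callables
TOOL_REGISTRY = {
    "add_numbers": add_numbers,
    "get_weather": get_weather,
}

def dispatch_tool_call(name, arguments_json):
    """Run one tool call and always return a JSON string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        kwargs = json.loads(arguments_json)
        return json.dumps(func(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names: report the problem to the
        # model instead of crashing the loop, so it can retry.
        return json.dumps({"error": str(exc)})

print(dispatch_tool_call("add_numbers", '{"a": 2, "b": 3}'))
print(dispatch_tool_call("delete_files", "{}"))
```

Returning errors as tool results, rather than raising, gives the model a chance to correct its own call on the next iteration.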
Hybrid Approach: RAG with Agents
Combining RAG and agents lets the model control retrieval: retrieval is exposed as a tool, the agent decides whether and what to search for, and then it reasons over the documents that come back:
from openai import OpenAI
import json
import re

client = OpenAI()

# Knowledge base
KB = {
    "Python async": "Async/await enables concurrent I/O without threads. Use 'async def' and 'await'.",
    "Python threads": "Threads run concurrently but share memory. Use threading module. GIL limits CPU parallelism."
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_documents(query):
    """Retrieve docs whose title shares at least one whole word with the query."""
    query_words = tokenize(query)
    results = []
    for title, content in KB.items():
        if query_words & tokenize(title):
            results.append({"title": title, "content": content})
    return json.dumps(results)

def hybrid_agent(user_question):
    """Agent that uses RAG to answer questions."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "retrieve_documents",
                "description": "Search the knowledge base for relevant documents",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"]
                }
            }
        }
    ]
    messages = [
        {"role": "system", "content": "You are a Python expert. Use the knowledge base to answer questions."},
        {"role": "user", "content": user_question}
    ]
    # First turn: decide what to retrieve
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools
    )
    message = response.choices[0].message
    if message.tool_calls:
        messages.append(message)
        # Execute retrieval
        for tool_call in message.tool_calls:
            query = json.loads(tool_call.function.arguments)["query"]
            docs = retrieve_documents(query)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": docs
            })
        # Second turn: answer based on retrieved docs
        final_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return final_response.choices[0].message.content
    # The model answered directly without retrieving anything
    return message.content

# Usage
answer = hybrid_agent("Difference between async and threads in Python?")
print(answer)
This pattern combines RAG's grounded reasoning with agents' adaptive retrieval strategy. The agent decides what knowledge is relevant and retrieves it on demand.
Key Takeaways
- RAG retrieves external documents and injects them into prompts, reducing hallucination and enabling domain-specific knowledge.
- Vector databases enable semantic search: embed documents and queries, find similar vectors.
- Agents use tool calling to decide what to do next, reasoning across multiple steps.
- Combine RAG and agents for adaptive systems that retrieve knowledge on demand and reason over it.
- Test RAG systems on a held-out evaluation set to measure retrieval accuracy and answer quality.
- For production, use managed vector databases (Pinecone, Weaviate) or self-hosted (Chroma, Milvus).
Frequently Asked Questions
How do I evaluate RAG system quality?
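Measure retrieval accuracy (did the right documents get retrieved?) and answer accuracy (is the final answer correct?) separately. For retrieval, use ranking metrics such as recall@k or NDCG (normalized discounted cumulative gain). For answers, BLEU/ROUGE measure word overlap with a reference answer, so they can penalize a correct answer that is phrased differently; treat them as a rough signal and manually rate a sample of results as well.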
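The two retrieval metrics can be computed in a few lines. This is a minimal sketch, assuming each evaluation case records the ranked list of retrieved document IDs and a hand-labeled set of relevant documents; the document IDs and relevance labels below are hypothetical.

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved_ids, relevance, k):
    """NDCG@k. `relevance` maps doc id -> gain; unlisted documents count as 0."""
    def dcg(gains):
        # The item at 0-based position i is discounted by log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    actual = dcg([relevance.get(doc_id, 0) for doc_id in retrieved_ids[:k]])
    # The ideal ranking places the highest gains first
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical evaluation case: two relevant documents, one ranked last
retrieved = ["doc_tuple", "doc_dict", "doc_list"]
relevance = {"doc_list": 1, "doc_tuple": 1}
print(recall_at_k(retrieved, relevance.keys(), k=2))  # 0.5
print(round(ndcg_at_k(retrieved, relevance, k=3), 3))  # 0.92
```

Averaging these scores over the whole evaluation set gives a single number to track as the retriever changes.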
What is the latency of RAG systems?
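RAG adds retrieval latency to the LLM latency. The vector similarity search itself typically takes 10–100 ms; embedding the query adds more if it requires a network call. Total response time is usually 1–3 seconds, most of it spent in generation. For interactive applications, use caching and async retrieval to minimize perceived latency.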
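Caching is often the cheapest of these optimizations to add. Below is a minimal sketch using functools.lru_cache from the standard library; slow_retrieve is a hypothetical stand-in for the embedding call plus vector search, and the CALLS counter exists only to show when the slow path runs.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the slow path actually runs

def slow_retrieve(query):
    """Stand-in for an embedding call plus vector search (hypothetical)."""
    CALLS["count"] += 1
    # Return a tuple: cached values are shared, so they should be immutable
    return (f"document about {query}",)

def normalize_query(query):
    """Collapse case and whitespace so trivially different queries share a cache entry."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=1024)
def _cached_retrieve(normalized_query):
    return slow_retrieve(normalized_query)

def retrieve(query):
    return _cached_retrieve(normalize_query(query))

retrieve("Python  list")
retrieve("python list")
print(CALLS["count"])  # 1
```

When the knowledge base is re-indexed, cached results become stale; calling _cached_retrieve.cache_clear() at that point is one simple way to invalidate them.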
Can I use RAG with streaming?
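Yes, but retrieval has to finish first, because the retrieved context is part of the prompt. Retrieve documents, build the augmented prompt, then stream the model's response token by token. Streaming does not shorten retrieval; it reduces the wait for the first visible output once generation starts. To keep the user engaged during retrieval, show a status indicator such as "searching".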
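A minimal sketch of streaming in a RAG pipeline, assuming retrieval completes before generation starts because the retrieved context is part of the prompt. The names answer_with_streaming, retrieve_fn, and stream_fn are hypothetical; openai_stream shows one possible stream_fn built on the OpenAI SDK's stream=True option, and the demo uses offline stand-ins so it runs without an API key.

```python
def answer_with_streaming(question, retrieve_fn, stream_fn):
    """Retrieve context first, then yield answer chunks as they arrive."""
    # The context must be complete before the prompt can be built
    docs = retrieve_fn(question)
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    yield from stream_fn(prompt)

def openai_stream(prompt, model="gpt-4o-mini"):
    """One possible stream_fn, using the OpenAI SDK's streaming mode."""
    from openai import OpenAI  # imported lazily so the demo below runs without the SDK
    client = OpenAI()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no text (for example the final one), so skip those
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Offline demo with stand-in functions (both hypothetical)
events = []

def fake_retrieve(question):
    events.append("retrieved")
    return ["Lists are mutable.", "Tuples are immutable."]

def fake_stream(prompt):
    for word in ["Lists", " can", " change."]:
        events.append("chunk")
        yield word

answer = "".join(answer_with_streaming("List vs tuple?", fake_retrieve, fake_stream))
print(answer)     # Lists can change.
print(events[0])  # retrieved
```

In a real application, passing openai_stream as stream_fn and printing each chunk as it arrives would give the user incremental output.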
How do I keep the knowledge base fresh?
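Use document versioning: give each document a stable ID and store a version or last-updated timestamp with it. Keep the timestamp as metadata next to the embedding rather than inside the embedded text, so you can filter or sort by recency at query time. On a schedule, or whenever a source changes, embed new and modified documents and delete the entries for documents that no longer exist.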
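One way to keep re-indexing cheap is to re-embed a document only when its content has changed. The sketch below assumes timestamps are kept as metadata alongside each embedding and uses a content hash to detect changes; FreshIndex and fake_embed are hypothetical stand-ins for a real vector store and embedding model.

```python
import hashlib
from datetime import datetime, timezone

class FreshIndex:
    """In-memory stand-in for a vector store that tracks document freshness."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: text -> vector
        self.records = {}         # doc_id -> record dict

    def upsert(self, doc_id, content):
        """Re-embed only when the content changed. Returns True if re-embedded."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        existing = self.records.get(doc_id)
        if existing and existing["hash"] == digest:
            return False  # unchanged: skip the embedding call
        self.records[doc_id] = {
            "content": content,
            "hash": digest,
            "embedding": self.embed_fn(content),
            # The timestamp is metadata, stored next to the embedding
            "updated_at": datetime.now(timezone.utc),
        }
        return True

    def remove_missing(self, current_ids):
        """Delete records whose source document no longer exists."""
        stale = set(self.records) - set(current_ids)
        for doc_id in stale:
            del self.records[doc_id]
        return sorted(stale)

def fake_embed(text):
    """Placeholder embedding: a real system would call an embedding model."""
    return [float(len(text))]

index = FreshIndex(fake_embed)
print(index.upsert("lists", "Lists are mutable."))              # True
print(index.upsert("lists", "Lists are mutable."))              # False
print(index.upsert("lists", "Lists are ordered and mutable."))  # True
index.upsert("old", "Deprecated page.")
print(index.remove_missing(["lists"]))                          # ['old']
```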
What if the retrieved documents are contradictory?
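The LLM may merge them, pick one side, or note the contradiction, and which of these happens is not predictable by default. To make the behavior explicit, label each retrieved document with a source identifier (and a date, if available) in the prompt, and instruct the model to cite the source for each claim and to point out disagreements between sources.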
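A minimal sketch of a prompt builder that numbers each retrieved document so the model can cite it. The function name build_cited_prompt, the instruction wording, and the two sample documents are assumptions for illustration, not a prescribed format.

```python
def build_cited_prompt(question, docs):
    """Format retrieved docs with numbered source labels and ask for citations."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        blocks.append(f"[{number}] {doc['title']}\n{doc['content']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the sources below.\n"
        "Cite sources by number, e.g. [1]. If sources disagree, "
        "say so and cite each side.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

# Two hypothetical documents that contradict each other
docs = [
    {"title": "Style guide 2021", "content": "Use tabs for indentation."},
    {"title": "Style guide 2024", "content": "Use four spaces for indentation."},
]
prompt = build_cited_prompt("How should I indent code?", docs)
print(prompt)
```

Including a date or version in each title, as in this example, gives the model a basis for preferring the newer source.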