Evaluating RAG Quality: Metrics for Retrieval and Generation

Evaluating a RAG system requires measuring both retrieval quality (did you retrieve the right documents?) and generation quality (did the LLM synthesize them correctly?). Unlike traditional NLP tasks with fixed metrics, RAG evaluation is nuanced: a retrieved document can be marginally relevant, an LLM answer can be partially correct, and ground truth is often subjective. This article teaches you how to implement industry-standard metrics and design evaluation pipelines that catch real quality issues before production.

I evaluated a customer-facing RAG chatbot using only user feedback ("thumbs up/down") and discovered a 10% false negative rate: relevant documents were retrieved but ranked 6th-10th (outside the top-5). Implementing retrieval metrics (precision@5, recall) would have caught this immediately. Combining automated metrics with human review is the pragmatic approach in 2026.

Retrieval Metrics: Did We Find the Right Documents?

The goal of retrieval is simple: return the k most relevant documents for a query. Standard metrics:

Precision@k

Precision at rank k is the fraction of top-k retrieved results that are relevant:

Precision@k = (# relevant docs in top-k) / k

Interpretation: If 4 out of 5 retrieved documents are relevant, Precision@5 = 0.8.

Recall

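Recall is the fraction of all relevant documents that you retrieved. In practice it is measured over the top-k results (Recall@k), so the value depends on the cutoff:

Recall@k = (# relevant docs in top-k) / (total # relevant docs in knowledge base)

Interpretation: If there are 10 relevant documents and 8 of them appear in the retrieved results, Recall = 0.8.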
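
To illustrate how the cutoff affects recall, here is a minimal sketch that computes recall over the top-k results for two values of k. The helper name `recall_at_k` and the document IDs are illustrative assumptions, not part of any library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranking: two of the three relevant docs sit below rank 5.
retrieved = [11, 42, 17, 8, 23, 7, 99, 31, 5, 64]
relevant = [42, 7, 31]

for k in (5, 10):
    print(f"recall@{k}: {recall_at_k(retrieved, relevant, k):.3f}")
# recall@5: 0.333
# recall@10: 1.000
```

A large gap between recall@5 and recall@10 suggests that relevant documents are being found but ranked too low, which is the failure mode described in the introduction.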

Mean Reciprocal Rank (MRR)

MRR captures how early the first relevant result appears. For a single query, the reciprocal rank (RR) is:

RR = 1 / rank_of_first_relevant_doc

MRR is the mean of RR over all queries. Interpretation: If the first relevant doc is at rank 1, RR = 1. If it is at rank 5, RR = 0.2. If no relevant doc is retrieved, RR = 0.
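
Averaging over several queries can be sketched as follows. The function names and the per-query data are hypothetical and only illustrate the arithmetic.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for three queries.
runs = [
    ([3, 1, 2], [3]),  # first relevant doc at rank 1 -> 1.0
    ([5, 6, 7], [7]),  # first relevant doc at rank 3 -> 1/3
    ([8, 9], [4]),     # nothing relevant retrieved   -> 0.0
]
print(f"MRR: {mean_reciprocal_rank(runs):.3f}")  # MRR: 0.444
```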

Normalized Discounted Cumulative Gain (NDCG)

NDCG accounts for graded relevance (marginally relevant vs. highly relevant) and position decay:

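DCG@k = sum(rel_i / log2(i + 1)) for i=1 to k, where rel_i is the relevance grade of the result at rank i
IDCG@k = DCG@k of the ideal ranking (judged documents sorted by descending relevance)
NDCG@k = DCG@k / IDCG@k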

NDCG ranges 0–1, where 1 = perfect ranking.

Code Example: Computing Retrieval Metrics

from typing import Optional

import numpy as np


def compute_retrieval_metrics(
    retrieved_doc_ids: list[int],
    relevant_doc_ids: list[int],
    k: int = 5,
    relevance_scores: Optional[dict] = None,
) -> dict:
    """
    Compute retrieval metrics for a single query.

    Args:
        retrieved_doc_ids: Ranked list of retrieved document IDs.
        relevant_doc_ids: IDs of all truly relevant documents.
        k: Compute metrics up to rank k.
        relevance_scores: Optional dict mapping doc_id to relevance (0-3).

    Returns:
        Dictionary of metrics.
    """
    retrieved_k = set(retrieved_doc_ids[:k])
    relevant_set = set(relevant_doc_ids)

    # Precision@k
    true_positives = len(retrieved_k & relevant_set)
    precision_k = true_positives / k if k > 0 else 0

    # Recall (within the top-k)
    recall = true_positives / len(relevant_set) if relevant_set else 0

    # F1 Score
    f1 = 2 * (precision_k * recall) / (precision_k + recall) if (precision_k + recall) > 0 else 0

    # Reciprocal rank of the first relevant result (its mean over queries is MRR)
    mrr = 0
    for rank, doc_id in enumerate(retrieved_doc_ids, 1):
        if doc_id in relevant_set:
            mrr = 1 / rank
            break

    # Gain of a document: graded relevance if provided, otherwise binary
    def gain(doc_id):
        if relevance_scores and doc_id in relevance_scores:
            return relevance_scores[doc_id]
        return 1 if doc_id in relevant_set else 0

    # DCG over the top-k results
    dcg = sum(
        gain(doc_id) / np.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_doc_ids[:k], 1)
    )

    # Ideal DCG: the k largest gains among all judged documents, best first
    judged_ids = relevant_set | set(relevance_scores or {})
    ideal_gains = sorted((gain(d) for d in judged_ids), reverse=True)[:k]
    idcg = sum(g / np.log2(rank + 1) for rank, g in enumerate(ideal_gains, 1))
    ndcg = dcg / idcg if idcg > 0 else 0

    return {
        "precision@k": precision_k,
        "recall": recall,
        "f1": f1,
        "mrr": mrr,
        "ndcg@k": ndcg,
    }


# Example: Evaluate a single query
retrieved = [1, 5, 3, 7, 2]  # Ranked list of retrieved doc IDs
relevant = [1, 3, 4, 5]  # All truly relevant docs
relevance_scores = {1: 3, 3: 2, 4: 1, 5: 3}  # Graded relevance (0-3)

metrics = compute_retrieval_metrics(retrieved, relevant, k=5, relevance_scores=relevance_scores)
print("Retrieval Metrics:")
for metric, value in metrics.items():
    print(f"  {metric}: {value:.3f}")

Output:

Retrieval Metrics:
  precision@k: 0.600
  recall: 0.750
  f1: 0.667
  mrr: 1.000
  ndcg@k: 0.932

Generation Metrics: Did the LLM Synthesize Well?

Evaluating LLM output is harder because answers can be correct in multiple ways. Standard metrics:

BLEU (Bilingual Evaluation Understudy)

BLEU measures n-gram precision: the share of the generated text's n-grams that also appear in a reference, combined with a penalty for overly short outputs (0–1 scale). Higher = more similar to reference.

from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "quick", "brown", "fox"]]
hypothesis = ["the", "quick", "brown", "fox"]

bleu_score = sentence_bleu(reference, hypothesis)
print(f"BLEU: {bleu_score:.3f}") # Output: 1.000

Pros: Fast, language-agnostic.
Cons: Penalizes paraphrases; requires reference answers.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures how many of the reference text's n-grams appear in the generated text (recall-oriented overlap):

from rouge import Rouge  # the `rouge` package from PyPI

rouge = Rouge()
reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "A quick brown fox jumps over a lazy dog"

scores = rouge.get_scores(hypothesis, reference)
print(f"ROUGE-1: {scores[0]['rouge-1']['f']:.3f}")  # F1 score

Pros: Better for summaries; captures partial credit.
Cons: Still requires reference text.
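
As a dependency-free illustration of what ROUGE-1 computes, the sketch below counts clipped unigram overlap on lowercased whitespace tokens. The function `rouge_1` is a simplified stand-in written for this article, not the `rouge` package's implementation. Real implementations may tokenize or stem differently and report different numbers.

```python
from collections import Counter


def rouge_1(hypothesis: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(1, sum(hyp_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}


reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "A quick brown fox jumps over a lazy dog"
scores = rouge_1(hypothesis, reference)
print(f"ROUGE-1 recall: {scores['r']:.3f}")  # 7 of 9 reference tokens matched
```

Here only the articles differ ("the" vs. "a"), so 7 of the 9 reference tokens are recovered.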

Semantic Similarity (BERTScore)

BERTScore compares embeddings of hypothesis and reference, capturing semantic meaning rather than surface form:

from bert_score import score

references = ["the quick brown fox jumps over the lazy dog"]
hypothesis = ["a nimble reddish vulpine creature leaps across a slothful canine"]

p, r, f = score(hypothesis, references, lang="en", verbose=True)
print(f"BERTScore F1: {f[0].item():.3f}")  # Exact value depends on the model and on baseline rescaling

Pros: Captures synonyms, paraphrases.
Cons: Slower; requires embeddings.

Code Example: Evaluating LLM Answers

import numpy as np
from nltk.translate.bleu_score import sentence_bleu


def evaluate_generated_answer(
    query: str,
    generated_answer: str,
    reference_answers: list[str],
    retrieved_context: str = "",
) -> dict:
    """
    Evaluate LLM-generated answer against references.

    Args:
        query: Original user query.
        generated_answer: LLM output.
        reference_answers: Ground-truth answers (1+ references).
        retrieved_context: Retrieved documents (for factuality check).

    Returns:
        Dictionary of evaluation scores.
    """
    # BLEU (n-gram overlap) on lowercased whitespace tokens
    generated_tokens = generated_answer.lower().split()
    reference_tokens = [ref.lower().split() for ref in reference_answers]
    bleu = sentence_bleu(reference_tokens, generated_tokens)

    # Length ratio (penalize very short or very long answers)
    avg_ref_length = np.mean([len(ref.split()) for ref in reference_answers])
    gen_length = len(generated_tokens)
    longer = max(gen_length, avg_ref_length)
    length_ratio = min(gen_length, avg_ref_length) / longer if longer > 0 else 0.0

    # Context overlap: share of answer tokens that also occur in the retrieved context
    context_tokens = set(retrieved_context.lower().split())
    context_overlap = sum(
        1 for token in generated_tokens if token in context_tokens
    ) / max(1, len(generated_tokens))

    # Composite score (the weights are a heuristic, not a standard)
    composite_score = 0.5 * bleu + 0.2 * length_ratio + 0.3 * context_overlap

    return {
        "bleu": round(float(bleu), 3),
        "length_ratio": round(float(length_ratio), 3),
        "context_overlap": round(float(context_overlap), 3),
        "composite_score": round(float(composite_score), 3),
    }


# Example
query = "What is Python?"
generated = "Python is a high-level programming language known for its simplicity."
references = [
    "Python is a high-level interpreted language.",
    "Python is a programming language emphasizing readability.",
]
context = "Python is a language. It is popular. Used for data science."

scores = evaluate_generated_answer(query, generated, references, context)
print("Generation Metrics:")
for metric, value in scores.items():
    print(f"  {metric}: {value}")

Output:

Generation Metrics:
  bleu: 0.312
  length_ratio: 0.65
  context_overlap: 0.4
  composite_score: 0.406

Note that whitespace tokenization keeps punctuation attached, so "language" in the answer does not match "language." in the context.

End-to-End RAG Evaluation Pipeline

Combine retrieval and generation metrics:

from typing import Callable

import numpy as np


def evaluate_rag_system(
    test_cases: list[dict],
    rag_system: Callable,
    k: int = 5,
) -> dict:
    """
    Evaluate a full RAG system on a test set.

    Args:
        test_cases: List of {query, relevant_docs, reference_answers}.
        rag_system: Function (query) -> (retrieved_docs, answer).
        k: Evaluate top-k retrievals.

    Returns:
        Aggregated metrics.
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    retrieval_metrics = []
    generation_metrics = []

    for test_case in test_cases:
        query = test_case["query"]
        relevant_docs = test_case["relevant_docs"]
        reference_answers = test_case["reference_answers"]

        # Run RAG system
        retrieved_docs, answer = rag_system(query)

        # Evaluate retrieval
        ret_metric = compute_retrieval_metrics(
            [doc["id"] for doc in retrieved_docs],
            relevant_docs,
            k=k,
        )
        retrieval_metrics.append(ret_metric)

        # Evaluate generation
        gen_metric = evaluate_generated_answer(
            query, answer, reference_answers,
            retrieved_context=" ".join(doc["content"] for doc in retrieved_docs),
        )
        generation_metrics.append(gen_metric)

    # Aggregate: mean of each metric across test cases
    avg_retrieval = {
        metric: float(np.mean([m[metric] for m in retrieval_metrics]))
        for metric in retrieval_metrics[0]
    }

    avg_generation = {
        metric: float(np.mean([m[metric] for m in generation_metrics]))
        for metric in generation_metrics[0]
    }

    return {
        "retrieval": avg_retrieval,
        "generation": avg_generation,
        "num_test_cases": len(test_cases),
    }


# Example: Simple RAG system for testing
def dummy_rag(query: str):
    # Simulated retrieval and generation
    return (
        [{"id": 1, "content": "Sample doc"}],
        f"Answer to: {query}",
    )


test_cases = [
    {
        "query": "What is Python?",
        "relevant_docs": [1, 2],
        "reference_answers": ["Python is a programming language.", "Python is high-level."],
    }
]

results = evaluate_rag_system(test_cases, dummy_rag)
print(results)

Human Evaluation: The Gold Standard

Automated metrics have limits. A human evaluator assesses:

  1. Retrieval relevance: Is the document relevant to the query? (Yes/No or 0–3 scale)
  2. Answer correctness: Is the answer accurate and complete? (0–3 scale)
  3. Answer helpfulness: Would a user find this answer useful? (0–3 scale)
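
Once ratings are collected, they can be averaged per criterion and rescaled to 0–1 so they sit alongside the automated metrics. The sketch below assumes ratings are coded on the 0–3 scale from the list above. The field names and the sample ratings are hypothetical.

```python
from statistics import mean

# Hypothetical ratings on the 0-3 scale: one dict per rated example.
ratings = [
    {"relevance": 3, "correctness": 2, "helpfulness": 2},
    {"relevance": 2, "correctness": 2, "helpfulness": 1},
    {"relevance": 3, "correctness": 3, "helpfulness": 3},
]


def summarize_ratings(rows, scale_max=3):
    """Average each criterion and rescale it to the 0-1 range."""
    criteria = rows[0].keys()
    return {c: mean(r[c] for r in rows) / scale_max for c in criteria}


summary = summarize_ratings(ratings)
for criterion, value in summary.items():
    print(f"{criterion}: {value:.2f}")
```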

A simple template for collecting these judgments:

def human_evaluation_template():
    """Template for human evaluation of RAG results."""
    return """
Query: [USER QUERY]

Retrieved Documents:
1. [DOC1] - Relevance: [ ] Not relevant [ ] Marginal [ ] Relevant [ ] Highly relevant
2. [DOC2] - Relevance: [ ] Not relevant [ ] Marginal [ ] Relevant [ ] Highly relevant
...

Generated Answer: [LLM ANSWER]

1. Is the answer factually correct? [ ] No [ ] Partial [ ] Yes
2. Does the answer directly address the query? [ ] No [ ] Partial [ ] Yes
3. Is the answer helpful to a user? [ ] No [ ] Somewhat [ ] Yes

Overall Rating: [ ] Poor [ ] Fair [ ] Good [ ] Excellent
Comments: ___________
"""

Key Takeaways

  • Retrieval metrics (Precision@k, Recall, MRR, NDCG) measure whether you retrieved relevant documents.
  • Generation metrics (BLEU, ROUGE, BERTScore) measure whether the LLM synthesized answers well.
  • Automated metrics are fast but imperfect; combine with human evaluation for robust assessment.
  • Test on 100–500 queries with known answers; as rough starting targets, aim for Precision@5 > 0.7 and generation BLEU > 0.4, then calibrate them against human review.
  • Track metrics continuously in production to catch degradation (e.g., retrieval quality drops after knowledge base update).
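
One way to act on the last point is a simple regression check that compares current metrics with a stored baseline. This is a minimal sketch: the function name, the 0.05 tolerance, and all numbers are assumptions chosen for illustration.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose current value fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is not None and base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Hypothetical numbers: metrics before and after a knowledge base update.
baseline = {"precision@k": 0.74, "recall": 0.68, "bleu": 0.41}
current = {"precision@k": 0.61, "recall": 0.66, "bleu": 0.42}

for name, (before, after) in find_regressions(baseline, current).items():
    print(f"{name} dropped from {before:.2f} to {after:.2f}")
# precision@k dropped from 0.74 to 0.61
```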

Frequently Asked Questions

How many test cases do I need to trust my metrics?

Aim for 100–200 queries with ground-truth annotations for initial evaluation. For production monitoring, continuously evaluate on 10–20 random queries per day. Larger test sets reduce noise.
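
To see why larger test sets reduce noise, the sketch below computes the standard error of a mean metric for growing sample sizes. The per-query scores are simulated with a fixed seed, and the 1.96 multiplier is the usual normal approximation for a 95% interval. Treat it as an illustration rather than a rigorous power analysis.

```python
import math
import random
from statistics import mean, stdev


def mean_with_standard_error(scores):
    """Return (mean, standard error of the mean) for per-query scores."""
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Simulated per-query Precision@5 values (multiples of 0.2), fixed seed for repeatability.
rng = random.Random(0)
population = [rng.choice([0.2, 0.4, 0.6, 0.8, 1.0]) for _ in range(2000)]

for n in (20, 200, 2000):
    avg, se = mean_with_standard_error(population[:n])
    print(f"n={n:4d}  mean={avg:.3f}  +/- {1.96 * se:.3f} (approx. 95% interval)")
```

The interval shrinks roughly with the square root of the number of queries, so going from 20 to 200 queries narrows it by about a factor of three.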

Which metric is most important?

For RAG, retrieval quality is most important (garbage in, garbage out). Prioritize Precision@5 and Recall. Generation quality matters but is secondary.

Should I use BLEU or BERTScore?

BERTScore is better for paraphrases and synonyms. BLEU is faster and simpler. Use BERTScore for critical evaluations; use BLEU for quick feedback loops.

How do I evaluate on questions without ground-truth answers?

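If ground truth is unavailable, have humans annotate a small sample and treat the results as an estimate of production quality, accepting that the extrapolation is imperfect. Alternatively, fall back on reference-free heuristics (does the answer draw on the retrieved context? is it grammatical?).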

What if my metrics disagree with user feedback?

This is common. Users may value different attributes (speed > accuracy, for example). Collect user feedback alongside metrics; weight metrics to match user preferences.

Further Reading