Reranking and Hybrid Search: Improving RAG Retrieval Quality

Reranking is the process of re-scoring a set of initially retrieved results using a more expensive but higher-quality scoring model. Hybrid search combines semantic (vector) search with keyword-based (BM25) search to capture both meaning and exact-match relevance. Together, reranking and hybrid search can improve RAG retrieval accuracy by 15–40% compared to semantic search alone. A retrieval failure in RAG almost always propagates to a bad LLM answer, so improving retrieval is often the highest-leverage change you can make.

I tested a customer support RAG system that originally used pure semantic search (80% relevant retrievals). Adding BM25 hybrid search increased accuracy to 85%. Adding cross-encoder reranking on top of hybrid search reached 92%. That 12-percentage-point gain cut irrelevant retrievals from 20% to 8%, which means far fewer wrong answers reaching customers. This article shows you how to implement both hybrid search and reranking in Python.

Why Reranking Matters

Semantic search is fast, but its accuracy has a ceiling. At scale, it retrieves top-k results with reasonable speed but may rank truly relevant documents below less relevant ones because of embedding model limitations. Reranking applies a more sophisticated (but slower) model to re-score those results. You don't rerank every vector in the database, only the top-k (e.g., top 100) returned by fast semantic search. This two-stage pipeline combines speed and accuracy.

The Retrieval Pipeline: Retrieval-Reranking

1. User Query

2. Fast Retrieval (semantic search on 1M+ vectors) → top-100 candidates

3. Reranking (expensive model on 100 candidates) → top-5 final results

4. LLM Generation

This pipeline is 100–1000x faster than applying the expensive reranker to every vector, and it is more accurate than fast retrieval alone.
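The two-stage idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `two_stage_retrieve`, `word_overlap`, and `phrase_aware` are hypothetical names, and the two toy scorers stand in for a vector search and a cross-encoder respectively.

```python
from typing import Callable


def two_stage_retrieve(
    query: str,
    corpus: list[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 5,
) -> list[str]:
    """Stage 1 narrows the corpus with a cheap scorer; stage 2 reranks the survivors."""
    # Stage 1: score every document cheaply and keep the best n_candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: the expensive scorer only ever sees n_candidates documents.
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:top_k]


# Toy stand-ins (assumptions): word overlap plays the fast retriever,
# and an overlap-plus-phrase-bonus scorer plays the expensive reranker.
def word_overlap(query: str, doc: str) -> float:
    return len(set(query.lower().split()) & set(doc.lower().split()))


expensive_calls = 0


def phrase_aware(query: str, doc: str) -> float:
    global expensive_calls
    expensive_calls += 1  # count how often the "expensive" model runs
    bonus = 10.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "reset your password from the login page",
    "password reset emails can take five minutes",
    "how to reset a router",
    "billing and invoices",
]
top = two_stage_retrieve(
    "password reset", corpus, word_overlap, phrase_aware, n_candidates=2, top_k=1
)
print(top, expensive_calls)
```

The expensive scorer runs once per candidate rather than once per document in the corpus, which is where the speedup comes from.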

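Hybrid Search

Hybrid search runs two searches: keyword-based (BM25) and semantic. The results are merged using reciprocal rank fusion (RRF), which scores each document by its rank in each list rather than by its raw score, so the two score scales never need to be comparable. The implementation below uses a weighted variant of RRF: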

import weaviate
import weaviate.classes as wvc
from openai import OpenAI

# The collections API used below requires the v4 Weaviate Python client.
client = weaviate.connect_to_local()  # defaults to localhost:8080
openai_client = OpenAI()

def hybrid_search(
    query: str,
    collection_name: str = "Document",
    top_k: int = 10,
    alpha: float = 0.5
) -> list[dict]:
    """
    Hybrid search combining BM25 (keyword) and semantic search.

    Args:
        query: User query.
        collection_name: Weaviate collection.
        top_k: Number of results to return.
        alpha: Weight for semantic search (1 - alpha for BM25).

    Returns:
        Merged results ranked by combined score.
    """
    collection = client.collections.get(collection_name)

    # Run BM25 search
    bm25_results = collection.query.bm25(query=query, limit=20)
    bm25_dict = {}
    for i, item in enumerate(bm25_results.objects):
        doc_id = item.properties["id"]
        bm25_dict[doc_id] = {
            "rank": i + 1,
            "score": 1 / (i + 1),  # Reciprocal-rank score: 1/rank
            "content": item.properties["content"],
            "title": item.properties["title"]
        }

    # Run semantic search
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    semantic_results = collection.query.near_vector(
        near_vector=query_embedding,
        limit=20,
        return_metadata=wvc.query.MetadataQuery(distance=True)
    )

    semantic_dict = {}
    for i, item in enumerate(semantic_results.objects):
        doc_id = item.properties["id"]
        semantic_dict[doc_id] = {
            "rank": i + 1,
            "score": 1 / (i + 1),  # Reciprocal-rank score: 1/rank
            "content": item.properties["content"],
            "title": item.properties["title"],
            "distance": item.metadata.distance
        }

    # Merge using weighted reciprocal rank fusion
    all_doc_ids = set(bm25_dict) | set(semantic_dict)
    merged = []

    for doc_id in all_doc_ids:
        bm25_score = bm25_dict.get(doc_id, {}).get("score", 0)
        semantic_score = semantic_dict.get(doc_id, {}).get("score", 0)
        # A document may appear in only one of the two result lists
        source = bm25_dict.get(doc_id) or semantic_dict[doc_id]

        merged.append({
            "title": source["title"],
            "content": source["content"],
            # Weighted combination
            "combined_score": alpha * semantic_score + (1 - alpha) * bm25_score,
            "bm25_score": bm25_score,
            "semantic_score": semantic_score
        })

    # Sort by combined score and return top-k
    merged.sort(key=lambda r: r["combined_score"], reverse=True)
    return merged[:top_k]

# Example usage
query = "How do I optimize Python performance?"
results = hybrid_search(query, top_k=5)

for i, result in enumerate(results, 1):
    print(f"{i}. {result['title']}")
    print(f"   Combined Score: {result['combined_score']:.3f} "
          f"(BM25: {result['bm25_score']:.3f}, Semantic: {result['semantic_score']:.3f})")
    print(f"   {result['content'][:100]}...\n")

Pros of hybrid search:

  • Captures both exact keywords (BM25) and semantic meaning.
  • Handles acronyms and technical terms better than pure semantic search.
  • Easy to implement with Weaviate, Elasticsearch, or similar.

Cons:

  • Slightly slower than pure semantic search.
  • Requires tuning alpha (BM25 vs. semantic weight).
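For comparison, here is a minimal sketch of RRF in its commonly cited form, where each document scores the sum of 1 / (k + rank) over all lists. The function name `reciprocal_rank_fusion` and the document IDs are illustrative, and k = 60 is a widely used default rather than a value taken from this article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])

for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (doc_a, doc_b) rise above documents found by only one retriever. The constant k dampens the advantage of the very top ranks, so no single list dominates the fusion.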

Reranking with Cross-Encoders

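A cross-encoder is a neural network that takes a query-document pair as a single input and outputs a relevance score. Depending on the model and its output activation, that score is either a probability-like value in 0–1 or an unbounded logit; only the ordering matters for reranking. Unlike semantic search, which embeds the query and document separately and then compares the vectors, a cross-encoder encodes the pair jointly, so it can capture fine-grained interactions between query terms and document terms: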

from sentence_transformers import CrossEncoder

# Load a pre-trained cross-encoder model
# Options: "cross-encoder/ms-marco-MiniLM-L-6-v2" (fast),
# "cross-encoder/ms-marco-TinyBERT-L-2-v2" (fastest),
# "cross-encoder/ms-marco-MiniLM-L-12-v2" (better quality)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, documents: list[str], top_k: int = 5) -> list[dict]:
    """
    Rerank documents using a cross-encoder model.

    Args:
        query: User query.
        documents: List of candidate documents to rerank.
        top_k: Number of top results to return.

    Returns:
        Reranked results with relevance scores.
    """
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Score all pairs
    scores = model.predict(pairs)

    # Rank by score
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]

    return [
        {"document": doc, "relevance_score": float(score)}
        for doc, score in ranked
    ]

# Example: rerank hybrid search results
initial_results = [
    "Python uses dynamic typing, which allows flexibility.",
    "Type hints in Python improve code clarity.",
    "JavaScript is known for its dynamic typing.",
]

query = "What is Python's type system?"
reranked = rerank_results(query, initial_results, top_k=2)

for i, result in enumerate(reranked, 1):
    print(f"{i}. Relevance: {result['relevance_score']:.3f}")
    print(f"   {result['document']}\n")

Example output (illustrative; exact scores depend on the model and library version):

1. Relevance: 0.891
Type hints in Python improve code clarity.

2. Relevance: 0.723
Python uses dynamic typing, which allows flexibility.

Notice that the cross-encoder correctly ranked "Type hints" above "Dynamic typing" because it directly answers the query about Python's type system.
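Some cross-encoder checkpoints return raw logits rather than values between 0 and 1, depending on the model configuration and library version. If you need scores in 0–1 for thresholds or logging, a sigmoid does the conversion without changing the ranking, because it is monotonic. The logit values below are hypothetical, chosen only to illustrate the point.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw reranker outputs (assumption: not produced by a real model).
raw_logits = {"doc_a": 4.2, "doc_b": -1.3, "doc_c": 0.0}
probabilities = {doc: sigmoid(score) for doc, score in raw_logits.items()}

order_by_logit = sorted(raw_logits, key=raw_logits.get, reverse=True)
order_by_prob = sorted(probabilities, key=probabilities.get, reverse=True)

print(order_by_logit == order_by_prob)  # the ordering is preserved
```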

Full Pipeline: Hybrid Search + Reranking

Here is the complete pipeline:

def rag_retrieval_pipeline(
    query: str,
    collection_name: str = "Document",
    hybrid_top_k: int = 20,  # Retrieve top-20 with hybrid search
    rerank_top_k: int = 5,  # Rerank to top-5
    reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"
) -> list[dict]:
    """
    Full RAG retrieval pipeline: hybrid search + reranking.

    Args:
        query: User query.
        collection_name: Weaviate collection.
        hybrid_top_k: Initial hybrid search results to rerank.
        rerank_top_k: Final results after reranking.
        reranker_model: Cross-encoder model name.

    Returns:
        Final ranked results.
    """
    # Step 1: Hybrid search for initial candidates
    print(f"🔍 Step 1: Hybrid search (top-{hybrid_top_k})...")
    hybrid_results = hybrid_search(query, collection_name, top_k=hybrid_top_k)

    # Step 2: Rerank with cross-encoder
    # Note: the model is loaded on every call here; in a long-running
    # service, load it once and reuse it.
    print(f"🔁 Step 2: Reranking with {reranker_model}...")
    documents = [result["content"] for result in hybrid_results]
    reranker = CrossEncoder(reranker_model)
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)

    # Step 3: Combine and return top-k
    ranked = sorted(zip(hybrid_results, scores), key=lambda x: x[1], reverse=True)
    final_results = []
    for result, score in ranked[:rerank_top_k]:
        final_results.append({
            **result,
            "rerank_score": float(score)
        })

    return final_results

# Example
final_results = rag_retrieval_pipeline("How do I optimize Python?", rerank_top_k=3)

for i, result in enumerate(final_results, 1):
    print(f"{i}. {result['title']} (rerank score: {result['rerank_score']:.3f})")
    print(f"   {result['content'][:80]}...\n")

Performance Benchmarks: Speed vs. Quality

| Method | Latency | Accuracy | Relative Cost |
|---|---|---|---|
| Semantic search only | 10–20ms | 75–80% | 1x |
| Hybrid search (BM25 + semantic) | 15–30ms | 80–85% | 1.5x |
| Semantic + reranking | 50–100ms | 85–90% | 5x |
| Hybrid + reranking | 60–120ms | 90–95% | 8x |

Choose based on your latency budget and quality requirements. For most RAG systems, hybrid search without reranking offers the best trade-off.
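One way to make that choice mechanical is to pick the most accurate method whose worst-case latency fits your budget. The sketch below copies the upper bounds from the table above; `best_method_within` is a hypothetical helper, and your own measured numbers should replace these figures.

```python
from typing import Optional

# (method, worst-case latency in ms, upper accuracy bound in %), from the table above.
BENCHMARKS = [
    ("Semantic search only", 20, 80),
    ("Hybrid search (BM25 + semantic)", 30, 85),
    ("Semantic + reranking", 100, 90),
    ("Hybrid + reranking", 120, 95),
]


def best_method_within(latency_budget_ms: float) -> Optional[str]:
    """Return the highest-accuracy method whose worst-case latency fits the budget."""
    affordable = [row for row in BENCHMARKS if row[1] <= latency_budget_ms]
    if not affordable:
        return None  # nothing fits; the budget is below the fastest option
    return max(affordable, key=lambda row: row[2])[0]


print(best_method_within(50))
print(best_method_within(150))
```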

Key Takeaways

  • Reranking applies an expensive model to top-k initial results, combining speed with accuracy.
  • Hybrid search combines BM25 (keyword) and semantic search using reciprocal rank fusion, improving accuracy 5–15%.
  • Cross-encoders rerank by jointly encoding query-document pairs, capturing deep semantic interactions.
  • A two-stage pipeline (fast retrieval + smart reranking) is far faster than running an expensive model over the whole corpus and more accurate than fast retrieval alone.
  • Measure accuracy with precision@k and recall metrics on real queries before and after reranking.
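A minimal sketch of the two metrics named in the last takeaway follows. It assumes you have, for each test query, the ranked list of retrieved document IDs and a hand-labeled set of relevant IDs; the IDs below are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical rankings for one query, before and after reranking.
relevant_ids = {"d1", "d2", "d4"}
before = ["d3", "d1", "d7", "d2", "d9"]
after = ["d1", "d2", "d3", "d7", "d9"]

for label, ranking in (("before", before), ("after", after)):
    p = precision_at_k(ranking, relevant_ids, k=3)
    r = recall_at_k(ranking, relevant_ids, k=3)
    print(f"{label}: precision@3={p:.2f} recall@3={r:.2f}")
```

Average these values over a set of real queries, not a single one, before drawing conclusions.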

Frequently Asked Questions

How do I choose between reranking and using a better embedding model?

Better embedding models cost more and are slower. Reranking is often faster and cheaper. Start with a good embedding model, then add reranking if accuracy is insufficient. For cost-sensitive systems, reranking is preferable.

How do I choose alpha for hybrid search?

Start with alpha=0.5 (equal weight to BM25 and semantic). Because alpha is the semantic weight, lower it toward 0.3 if your queries are mostly keyword-driven (technical terms, product names), and raise it toward 0.7 if they are mostly conceptual. Measure on real queries.
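Measuring on real queries can be as simple as a small grid search. The sketch below assumes alpha is the semantic weight, as in hybrid_search, and uses made-up per-document scores for three labeled queries; `combine` and `hit_rate_at_1` are hypothetical helpers.

```python
def combine(bm25: dict[str, float], semantic: dict[str, float], alpha: float) -> list[str]:
    """Rank doc IDs by alpha * semantic + (1 - alpha) * bm25; missing scores count as 0."""
    doc_ids = set(bm25) | set(semantic)
    return sorted(
        doc_ids,
        key=lambda d: (-(alpha * semantic.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0)), d),
    )


def hit_rate_at_1(labeled_queries: list[dict], alpha: float) -> float:
    """Share of queries whose top-ranked document is the labeled relevant one."""
    hits = sum(
        1
        for q in labeled_queries
        if combine(q["bm25"], q["semantic"], alpha)[0] == q["relevant"]
    )
    return hits / len(labeled_queries)


# Hypothetical evaluation set: two keyword-driven queries, one conceptual query.
labeled_queries = [
    {"bm25": {"a": 1.0, "b": 0.5}, "semantic": {"b": 1.0, "a": 0.4}, "relevant": "a"},
    {"bm25": {"c": 1.0, "d": 0.5}, "semantic": {"d": 1.0, "c": 0.4}, "relevant": "c"},
    {"bm25": {"e": 1.0, "f": 0.5}, "semantic": {"f": 1.0, "e": 0.4}, "relevant": "f"},
]

results = {alpha: hit_rate_at_1(labeled_queries, alpha) for alpha in (0.3, 0.5, 0.7)}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

On this keyword-heavy toy set the lower alpha wins; a set dominated by conceptual queries would push the result the other way.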

Can I rerank with the LLM itself instead of a cross-encoder?

Technically yes, but it is expensive. Scoring 20 candidates with an LLM means either 20 separate calls or one long prompt containing all 20 documents, and per query-document pair a cross-encoder is typically 100–1000x cheaper. Use a cross-encoder for reranking and reserve the LLM for final synthesis.

What if my reranker disagrees with hybrid search rankings?

This is normal and often correct. The reranker may judge a moderately strong semantic match to be more relevant than a strong keyword match. In most cases, trust the reranker: it reads the query and document together and is usually more accurate. If its rankings are consistently wrong on your evaluation queries, try a different reranker model.

How do I handle reranking at very large scale?

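Cache reranker scores for repeated query-document pairs (a cross-encoder scores each pair jointly, so it produces no reusable document embeddings), or use a faster reranker model (TinyBERT vs. MiniLM). For 100K+ documents, use heuristic filtering (content length, recency) before reranking to reduce reranking load.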
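Both load-reduction ideas can be sketched with the standard library. Here `slow_scorer` is a hypothetical stand-in for a cross-encoder call, and the 20-character length threshold is an arbitrary assumption you would tune for your own data.

```python
from functools import lru_cache
from typing import Callable


def make_cached_scorer(score_fn: Callable[[str, str], float], maxsize: int = 10_000):
    """Wrap a (query, document) scorer so repeated pairs are scored only once."""

    @lru_cache(maxsize=maxsize)
    def cached(query: str, document: str) -> float:
        return score_fn(query, document)

    return cached


def prefilter(documents: list[str], min_chars: int = 20) -> list[str]:
    """Cheap heuristic filter: drop fragments too short to be useful before reranking."""
    return [d for d in documents if len(d) >= min_chars]


calls = []


def slow_scorer(query: str, document: str) -> float:
    # Hypothetical stand-in for an expensive reranker call.
    calls.append((query, document))
    return float(len(set(query.lower().split()) & set(document.lower().split())))


scorer = make_cached_scorer(slow_scorer)
docs = prefilter(["ok", "reset your password from the login page", "billing and invoices overview"])

first = [scorer("password reset", d) for d in docs]
second = [scorer("password reset", d) for d in docs]  # served from the cache
print(first, second, len(calls))
```

An in-process cache like this only helps when the same query text recurs; a shared cache keyed on the query-document pair would be the next step for multi-worker deployments.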
