Building Local Embeddings and Semantic Search
Embeddings convert text into dense numerical vectors that capture semantic meaning. Two texts with similar meaning have embeddings close in vector space. FAISS (Facebook AI Similarity Search) indexes these embeddings, enabling blazingly fast similarity search over millions of documents without cloud APIs.
This tutorial covers generating embeddings locally using Sentence Transformers, building FAISS indexes, similarity search, and retrieval-augmented generation (RAG). By the end, you'll build a semantic search engine that finds relevant documents in milliseconds.
Understanding Embeddings
An embedding is a vector of numbers representing the meaning of text. Sentence Transformers encode text into 384–1024 dimensional vectors where semantically similar texts cluster together:
Text: "I love playing basketball"
Embedding: [-0.234, 0.891, -0.123, ..., 0.456] # 384 numbers
Text: "I enjoy basketball games"
Embedding: [-0.241, 0.887, -0.119, ..., 0.459] # Similar to above
Semantic similarity is usually measured with cosine similarity (higher means more alike) or with a distance such as Euclidean (L2) distance (lower means more alike). If two texts are about the same topic, their embeddings are close: high similarity, low distance.
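To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
basketball_1 = [0.9, 0.1, 0.2]
basketball_2 = [0.8, 0.2, 0.2]
cooking = [0.1, 0.9, 0.3]

sim_close = cosine_similarity(basketball_1, basketball_2)
sim_far = cosine_similarity(basketball_1, cooking)

# Cosine distance is commonly defined as 1 - cosine similarity
print(f"similar pair:   similarity={sim_close:.3f}, distance={1 - sim_close:.3f}")
print(f"unrelated pair: similarity={sim_far:.3f}, distance={1 - sim_far:.3f}")
```

The pair pointing in nearly the same direction scores close to 1, while the unrelated pair scores much lower.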
Installing and Loading Sentence Transformers
Sentence Transformers provide pre-trained models optimized for semantic similarity:
pip install sentence-transformers faiss-cpu
# For GPU: pip install faiss-gpu (requires CUDA)
Load a model:
from sentence_transformers import SentenceTransformer
# Load a small, fast model (about 22M parameters, 384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")
# Or a larger, more accurate model (110M parameters, 768 dimensions)
# model = SentenceTransformer("all-mpnet-base-v2")
# Encode a single sentence
sentence = "Python is a great programming language."
embedding = model.encode(sentence)
print(f"Embedding shape: {embedding.shape}") # (384,)
print(f"Embedding (first 5 values): {embedding[:5]}")
Model choice trade-offs:
- all-MiniLM-L6-v2: Fast (1–2 ms per sentence), decent quality
- all-mpnet-base-v2: Slower (5–10 ms), highest quality
- all-distilroberta-v1: Balance of speed and quality
Building a FAISS Index
FAISS creates an index for fast nearest-neighbor search:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
# Sample documents
documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]
# Encode all documents
embeddings = model.encode(documents, convert_to_numpy=True)
embeddings = np.asarray(embeddings).astype('float32')
# Create FAISS index (L2 distance: Euclidean)
dimension = embeddings.shape[1] # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)
index.add(embeddings) # Add embeddings to index
# Save index for later use
faiss.write_index(index, "documents.index")
print(f"Index created with {index.ntotal} documents")
FAISS index types:
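- IndexFlatL2: Brute-force L2 (Euclidean) distance; exact, but slow for >1M vectors
- IndexFlatIP: Brute-force inner product; equals cosine similarity when the vectors are L2-normalized
- IndexIVFFlat: Inverted file; approximate, needs training before vectors are added, fast for >100K vectors
- IndexHNSWFlat: Hierarchical Navigable Small World graph; approximate, very fast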
For < 100K documents, IndexFlatL2 is fine.
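The choice between an L2 index and an inner-product index matters less than it may seem once vectors are normalized. The NumPy sketch below, using random stand-in vectors rather than real embeddings, checks that the inner product of unit vectors equals cosine similarity, and that squared L2 distance between unit vectors equals 2 − 2·cosine, so both metrics rank neighbors identically. FAISS offers `faiss.normalize_L2` for in-place normalization; plain NumPy is used here to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8)).astype("float32")  # 5 stand-in "embeddings"
query = rng.normal(size=(8,)).astype("float32")

# L2-normalize so every vector has length 1
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

# Score an inner-product index would assign to each (normalized) vector
inner_product = unit_vectors @ unit_query
# Cosine similarity computed directly from the unnormalized vectors
cosine = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
# Squared L2 distance between the normalized vectors and the normalized query
squared_l2 = ((unit_vectors - unit_query) ** 2).sum(axis=1)

print(np.allclose(inner_product, cosine, atol=1e-5))
print(np.allclose(squared_l2, 2 - 2 * cosine, atol=1e-5))
# Highest inner product first should match lowest distance first
print(np.array_equal(np.argsort(-inner_product), np.argsort(squared_l2)))
```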
Similarity Search
Query the index to find similar documents:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
# Load index
index = faiss.read_index("documents.index")
documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]
# Query
query = "Machine learning with Python"
query_embedding = model.encode([query], convert_to_numpy=True)
query_embedding = np.asarray(query_embedding).astype('float32')
# Search (k=2 nearest neighbors)
distances, indices = index.search(query_embedding, k=2)
print(f"Query: {query}")
print(f"Top matches:")
for idx, distance in zip(indices[0], distances[0]):
    # IndexFlatL2 returns squared L2 distances; map them to a (0, 1] score (a heuristic, not cosine similarity)
    similarity = 1 / (1 + distance)
    print(f"  {documents[idx]} (similarity: {similarity:.2f})")
Example output (exact scores depend on the model version):
Query: Machine learning with Python
Top matches:
  Machine learning uses Python extensively. (similarity: 0.95)
  Python is used for web development. (similarity: 0.78)
The k parameter controls how many neighbors to return. Use a larger k (for example 5–10) when you rank or rerank the results yourself, and a small k (1–3) when only the closest matches matter.
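To show what a flat L2 search computes and what shape its results take, here is a brute-force NumPy sketch. `brute_force_search` is a hypothetical helper written for illustration, and the 2-D vectors are toy data. Like the FAISS call above, it returns one row of distances and one row of indices per query, each of length k.

```python
import numpy as np

def brute_force_search(index_vectors, queries, k):
    """Return (squared L2 distances, row indices) of the k nearest rows per query."""
    # Pairwise squared distances, shape (num_queries, num_vectors)
    diffs = queries[:, None, :] - index_vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Sort each row ascending and keep the first k columns
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-D vectors (hypothetical data, not real embeddings)
index_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = brute_force_search(index_vectors, queries, k=2)
print(indices)    # row 1 is nearest, then row 0
print(distances)  # squared L2 distance to each of those rows
```

This compares the query against every stored vector, which is why exact search cost grows linearly with collection size.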
Retrieval-Augmented Generation (RAG)
Combine local embeddings with an LLM to answer questions using your documents:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load models
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", torch_dtype=torch.float16)
llm_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
# Load index and documents
index = faiss.read_index("documents.index")
documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]
# RAG pipeline
def rag_query(question, k=2):
    # Retrieve relevant documents
    query_embedding = embedding_model.encode([question], convert_to_numpy=True).astype('float32')
    distances, indices = index.search(query_embedding, k)
    # Build context from top-k documents
    context = "\n".join(documents[idx] for idx in indices[0])
    # Construct LLM prompt
    prompt = f"""Context:
{context}

Question: {question}
Answer:"""
    # Tokenize and move the inputs to the same device as the model
    inputs = llm_tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        # max_new_tokens limits only the generated text; temperature takes effect only with do_sample=True
        output_ids = llm.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    answer = llm_tokenizer.decode(new_tokens, skip_special_tokens=True)
    return answer.strip()
# Example query
result = rag_query("What languages are used for machine learning?")
print(result)
RAG improves LLM answers by grounding them in actual documents, reducing hallucinations and enabling domain-specific knowledge without fine-tuning.
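Grounding depends heavily on how the retrieved passages are placed in the prompt. The sketch below is one possible prompt builder, not a standard API: `build_rag_prompt` and its `max_context_chars` budget are hypothetical names chosen for illustration. It numbers each passage so the answer can point back to its source and stops adding passages once the budget is used up.

```python
def build_rag_prompt(question, retrieved_docs, max_context_chars=2000):
    """Assemble a grounded prompt from retrieved passages (hypothetical helper)."""
    lines = []
    used = 0
    for number, doc in enumerate(retrieved_docs, start=1):
        entry = f"[{number}] {doc}"
        # Stop adding passages once the context budget is exhausted
        if used + len(entry) > max_context_chars:
            break
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What languages are used for machine learning?",
    ["Machine learning uses Python extensively.", "Python is used for web development."],
)
print(prompt)
```

A character budget is a crude proxy for the model's token limit; counting tokens with the LLM's tokenizer would be more precise.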
Storing and Loading Embeddings
For large document collections, save embeddings to avoid recomputing:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
"Text 1",
"Text 2",
# ... up to millions of documents
]
# Compute embeddings
embeddings = model.encode(documents, batch_size=32, convert_to_numpy=True).astype('float32')
# Save embeddings and documents
np.save("embeddings.npy", embeddings)
# One document per line; this assumes no document contains a newline
with open("documents.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc + "\n")
# Save FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
faiss.write_index(index, "documents.index")
# Load later
embeddings = np.load("embeddings.npy")
with open("documents.txt", "r", encoding="utf-8") as f:
    documents = [line.rstrip("\n") for line in f]
index = faiss.read_index("documents.index")
For 1M+ documents, use a more scalable format like SQLite or Postgres with pgvector.
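As a minimal sketch of the SQLite option, the example below stores each document under its position in the list, which is also its row number in a FAISS index built from the same list. The table name `docs` and the helper `fetch_documents` are assumptions made for illustration. Unlike a one-document-per-line text file, this handles documents that contain line breaks.

```python
import sqlite3

documents = ["Python is used for web development.", "A document\nwith a line break."]

conn = sqlite3.connect(":memory:")  # use a file path such as "documents.db" to persist
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
# Store each document under its position, which is also its row number in the index
conn.executemany(
    "INSERT INTO docs (id, text) VALUES (?, ?)",
    list(enumerate(documents)),
)
conn.commit()

def fetch_documents(conn, ids):
    """Look up documents by the integer positions a FAISS search returns."""
    found = []
    for doc_id in ids:
        row = conn.execute("SELECT text FROM docs WHERE id = ?", (int(doc_id),)).fetchone()
        # Unknown ids (FAISS uses -1 when it finds fewer than k neighbors) map to None
        found.append(row[0] if row else None)
    return found

print(fetch_documents(conn, [1, 0]))
```

In a search loop, `fetch_documents(conn, indices[0])` would replace the in-memory `documents[idx]` lookup.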
Batch Encoding for Speed
Encoding thousands of documents is slow when done one-by-one. Use batch processing:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["Doc " + str(i) for i in range(10000)]
# Batch encoding (128 documents per forward pass)
embeddings = model.encode(
    documents,
    batch_size=128,  # Process 128 docs simultaneously
    show_progress_bar=True,
    convert_to_numpy=True
).astype('float32')
print(f"Encoded {len(embeddings)} documents; embedding matrix shape: {embeddings.shape}")
Batching speeds up encoding by 5–10× compared with encoding one document at a time (even more on a GPU).
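To see what batch_size means in practice, the sketch below splits the same 10,000 documents into consecutive groups. `iter_batches` is a hypothetical helper for illustration; the library's own batching may differ in detail, for example in how it orders the inputs.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = ["Doc " + str(i) for i in range(10000)]
batches = list(iter_batches(documents, batch_size=128))

print(len(batches))      # → 79 model forward passes instead of 10,000
print(len(batches[-1]))  # → 16 documents in the final, partial batch
```

Fewer, larger forward passes are where the speedup comes from; the practical upper bound on batch_size is available memory.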
Comparison Table: Embedding Models
| Model | Dimensions | Speed (sents/sec) | Quality | VRAM | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 500 | Good | 1 GB | Speed |
| all-mpnet-base-v2 | 768 | 150 | Best | 2 GB | Accuracy |
| all-distilroberta-v1 | 768 | 250 | Good | 1.5 GB | Balance |
| bge-small-en | 384 | 400 | Good | 1 GB | Domain-specific (English) |
Key Takeaways
- Embeddings convert text to vectors; semantic similarity = closeness in vector space.
- FAISS indexes embeddings for fast retrieval (milliseconds on millions of documents).
- Sentence Transformers provide pre-trained, efficient models for semantic search.
- RAG combines embeddings with LLMs to answer questions grounded in documents.
- Batch encoding speeds up large-scale embedding generation by 5–10×.
Frequently Asked Questions
How do I choose embedding model dimensions?
Larger dimensions (768–1024) are slightly more accurate but slower. Smaller dimensions (384) are faster with minimal quality loss. For most use cases, 384 is fine.
Can I use FAISS with GPU?
Yes. The faiss-gpu package (pip install faiss-gpu) runs searches on CUDA GPUs and can be 10–100× faster for large indexes (>1M vectors). Install faiss-cpu for development, faiss-gpu for production.
How do I update documents in a FAISS index?
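You can append vectors to an existing index at any time with index.add(), and many index types (including the flat and IVF indexes) support remove_ids(). FAISS has no in-place update, so changing a document means removing its old vector and adding the new one; wrapping the index in IndexIDMap gives each vector a stable ID for this. If you need frequent updates or metadata filtering, use a vector database such as Weaviate, Milvus, or Pinecone.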
What's the maximum index size FAISS supports?
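The practical limit is memory. IndexFlatL2 stores every vector uncompressed, so 10M 384-dimensional float32 vectors need about 15 GB (10M × 384 × 4 bytes). For larger collections, use IndexIVFFlat for faster search, a compressed index type to save memory, or a vector database.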
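A rough way to size a flat index, assuming uncompressed float32 storage and ignoring any index overhead, is vectors × dimensions × 4 bytes. `flat_index_bytes` below is a hypothetical helper, not a FAISS function.

```python
def flat_index_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for uncompressed float32 vectors (ignores index overhead)."""
    return num_vectors * dimension * bytes_per_value

for n in (100_000, 1_000_000, 10_000_000):
    gib = flat_index_bytes(n, 384) / 1024**3
    print(f"{n:>10,} vectors x 384 dims = {gib:.2f} GiB")
```

Doubling the embedding dimension (for example, moving from a 384- to a 768-dimensional model) doubles these figures.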
Do embeddings work across languages?
Most Sentence Transformers are English-only. Use multilingual models like paraphrase-multilingual-MiniLM-L12-v2 for cross-language search.
Further Reading
- Sentence Transformers Documentation — Official models and examples.
- FAISS GitHub — Index types, benchmarks, and GPU setup.
- RAG Survey — Overview of retrieval-augmented generation.
- Embeddings Paper — How Sentence Transformers work.