
Building Local Embeddings and Semantic Search

Embeddings convert text into dense numerical vectors that capture semantic meaning. Two texts with similar meaning have embeddings that lie close together in vector space. FAISS (Facebook AI Similarity Search) indexes these embeddings so that similarity search over millions of documents runs locally and quickly, without cloud APIs.

This tutorial covers generating embeddings locally with Sentence Transformers, building FAISS indexes, running similarity search, and retrieval-augmented generation (RAG). By the end, you'll have built a semantic search engine that finds relevant documents in milliseconds.

Understanding Embeddings

An embedding is a vector of numbers representing the meaning of text. Sentence Transformers encode text into 384–1024 dimensional vectors where semantically similar texts cluster together:

Text: "I love playing basketball"
Embedding: [-0.234, 0.891, -0.123, ..., 0.456] # 384 numbers

Text: "I enjoy basketball games"
Embedding: [-0.241, 0.887, -0.119, ..., 0.459] # Similar to above

Closeness between embeddings, usually measured with cosine similarity, quantifies semantic similarity. If two texts are about the same topic, their embeddings are close: high similarity, low distance.
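As a minimal sketch of how cosine similarity is computed, the snippet below uses NumPy on three made-up 4-dimensional vectors. The vectors and the helper name cosine_similarity are assumptions for illustration only; real embeddings come from a model and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", invented for this example.
basketball_a = np.array([0.9, 0.1, 0.0, 0.2])
basketball_b = np.array([0.8, 0.2, 0.1, 0.2])
cooking = np.array([0.0, 0.9, 0.4, 0.0])

sim_related = cosine_similarity(basketball_a, basketball_b)
sim_unrelated = cosine_similarity(basketball_a, cooking)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two vectors that point in nearly the same direction score close to 1, while the unrelated pair scores close to 0.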

Installing and Loading Sentence Transformers

Sentence Transformers provide pre-trained models optimized for semantic similarity:

pip install sentence-transformers faiss-cpu
# For GPU: pip install faiss-gpu (requires CUDA)

Load a model:

from sentence_transformers import SentenceTransformer

# Load a small, fast model (about 22M parameters, 384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Or a larger, more accurate model (110M parameters, 768 dimensions)
# model = SentenceTransformer("all-mpnet-base-v2")

# Encode a single sentence
sentence = "Python is a great programming language."
embedding = model.encode(sentence)

print(f"Embedding shape: {embedding.shape}") # (384,)
print(f"Embedding (first 5 values): {embedding[:5]}")

Model choice trade-offs:

  • all-MiniLM-L6-v2 — Fast (1–2 ms per sentence), decent quality
  • all-mpnet-base-v2 — Slower (5–10 ms), highest quality
  • all-distilroberta-v1 — Balance of speed and quality

Building a FAISS Index

FAISS creates an index for fast nearest-neighbor search:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample documents
documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]

# Encode all documents
embeddings = model.encode(documents, convert_to_numpy=True)
embeddings = np.asarray(embeddings).astype('float32')

# Create FAISS index (L2 distance: Euclidean)
dimension = embeddings.shape[1] # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dimension)
index.add(embeddings) # Add embeddings to index

# Save index for later use
faiss.write_index(index, "documents.index")
print(f"Index created with {index.ntotal} documents")

FAISS index types:

  • IndexFlatL2 — Brute-force L2 (Euclidean) distance; exact, but slow beyond roughly 1M vectors
  • IndexFlatIP — Brute-force inner product; equals cosine similarity when vectors are L2-normalized
  • IndexIVFFlat — Inverted file index; approximate, needs a training step, fast for >100K vectors
  • IndexHNSWFlat — Hierarchical Navigable Small World graph; approximate and very fast

For < 100K documents, IndexFlatL2 is fine.
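To make "brute-force" concrete, here is a rough NumPy sketch of what an exact flat L2 search computes: the squared distance from each query to every stored vector, followed by a top-k selection. The function name flat_l2_search and the random stand-in data are assumptions for illustration; this is not FAISS code, and FAISS performs the same computation far more efficiently.

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Exact k-nearest-neighbor search by squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

rng = np.random.default_rng(0)
vectors = rng.random((1000, 8), dtype=np.float32)  # stand-in for embeddings
queries = vectors[[5, 42]]                         # query with two stored rows

distances, indices = flat_l2_search(vectors, queries, k=3)
print(indices[:, 0])    # each query's nearest neighbor is its own row
print(distances[:, 0])  # at squared distance 0
```

Because every query is compared against every stored vector, cost grows linearly with collection size, which is why approximate indexes become attractive for large collections.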

Query the index to find similar documents:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Load index
index = faiss.read_index("documents.index")

documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]

# Query
query = "Machine learning with Python"
query_embedding = model.encode([query], convert_to_numpy=True)
query_embedding = np.asarray(query_embedding).astype('float32')

# Search (k=2 nearest neighbors)
distances, indices = index.search(query_embedding, k=2)

print(f"Query: {query}")
print("Top matches:")
for idx, distance in zip(indices[0], distances[0]):
    # IndexFlatL2 returns squared L2 distances; map them to a (0, 1] score
    similarity = 1 / (1 + distance)
    print(f" {documents[idx]} (similarity: {similarity:.2f})")

Output (scores are illustrative; exact values depend on the model version):

Query: Machine learning with Python
Top matches:
 Machine learning uses Python extensively. (similarity: 0.95)
 Python is used for web development. (similarity: 0.78)

The k parameter controls how many neighbors are returned. As a rough guide, return the top 5 when you want a ranked list, and only the top 2 when you are filtering down to the closest matches.
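The 1 / (1 + distance) score in the query example is a heuristic rather than a cosine similarity. If the embeddings are unit-length (an assumption you should verify for your model, for example by checking their norms), a squared L2 distance d² converts exactly to cosine similarity as 1 − d²/2. The sketch below checks that identity with NumPy on two hand-picked vectors; the helper name l2_squared_to_cosine is hypothetical.

```python
import numpy as np

def l2_squared_to_cosine(sq_dist: float) -> float:
    """Convert squared L2 distance between UNIT vectors to cosine similarity."""
    return 1.0 - sq_dist / 2.0

# Two hand-picked vectors, normalized to unit length.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

sq_dist = float(np.sum((a - b) ** 2))  # squared L2 distance
cosine_direct = float(a @ b)           # cosine similarity computed directly
cosine_from_l2 = l2_squared_to_cosine(sq_dist)

print(f"{cosine_direct:.4f} {cosine_from_l2:.4f}")
```

Both numbers agree, so for normalized embeddings the ranking produced by L2 distance and by cosine similarity is the same.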

Retrieval-Augmented Generation (RAG)

Combine local embeddings with an LLM to answer questions using your documents:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load models
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Official Mistral repository on the Hugging Face Hub (may require accepting the license)
llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", torch_dtype=torch.float16)
llm_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Load index and documents
index = faiss.read_index("documents.index")
documents = [
"Python is used for web development.",
"JavaScript powers interactive websites.",
"Java is popular for enterprise applications.",
"Machine learning uses Python extensively.",
"React is a popular JavaScript library."
]

# RAG pipeline
def rag_query(question, k=2):
    # Retrieve relevant documents
    query_embedding = embedding_model.encode([question], convert_to_numpy=True).astype('float32')
    distances, indices = index.search(query_embedding, k=k)

    # Build context from top-k documents
    context = "\n".join([documents[idx] for idx in indices[0]])

    # Construct LLM prompt
    prompt = f"""Context:
{context}

Question: {question}

Answer:"""

    # Generate response; inputs must be on the same device as the model
    inputs = llm_tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        # max_new_tokens bounds only the answer; temperature needs do_sample=True
        output_ids = llm.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    answer = llm_tokenizer.decode(new_tokens, skip_special_tokens=True)

    return answer

# Example query
result = rag_query("What languages are used for machine learning?")
print(result)

RAG improves LLM answers by grounding them in actual documents, reducing hallucinations and enabling domain-specific knowledge without fine-tuning.
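The prompt-assembly step can be isolated and tested without loading any model. The sketch below is one possible variant, not the pipeline above: the function name build_rag_prompt, the passage numbering, the max_chars budget, and the instruction wording are all assumptions made for illustration.

```python
def build_rag_prompt(question: str, retrieved: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt from retrieved passages, numbering each one."""
    context_lines = []
    used = 0
    for i, passage in enumerate(retrieved, start=1):
        line = f"[{i}] {passage.strip()}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )

retrieved = [
    "Machine learning uses Python extensively.",
    "Python is used for web development.",
]
prompt = build_rag_prompt("What languages are used for machine learning?", retrieved)
print(prompt)
```

Numbering the passages makes it easier to ask the model to cite which passage supports its answer, and the character budget is a crude guard against overflowing the model's context window.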

Storing and Loading Embeddings

For large document collections, save embeddings to avoid recomputing:

import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
"Text 1",
"Text 2",
# ... up to millions of documents
]

# Compute embeddings
embeddings = model.encode(documents, batch_size=32, convert_to_numpy=True).astype('float32')

# Save embeddings and documents
np.save("embeddings.npy", embeddings)
# One document per line (assumes documents contain no embedded newlines)
with open("documents.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(doc + "\n")

# Save FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
faiss.write_index(index, "documents.index")

# Load later
embeddings = np.load("embeddings.npy")
with open("documents.txt", "r", encoding="utf-8") as f:
    documents = [line.strip() for line in f]
index = faiss.read_index("documents.index")

For 1M+ documents, move to a more scalable store: keep the document text in a database such as SQLite, or keep text and vectors together in Postgres with pgvector.
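As a rough sketch of the SQLite option, the snippet below stores document text keyed by its position in the FAISS index, then looks up texts for a list of ids such as the ones index.search returns. The table name docs, the column names, and the helper fetch_documents are assumptions for illustration; an in-memory database is used so the example runs without creating files.

```python
import sqlite3

documents = [
    "Python is used for web development.",
    "JavaScript powers interactive websites.",
    "Java is popular for enterprise applications.",
    "Machine learning uses Python extensively.",
    "React is a popular JavaScript library.",
]

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute("CREATE TABLE docs (faiss_id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
# Row i in the FAISS index corresponds to faiss_id = i here.
conn.executemany(
    "INSERT INTO docs (faiss_id, text) VALUES (?, ?)",
    list(enumerate(documents)),
)
conn.commit()

def fetch_documents(conn, ids):
    """Return texts for the given ids, preserving the order of `ids`."""
    ids = [int(i) for i in ids]
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT faiss_id, text FROM docs WHERE faiss_id IN ({placeholders})", ids
    ).fetchall()
    by_id = dict(rows)
    # Skip ids with no row (FAISS pads missing results with -1).
    return [by_id[i] for i in ids if i in by_id]

top = fetch_documents(conn, [3, 0])
print(top)
```

This way only the vectors need to fit in memory; the text is read from disk for the handful of results each query returns.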

Batch Encoding for Speed

Encoding thousands of documents is slow when done one-by-one. Use batch processing:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["Doc " + str(i) for i in range(10000)]

# Batch encoding (128 documents per forward pass)
embeddings = model.encode(
    documents,
    batch_size=128,  # Process 128 docs simultaneously
    show_progress_bar=True,
    convert_to_numpy=True
).astype('float32')

print(f"Encoded {len(embeddings)} documents; array shape: {embeddings.shape}")

Batching speeds up encoding by 5–10× compared with encoding one document at a time (and by even more on a GPU).
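Conceptually, a batch size splits the input into consecutive slices that are processed one slice at a time. The helper below is a hypothetical illustration of that slicing in plain Python, not the library's internal implementation, which is more involved.

```python
from typing import Iterator, List, Sequence

def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

documents = ["Doc " + str(i) for i in range(10000)]
batch_sizes = [len(batch) for batch in iter_batches(documents, 128)]

print(f"{len(batch_sizes)} batches, last batch has {batch_sizes[-1]} documents")
```

With 10,000 documents and a batch size of 128, the final batch is smaller than the rest, which is normal and needs no special handling.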

Comparison Table: Embedding Models

Model                 Dimensions  Speed (sentences/sec)  Quality  VRAM    Best For
all-MiniLM-L6-v2      384         500                    Good     1 GB    Speed
all-mpnet-base-v2     768         150                    Best     2 GB    Accuracy
all-distilroberta-v1  768         250                    Good     1.5 GB  Balance
bge-small-en          384         400                    Good     1 GB    Domain-specific (English)

Key Takeaways

  • Embeddings convert text to vectors; semantic similarity = closeness in vector space.
  • FAISS indexes embeddings for fast retrieval (milliseconds on millions of documents).
  • Sentence Transformers provide pre-trained, efficient models for semantic search.
  • RAG combines embeddings with LLMs to answer questions grounded in documents.
  • Batch encoding speeds up large-scale embedding generation by 5–10×.

Frequently Asked Questions

How do I choose embedding model dimensions?

Larger dimensions (768–1024) are slightly more accurate but slower. Smaller dimensions (384) are faster with minimal quality loss. For most use cases, 384 is fine.

Can I use FAISS with GPU?

Yes. The faiss-gpu package (pip install faiss-gpu) runs searches on CUDA GPUs. GPU FAISS can be 10–100× faster for large indexes (>1M vectors). Install faiss-cpu for development, faiss-gpu for production.

How do I update documents in a FAISS index?

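FAISS has no in-place update for stored vectors. index.add() appends new vectors, and some index types support remove_ids(), but changing a document means removing and re-adding its vector or rebuilding the index. For frequent updates, use a vector database such as Weaviate, Milvus, or Pinecone.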

What's the maximum index size FAISS supports?

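FAISS imposes no fixed vector limit; a flat index is bounded by memory, because IndexFlatL2 stores every vector uncompressed as float32 (4 bytes per dimension). Ten million 384-dimensional vectors need roughly 15 GB. When brute-force search over a collection that size becomes too slow, use IndexIVFFlat or a vector database.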
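A back-of-the-envelope estimate helps when sizing a flat index. The sketch below assumes uncompressed float32 storage (4 bytes per value) and ignores any per-index overhead; the function name flat_index_bytes is hypothetical.

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Approximate raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dimension * bytes_per_value

total_bytes = flat_index_bytes(10_000_000, 384)
gib = total_bytes / 1024**3

print(f"{total_bytes:,} bytes = {gib:.1f} GiB")
```

Doubling the dimension to 768 doubles the footprint, which is one practical reason to prefer 384-dimensional models for large collections.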

Do embeddings work across languages?

Many Sentence Transformers models, including the ones used in this tutorial, are trained primarily on English text. Use multilingual models like paraphrase-multilingual-MiniLM-L12-v2 for cross-language search.

Further Reading