Embeddings & Vector Search
What Are Embeddings?
An embedding converts text into a list of numbers (a vector) that captures its meaning.
"I love cats" → [0.23, -0.45, 0.89, 0.12, ..., -0.34] (1536 numbers)
"I adore cats" → [0.24, -0.44, 0.88, 0.13, ..., -0.33] (very similar!)
"Buy stocks" → [0.91, 0.02, -0.56, 0.77, ..., 0.45] (very different)
Key insight: Similar meaning = similar vectors. Unrelated meaning = distant vectors.
Analogy: GPS coordinates for meaning. "Paris" and "Lyon" have close coordinates (both in France). "Paris" and "Tokyo" have distant coordinates. Embeddings do the same but for meaning in 1000+ dimensions.
The Classic Example
king - man + woman ≈ queen
Vector math:
[0.5, 0.8, 0.3] - [0.4, 0.1, 0.5] + [0.6, 0.2, 0.7] = [0.7, 0.9, 0.5]
Closest vector to [0.7, 0.9, 0.5] in vocabulary → "queen"
The model learned that the relationship between "king" and "man" is similar to the relationship between "queen" and "woman" — without anyone teaching it this explicitly.
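The toy arithmetic above can be checked directly. The 3-dimensional vectors here are stand-ins for real embeddings, which have hundreds or thousands of dimensions:

```python
# Toy 3-D stand-ins for real embedding vectors (real ones have 1000+ dimensions)
king  = [0.5, 0.8, 0.3]
man   = [0.4, 0.1, 0.5]
woman = [0.6, 0.2, 0.7]

# king - man + woman, computed element by element
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # approximately [0.7, 0.9, 0.5]
```

In a real embedding space, the last step is a nearest-neighbor lookup: find the vocabulary vector closest to `result`, which lands on "queen".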
How Similarity Is Measured
Cosine Similarity
The most common metric for text. It measures the cosine of the angle between two vectors, ignoring their magnitudes.
similarity = (A · B) / (|A| × |B|)
Result range:
1.0 = identical meaning
0.0 = unrelated
-1.0 = opposite meaning
Example:
sim("I love cats", "I adore kittens") = 0.95 (very similar)
sim("I love cats", "The weather is nice") = 0.15 (unrelated)
sim("I love cats", "I hate cats") = 0.65 (related topic, different sentiment)
Other Distance Metrics
| Metric | How It Works | Used For |
|---|---|---|
| Cosine similarity | Angle between vectors | Most text similarity tasks |
| Euclidean distance | Straight-line distance | When magnitude matters |
| Dot product | Angle and magnitude combined; equals cosine similarity for normalized vectors | Optimized retrieval |
Embedding Models
| Model | Dimensions | Provider | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Good quality, cheap |
| text-embedding-3-large | 3072 | OpenAI | Best quality from OpenAI |
| embed-v4 | 1024 | Cohere | Good multilingual |
| BGE-large-en | 1024 | Open Source (BAAI) | Best open-source |
| E5-mistral-7b | 4096 | Open Source | Instruction-following embeddings |
| Voyage-3 | 1024 | Voyage AI | Strong for code |
Pricing: Very cheap compared to LLMs. OpenAI text-embedding-3-small: $0.02 per 1M tokens.
Vector Databases
Regular databases search by exact match or range. Vector databases search by similarity.
Traditional DB:
SELECT * FROM products WHERE category = 'shoes' AND price < 100
Vector DB:
"Find me products similar to 'comfortable running shoes for rainy weather'"
→ Returns products with vectors closest to the query vector
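The core operation of a vector database can be sketched as a brute-force, in-memory toy. This is not how production engines work internally (they avoid the full scan with ANN indexes like HNSW and IVF), but it shows the add/query interface they all share. The item IDs and vectors are made up for illustration:

```python
import math

class ToyVectorStore:
    """Brute-force in-memory store: O(n) scan per query (real DBs use ANN indexes)."""

    def __init__(self):
        self.items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, top_k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(y * y for y in b)))
        # Score every stored vector against the query, highest similarity first
        scored = sorted(((cos(vector, v), item_id) for item_id, v in self.items),
                        reverse=True)
        return [(item_id, score) for score, item_id in scored[:top_k]]

store = ToyVectorStore()
store.add("running-shoes", [0.9, 0.1, 0.2])
store.add("rain-boots",    [0.8, 0.3, 0.1])
store.add("laptop",        [0.1, 0.9, 0.8])
print(store.query([0.85, 0.2, 0.15], top_k=2))  # shoe-like items rank first
```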
Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, zero-ops |
| Weaviate | Open source + cloud | Hybrid search (vector + keyword) |
| Qdrant | Open source + cloud | High performance, filtering |
| Chroma | Open source | Prototyping, simple local use |
| pgvector | PostgreSQL extension | Already using PostgreSQL |
| Milvus | Open source | Large-scale (billions of vectors) |
| FAISS | Library (Meta) | In-memory, fastest for research |
How Vector Search Works (ANN)
Searching ALL vectors for the closest match is O(n) — too slow for millions of vectors.
Approximate Nearest Neighbor (ANN) algorithms:
HNSW (Hierarchical Navigable Small World):
- Build a multi-layer graph:
  - Top layer: few nodes, long-range connections (navigate quickly)
  - Middle layers: more nodes, medium-range connections
  - Bottom layer: all nodes, short-range connections (precise search)
- Search: start at the top → hop through the graph → drill down layer by layer → find the nearest neighbor at the bottom
- ~95-99% recall (finds true nearest neighbor 95%+ of the time)
- Sub-millisecond search over millions of vectors
- Used by: Qdrant, Weaviate, pgvector
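The graph-navigation idea can be shown with a single-layer simplification: greedy search over a hand-built neighbor graph, without HNSW's hierarchy. Real implementations build the graph automatically, add layers, and track a beam of candidates instead of one node. The 2-D points here are made up for illustration:

```python
import math

# Toy 2-D points; a real index stores high-dimensional embeddings
points = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.5),
    "d": (3.0, 1.0), "e": (4.0, 1.0), "f": (2.0, 3.0),
}
# Hand-built neighbor graph (HNSW constructs this during insertion)
graph = {
    "a": ["b", "f"], "b": ["a", "c"], "c": ["b", "d", "f"],
    "d": ["c", "e"], "e": ["d"], "f": ["a", "c"],
}

def greedy_search(query, entry="a"):
    """Hop to whichever neighbor is closer to the query; stop at a local minimum."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # no neighbor improves: approximate nearest found
        current = best

print(greedy_search((3.9, 1.1)))  # hops through the graph and reaches "e"
```

Because the search is greedy, it can get stuck in a local minimum; the hierarchy and candidate beams in real HNSW are what push recall up to the 95-99% range.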
IVF (Inverted File Index):
- Cluster vectors into groups
- At query time, only search the closest clusters
- Faster but lower recall than HNSW
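A minimal sketch of the IVF idea, with cluster centroids picked by hand here; real indexes learn them with k-means over the stored vectors:

```python
import math

# Two hand-picked centroids partition the vectors into clusters
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = {0: [], 1: []}  # centroid index -> vectors assigned to it

for v in [(0.5, 0.2), (1.0, 0.8), (9.5, 9.9), (10.2, 10.5)]:
    nearest = min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))
    clusters[nearest].append(v)

def ivf_search(query, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [v for i in probe for v in clusters[i]]
    return min(candidates, key=lambda v: math.dist(query, v))

print(ivf_search((9.0, 9.0)))  # scans only the (10, 10) cluster
```

Recall drops when the true nearest neighbor sits in a cluster that was not probed; raising `nprobe` trades speed back for accuracy.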
The RAG Connection
Embeddings + vector search = the foundation of RAG (Retrieval-Augmented Generation).
Setup (one-time):
Your documents → Split into chunks → Embed each chunk → Store in vector DB
At query time:
User question → Embed question → Search vector DB → Get top 5 similar chunks
→ Send chunks + question to LLM → LLM answers using the chunks
User: "How do I configure the payment webhook?"
1. Embed question → [0.12, 0.89, ...]
2. Vector search → finds chunks from payment-docs.md and webhook-guide.md
3. Send to Claude: "Using this context: [chunks], answer: How do I configure..."
4. Claude answers with grounded, accurate information from YOUR docs
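The retrieval half of this pipeline can be sketched end to end. A toy keyword-count vector stands in for a real embedding model, the chunk texts are invented, and the final LLM call is left as an assembled prompt string:

```python
import math
import re

def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding API: counts a few hand-picked keywords."""
    keywords = ["payment", "webhook", "configure", "weather"]
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(k)) for k in keywords]

chunks = [
    "To configure the payment webhook, set the callback URL in the dashboard.",
    "Webhook retries happen three times with exponential backoff.",
    "The weather endpoint returns a forecast for the next seven days.",
]
chunk_vecs = [toy_embed(c) for c in chunks]  # setup step: embed and store

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Query step: embed the question, rank chunks, build the grounded prompt
question = "How do I configure the payment webhook?"
qv = toy_embed(question)
top = sorted(zip(chunks, chunk_vecs), key=lambda cv: cos(qv, cv[1]), reverse=True)[:2]
context = "\n".join(c for c, _ in top)
prompt = f"Using this context:\n{context}\n\nAnswer: {question}"
# `prompt` would now be sent to the LLM (step 3 above)
```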
See 17 - RAG (Retrieval-Augmented Generation) for the complete pipeline.
Use Cases Beyond RAG
| Use Case | How |
|---|---|
| Semantic search | Search by meaning, not just keywords |
| Recommendation | "Users who liked X" → find similar embeddings |
| Clustering | Group similar documents/tickets/feedback |
| Deduplication | Find near-duplicate content |
| Classification | Compare against labeled examples |
| Anomaly detection | Find vectors far from normal clusters |
Practical Example: Building a Simple Semantic Search
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Your documents
docs = [
    "Python is a versatile programming language",
    "JavaScript runs in the browser",
    "Machine learning uses statistical models",
    "The weather in Paris is mild",
]

# Embed all documents
doc_embeddings = []
for doc in docs:
    response = client.embeddings.create(
        input=doc,
        model="text-embedding-3-small",
    )
    doc_embeddings.append(response.data[0].embedding)

# User query
query = "What language is good for web development?"
query_resp = client.embeddings.create(
    input=query,
    model="text-embedding-3-small",
)
query_embedding = query_resp.data[0].embedding

# Find most similar (cosine similarity)
similarities = [
    np.dot(query_embedding, doc_emb)
    / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
    for doc_emb in doc_embeddings
]

best = docs[int(np.argmax(similarities))]
# Result: "JavaScript runs in the browser" scores highest ✅
```
Resources
- 🔗 OpenAI — Embeddings Guide
- 🎥 3Blue1Brown — Word Embeddings
- 🔗 Pinecone — What Are Embeddings
- 🔗 MTEB Leaderboard — Compare Embedding Models
- 🔗 pgvector — PostgreSQL Extension
Previous: 04 - Temperature, Top-P & Sampling | Next: 06 - Fine-Tuning vs Prompting vs RAG