RAG (Retrieval-Augmented Generation)
What Is RAG?
RAG is a pattern where you retrieve relevant documents first, then feed them to the LLM as context so it can answer questions grounded in your data — not just its training data.
Without RAG: User question → LLM (uses training data only) → May hallucinate
With RAG: User question → Search your docs → LLM (uses your docs + training) → Grounded answer
This is how "chat with your docs" products work. It is the most practical way to give an LLM access to private, up-to-date, or domain-specific knowledge.
The Full RAG Pipeline
Step-by-Step
```
┌─────────────────── INDEXING (done once, updated as docs change) ───────────────────┐
│                                                                                     │
│  1. LOAD          2. CHUNK         3. EMBED            4. STORE                     │
│  PDF, MD, HTML →  Split into    →  Convert chunks  →   Save vectors                 │
│  code, docs       small pieces     to vectors          in vector DB                 │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────── QUERYING (every user question) ─────────────────────────────────┐
│                                                                                     │
│  5. EMBED QUERY     6. SEARCH         7. AUGMENT             8. GENERATE            │
│  User question   →  Find similar   →  Build prompt with   →  LLM answers           │
│  → vector           chunks in DB      retrieved chunks       using context          │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
Step 1: Document Loading
Load your source documents into raw text:
| Source | Format | Tool Examples |
|---|---|---|
| PDFs | Layout-heavy, needs text extraction | PyPDF, Unstructured, pdfplumber |
| Markdown | Clean text, easy to parse | Built-in parsers |
| HTML | Web pages, needs cleanup | BeautifulSoup, Trafilatura |
| Code | Source files | Tree-sitter for structure-aware parsing |
| Databases | Rows/records to text | SQL queries, export scripts |
Step 2: Chunking
LLMs have context limits, and embeddings work better on focused text. Split documents into chunks.
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed size | Split every N tokens (e.g., 300) | Simple, predictable |
| Recursive | Split by paragraphs, then sentences, then tokens | General purpose (most common) |
| Semantic | Split when the topic changes (using embeddings) | Long documents with topic shifts |
| By structure | Split on headers, sections, code blocks | Markdown, HTML, code |
Chunking Best Practices
- Chunk size: 200-500 tokens per chunk (sweet spot)
- Overlap: 50-100 tokens between consecutive chunks (prevents losing context at boundaries)
- Metadata: store source file, page number, and section title with each chunk
Example of overlap:
```
Chunk 1: "...transformers use self-attention to process tokens in parallel.
          This allows the model to capture long-range dependencies."

Chunk 2: "This allows the model to capture long-range dependencies.
          Unlike RNNs, transformers don't process tokens sequentially..."
          ↑ overlap — same sentence appears in both chunks
```
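The strategy above can be sketched as a minimal fixed-size chunker with overlap. This is a simplification: whitespace-separated words stand in for tokens here, whereas a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of ~chunk_size "tokens", with `overlap` tokens
    shared between consecutive chunks (words stand in for tokens here)."""
    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the end of the text
    return chunks

chunks = chunk_text("word " * 1000, chunk_size=300, overlap=50)
# 1000 words → 4 chunks starting at words 0, 250, 500, 750;
# the last 50 words of each chunk repeat as the first 50 of the next
```

Recursive and structure-aware chunkers follow the same idea but pick split points at paragraph, sentence, or header boundaries instead of fixed offsets.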
Step 3: Embedding
Convert each chunk into a vector (list of numbers) that captures its meaning. Similar chunks have similar vectors. (See 05 - Embeddings & Vector Search for the theory.)
```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

chunks = ["Transformers use self-attention...", "RAG retrieves documents..."]
vectors = embed_texts(chunks)
# Each vector: [0.023, -0.041, 0.078, ...] (1536 dimensions)
```
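"Similar vectors" is usually measured with cosine similarity. A plain-Python sketch of the math (production systems use NumPy or let the vector DB compute distances):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    1.0 = pointing the same way, 0.0 = unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

"Find the most similar chunks" in step 6 means: rank stored vectors by this score against the query vector and take the top K.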
Embedding Models
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Good balance of quality and cost |
| text-embedding-3-large | OpenAI | 3072 | Higher quality, more expensive |
| embed-english-v3.0 | Cohere | 1024 | Strong retrieval performance |
| BAAI/bge-large | Open source | 1024 | Run locally, no API cost |
| nomic-embed-text | Open source | 768 | Good quality, runs on CPU |
Step 4: Vector Storage
Store embeddings in a vector database for fast similarity search:
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, scales to billions |
| pgvector | PostgreSQL extension | Already using Postgres |
| Chroma | Lightweight, local | Prototyping, small datasets |
| Weaviate | Cloud or self-hosted | Hybrid search built in |
| Qdrant | Cloud or self-hosted | High performance, rich filtering |
Example: Storing in Chroma
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

collection.add(
    ids=["chunk_1", "chunk_2", "chunk_3"],
    documents=["Transformers use...", "RAG retrieves...", "Embeddings capture..."],
    metadatas=[
        {"source": "transformers.md", "page": 1},
        {"source": "rag.md", "page": 3},
        {"source": "embeddings.md", "page": 1},
    ],
)
```
Steps 5-6: Query and Search
When a user asks a question:
```python
# Embed the question using the SAME embedding model
query = "How does self-attention work?"

results = collection.query(
    query_texts=[query],
    n_results=5,  # top K chunks
)
# Returns the 5 most semantically similar chunks
```
Retrieval Strategies
| Strategy | How | When |
|---|---|---|
| Semantic search | Embed query, find nearest vectors | Default approach |
| Keyword search | BM25 / full-text search | Exact term matching (names, codes) |
| Hybrid | Combine semantic + keyword, merge results | Best of both worlds (recommended) |
| Reranking | Retrieve 20 chunks, then rerank to pick best 5 | Higher quality, slightly slower |
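One common way to merge keyword and semantic result lists for hybrid search is Reciprocal Rank Fusion (RRF). A sketch, assuming each search returns chunk IDs in ranked order:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each item scores 1/(k + rank) per list,
    summed across lists. k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_3", "chunk_1", "chunk_7"]  # from vector search
keyword = ["chunk_1", "chunk_9", "chunk_3"]   # from BM25
print(reciprocal_rank_fusion([semantic, keyword]))
# chunk_1 and chunk_3 appear in both lists, so they rank first
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.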
Reranking
Initial retrieval casts a wide net. A reranker (a separate model) rescores results for relevance:
```python
# Pseudocode sketch: vector_db and reranker stand in for your vector store
# client and reranker model (e.g., Cohere Rerank, a cross-encoder, bge-reranker)

# 1. Retrieve top 20 chunks (fast, broad)
candidates = vector_db.search(query, top_k=20)

# 2. Rerank to find the best 5 (slower, more precise)
reranked = reranker.rank(query, candidates, top_k=5)
```
Steps 7-8: Augment and Generate
Build a prompt that combines instructions, retrieved context, and the question:
```python
def build_rag_prompt(question: str, chunks: list[str]) -> list[dict]:
    context = "\n\n---\n\n".join(chunks)
    return [
        {
            "role": "user",
            "content": f"""Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}""",
        }
    ]

messages = build_rag_prompt("How does self-attention work?", retrieved_chunks)
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=messages,
)
```
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Chunks too large (>1000 tokens) | Retrieves vaguely relevant walls of text | Reduce to 200-500 tokens |
| Chunks too small (<50 tokens) | Loses context, fragments meaning | Increase size, add overlap |
| No overlap | Answers missed at chunk boundaries | Add 50-100 token overlap |
| Wrong embedding model | Poor retrieval quality | Test multiple models on your data |
| No metadata filtering | Retrieves from wrong document/section | Add source, date, category metadata |
| Stale index | Answers based on outdated documents | Build an update pipeline |
Production Considerations
- Document update pipeline — watch for file changes, re-chunk and re-embed incrementally
- Metadata filtering — filter by date, source, category before vector search
- Caching — cache frequent queries and their results
- Evaluation — measure retrieval precision, answer accuracy, and faithfulness (see 18 - Evaluations & Testing AI)
- Hybrid search — combine keyword and semantic for best results
- Chunk citations — return source references so users can verify answers
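Metadata filtering from the list above can be as simple as restricting candidates before similarity scoring. A sketch over an in-memory chunk list (vector DBs such as Chroma and Qdrant expose the same idea as a filter argument applied before the vector search):

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every criterion,
    mirroring a vector DB's pre-search metadata filter."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in criteria.items())
    ]

chunks = [
    {"id": "chunk_1", "metadata": {"source": "transformers.md", "year": 2024}},
    {"id": "chunk_2", "metadata": {"source": "rag.md", "year": 2024}},
    {"id": "chunk_3", "metadata": {"source": "rag.md", "year": 2023}},
]
print(filter_chunks(chunks, source="rag.md", year=2024))  # only chunk_2
```

Filtering first shrinks the search space and prevents answers from leaking in from the wrong document, date range, or tenant.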
Tools & Frameworks
| Tool | What It Does |
|---|---|
| LangChain | Full RAG pipeline with many integrations |
| LlamaIndex | Purpose-built for RAG, strong indexing features |
| Haystack | Production RAG pipelines with modular components |
| Unstructured | Document parsing (PDF, DOCX, HTML) |
Resources
- Anthropic RAG Guide
- Pinecone Learning Center
- LlamaIndex Documentation
- Chunking Strategies Comparison
Previous: 16 - AI Agents | Next: 18 - Evaluations & Testing AI