RAG (Retrieval-Augmented Generation)

What Is RAG?

RAG is a pattern where you retrieve relevant documents first, then feed them to the LLM as context so it can answer questions grounded in your data — not just its training data.

Without RAG:  User question → LLM (uses training data only) → May hallucinate
With RAG:     User question → Search your docs → LLM (uses your docs + training) → Grounded answer

This is how "chat with your docs" products work. It is the most practical way to give an LLM access to private, up-to-date, or domain-specific knowledge.


The Full RAG Pipeline

Step-by-Step

┌─────────────────── INDEXING (done once, updated as docs change) ───────────────────┐
│                                                                                     │
│  1. LOAD         2. CHUNK         3. EMBED           4. STORE                       │
│  PDF, MD, HTML → Split into      → Convert chunks    → Save vectors                │
│  code, docs      small pieces      to vectors          in vector DB                 │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────── QUERYING (every user question) ─────────────────────────────────┐
│                                                                                     │
│  5. EMBED QUERY   6. SEARCH        7. AUGMENT          8. GENERATE                 │
│  User question → Find similar    → Build prompt with  → LLM answers                │
│  → vector         chunks in DB     retrieved chunks     using context               │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

Step 1: Document Loading

Load your source documents into raw text:

| Source | Format | Tool Examples |
|---|---|---|
| PDFs | Structured text with layout | PyPDF, Unstructured, pdfplumber |
| Markdown | Clean text, easy to parse | Built-in parsers |
| HTML | Web pages, needs cleanup | BeautifulSoup, Trafilatura |
| Code | Source files | Tree-sitter for structure-aware parsing |
| Databases | Rows/records to text | SQL queries, export scripts |
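
As a minimal sketch of the loading step, here is a stdlib-only HTML-to-text converter. Real pipelines would typically reach for BeautifulSoup or Trafilatura (see the table above); this version just shows the shape of the task:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


doc = "<html><body><h1>RAG</h1><script>x=1</script><p>Retrieve, then generate.</p></body></html>"
print(html_to_text(doc))  # → RAG Retrieve, then generate.
```

Whatever loader you use, the goal is the same: clean plain text (plus metadata such as the source path) that the chunking step can consume.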

Step 2: Chunking

LLMs have context limits, and embeddings work better on focused text. Split documents into chunks.

Chunking Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Fixed size | Split every N tokens (e.g., 300) | Simple, predictable |
| Recursive | Split by paragraphs, then sentences, then tokens | General purpose (most common) |
| Semantic | Split when the topic changes (using embeddings) | Long documents with topic shifts |
| By structure | Split on headers, sections, code blocks | Markdown, HTML, code |

Chunking Best Practices

Chunk size:    200-500 tokens per chunk (sweet spot)
Overlap:       50-100 tokens between consecutive chunks
               (prevents losing context at boundaries)
Metadata:      Store source file, page number, section title with each chunk

Example of overlap:

Chunk 1: "...transformers use self-attention to process tokens in parallel.
          This allows the model to capture long-range dependencies."

Chunk 2: "This allows the model to capture long-range dependencies.
          Unlike RNNs, transformers don't process tokens sequentially..."
          ↑ overlap — same sentence appears in both chunks
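
The fixed-size strategy with overlap can be sketched in a few lines. This version splits on whitespace words as a stand-in for tokens; a real pipeline would count tokens with a tokenizer such as tiktoken:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks


chunks = chunk_text("one two three four five six seven eight", chunk_size=4, overlap=2)
# chunks[0] ends with the same two words chunks[1] starts with
```

The overlap means a sentence that straddles a boundary is fully contained in at least one chunk, which is exactly the failure mode the best practices above warn about.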

Step 3: Embedding

Convert each chunk into a vector (list of numbers) that captures its meaning. Similar chunks have similar vectors. (See 05 - Embeddings & Vector Search for the theory.)

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

chunks = ["Transformers use self-attention...", "RAG retrieves documents..."]
vectors = embed_texts(chunks)
# Each vector: [0.023, -0.041, 0.078, ...] (1536 dimensions)
```

Embedding Models

| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Good balance of quality and cost |
| text-embedding-3-large | OpenAI | 3072 | Higher quality, more expensive |
| embed-english-v3.0 | Cohere | 1024 | Strong retrieval performance |
| BAAI/bge-large | Open source | 1024 | Run locally, no API cost |
| nomic-embed-text | Open source | 768 | Good quality, runs on CPU |
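
"Similar chunks have similar vectors" is usually measured with cosine similarity. A minimal, dependency-free sketch of the comparison the vector database performs for you:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (unrelated)
```

In practice you never compute this by hand over millions of chunks; vector databases use approximate nearest-neighbor indexes to make the search fast.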

Step 4: Vector Storage

Store embeddings in a vector database for fast similarity search:

| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, scales to billions |
| pgvector | PostgreSQL extension | Already using Postgres |
| Chroma | Lightweight, local | Prototyping, small datasets |
| Weaviate | Cloud or self-hosted | Hybrid search built in |
| Qdrant | Cloud or self-hosted | High performance, rich filtering |

Example: Storing in Chroma

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

collection.add(
    ids=["chunk_1", "chunk_2", "chunk_3"],
    documents=["Transformers use...", "RAG retrieves...", "Embeddings capture..."],
    metadatas=[
        {"source": "transformers.md", "page": 1},
        {"source": "rag.md", "page": 3},
        {"source": "embeddings.md", "page": 1},
    ],
)
```

Steps 5-6: Query and Search

When a user asks a question:

```python
# Embed the question using the SAME embedding model
query = "How does self-attention work?"

results = collection.query(
    query_texts=[query],
    n_results=5  # top K chunks
)
# Returns the 5 most semantically similar chunks
```

Retrieval Strategies

| Strategy | How | When |
|---|---|---|
| Semantic search | Embed query, find nearest vectors | Default approach |
| Keyword search | BM25 / full-text search | Exact term matching (names, codes) |
| Hybrid | Combine semantic + keyword, merge results | Best of both worlds (recommended) |
| Reranking | Retrieve 20 chunks, then rerank to pick best 5 | Higher quality, slightly slower |
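
One common way to merge the semantic and keyword result lists for hybrid search is Reciprocal Rank Fusion (RRF). A minimal sketch over lists of chunk IDs (k=60 is the conventional default constant):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items ranked high in any list rise."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Higher rank (earlier position) contributes a larger score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["chunk_3", "chunk_1", "chunk_7"]   # from vector search
keyword = ["chunk_1", "chunk_9", "chunk_3"]    # from BM25
merged = reciprocal_rank_fusion([semantic, keyword])
# chunk_1 and chunk_3 appear in both lists, so they rank first
```

RRF only needs rank positions, not raw scores, which is why it is a popular way to combine retrievers whose scores are on different scales.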

Reranking

Initial retrieval casts a wide net. A reranker (a separate model) rescores results for relevance:

```python
# 1. Retrieve top 20 chunks (fast, broad)
candidates = vector_db.search(query, top_k=20)

# 2. Rerank to find the best 5 (slower, more precise)
reranked = reranker.rank(query, candidates, top_k=5)

# Reranker models: Cohere Rerank, cross-encoders, bge-reranker
```

Steps 7-8: Augment and Generate

Build a prompt that combines instructions, retrieved context, and the question:

```python
from anthropic import Anthropic

client = Anthropic()

def build_rag_prompt(question: str, chunks: list[str]) -> list[dict]:
    context = "\n\n---\n\n".join(chunks)
    return [
        {
            "role": "user",
            "content": f"""Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}""",
        }
    ]

messages = build_rag_prompt("How does self-attention work?", retrieved_chunks)
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=messages,
)
```

Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Chunks too large (>1000 tokens) | Retrieves vaguely relevant walls of text | Reduce to 200-500 tokens |
| Chunks too small (<50 tokens) | Loses context, fragments meaning | Increase size, add overlap |
| No overlap | Answers missed at chunk boundaries | Add 50-100 token overlap |
| Wrong embedding model | Poor retrieval quality | Test multiple models on your data |
| No metadata filtering | Retrieves from wrong document/section | Add source, date, category metadata |
| Stale index | Answers based on outdated documents | Build an update pipeline |

Production Considerations

  • Document update pipeline — watch for file changes, re-chunk and re-embed incrementally
  • Metadata filtering — filter by date, source, category before vector search
  • Caching — cache frequent queries and their results
  • Evaluation — measure retrieval precision, answer accuracy, and faithfulness (see 18 - Evaluations & Testing AI)
  • Hybrid search — combine keyword and semantic for best results
  • Chunk citations — return source references so users can verify answers
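
Chunk citations can be as simple as numbering the context blocks and tagging each with its source, so the model (and the user) can refer to them by [n]. A hypothetical sketch (`build_cited_context` and the chunk dict shape are illustrative, not a library API):

```python
def build_cited_context(chunks: list[dict]) -> str:
    """Number each chunk and tag it with its source so answers can cite [n]."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    return "\n\n".join(blocks)


chunks = [
    {"source": "transformers.md", "text": "Transformers use self-attention."},
    {"source": "rag.md", "text": "RAG retrieves documents before generating."},
]
print(build_cited_context(chunks))
# [1] (source: transformers.md)
# Transformers use self-attention.
#
# [2] (source: rag.md)
# RAG retrieves documents before generating.
```

Pair this with a prompt instruction like "cite the context blocks you used as [n]" and return the source metadata alongside the answer so users can verify it.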

Tools & Frameworks

| Tool | What It Does |
|---|---|
| LangChain | Full RAG pipeline with many integrations |
| LlamaIndex | Purpose-built for RAG, strong indexing features |
| Haystack | Production RAG pipelines with modular components |
| Unstructured | Document parsing (PDF, DOCX, HTML) |

Previous: 16 - AI Agents | Next: 18 - Evaluations & Testing AI