RAG (Retrieval-Augmented Generation)

What Is RAG?

RAG is a pattern where you retrieve relevant documents first, then feed them to the LLM as context so it can answer questions grounded in your data — not just its training data.

Without RAG:  User question → LLM (uses training data only) → May hallucinate
With RAG:     User question → Search your docs → LLM (uses your docs + training) → Grounded answer

This is how "chat with your docs" products work. It is the most practical way to give an LLM access to private, up-to-date, or domain-specific knowledge.


The Full RAG Pipeline

Step-by-Step

┌─────────────────── INDEXING (done once, updated as docs change) ───────────────────┐
│                                                                                     │
│  1. LOAD         2. CHUNK         3. EMBED           4. STORE                       │
│  PDF, MD, HTML → Split into      → Convert chunks    → Save vectors                │
│  code, docs      small pieces      to vectors          in vector DB                 │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────── QUERYING (every user question) ─────────────────────────────────┐
│                                                                                     │
│  5. EMBED QUERY   6. SEARCH        7. AUGMENT          8. GENERATE                 │
│  User question → Find similar    → Build prompt with  → LLM answers                │
│  → vector         chunks in DB     retrieved chunks     using context               │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

Step 1: Document Loading

Load your source documents into raw text:

| Source | Format | Tool Examples |
|---|---|---|
| PDFs | Structured text with layout | PyPDF, Unstructured, pdfplumber |
| Markdown | Clean text, easy to parse | Built-in parsers |
| HTML | Web pages, needs cleanup | BeautifulSoup, Trafilatura |
| Code | Source files | Tree-sitter for structure-aware parsing |
| Databases | Rows/records to text | SQL queries, export scripts |
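
As a minimal sketch of the loading step, here is a stdlib-only HTML-to-text converter. Real pipelines would typically reach for BeautifulSoup or Trafilatura (see the table above); this version just shows the shape of the task:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


doc = "<html><body><h1>RAG</h1><script>x=1</script><p>Retrieve, then generate.</p></body></html>"
print(html_to_text(doc))  # → RAG Retrieve, then generate.
```

Whatever loader you use, the goal is the same: clean plain text (plus metadata such as the source path) that the chunking step can consume.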

Step 2: Chunking

LLMs have context limits, and embeddings work better on focused text. Split documents into chunks.

Chunking Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Fixed size | Split every N tokens (e.g., 300) | Simple, predictable |
| Recursive | Split by paragraphs, then sentences, then tokens | General purpose (most common) |
| Semantic | Split when the topic changes (using embeddings) | Long documents with topic shifts |
| By structure | Split on headers, sections, code blocks | Markdown, HTML, code |

Chunking Best Practices

Chunk size:    200-500 tokens per chunk (sweet spot)
Overlap:       50-100 tokens between consecutive chunks
               (prevents losing context at boundaries)
Metadata:      Store source file, page number, section title with each chunk

Example of overlap:

Chunk 1: "...transformers use self-attention to process tokens in parallel.
          This allows the model to capture long-range dependencies."

Chunk 2: "This allows the model to capture long-range dependencies.
          Unlike RNNs, transformers don't process tokens sequentially..."
          ↑ overlap — same sentence appears in both chunks
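
The fixed-size strategy with overlap can be sketched in a few lines. This version splits on whitespace words as a stand-in for tokens; a real pipeline would count tokens with a tokenizer such as tiktoken:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks


chunks = chunk_text("one two three four five six seven eight", chunk_size=4, overlap=2)
# chunks[0] ends with the same two words chunks[1] starts with
```

The overlap means a sentence that straddles a boundary is fully contained in at least one chunk, which is exactly the failure mode the best practices above warn about.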

Step 3: Embedding

Convert each chunk into a vector (list of numbers) that captures its meaning. Similar chunks have similar vectors. (See 05 - Embeddings & Vector Search for the theory.)

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

chunks = ["Transformers use self-attention...", "RAG retrieves documents..."]
vectors = embed_texts(chunks)
# Each vector: [0.023, -0.041, 0.078, ...] (1536 dimensions)
```

Embedding Models

| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Good balance of quality and cost |
| text-embedding-3-large | OpenAI | 3072 | Higher quality, more expensive |
| embed-english-v3.0 | Cohere | 1024 | Strong retrieval performance |
| BAAI/bge-large | Open source | 1024 | Run locally, no API cost |
| nomic-embed-text | Open source | 768 | Good quality, runs on CPU |
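
"Similar chunks have similar vectors" is usually measured with cosine similarity. A minimal, dependency-free sketch of the comparison the vector database performs for you:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (unrelated)
```

In practice you never compute this by hand over millions of chunks; vector databases use approximate nearest-neighbor indexes to make the search fast.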

Step 4: Vector Storage

Store embeddings in a vector database for fast similarity search:

| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, scales to billions |
| pgvector | PostgreSQL extension | Already using Postgres |
| Chroma | Lightweight, local | Prototyping, small datasets |
| Weaviate | Cloud or self-hosted | Hybrid search built in |
| Qdrant | Cloud or self-hosted | High performance, rich filtering |

Example: Storing in Chroma

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

collection.add(
    ids=["chunk_1", "chunk_2", "chunk_3"],
    documents=["Transformers use...", "RAG retrieves...", "Embeddings capture..."],
    metadatas=[
        {"source": "transformers.md", "page": 1},
        {"source": "rag.md", "page": 3},
        {"source": "embeddings.md", "page": 1},
    ],
)
```

Steps 5-6: Query and Search

When a user asks a question:

```python
# Embed the question using the SAME embedding model
query = "How does self-attention work?"

results = collection.query(
    query_texts=[query],
    n_results=5  # top K chunks
)
# Returns the 5 most semantically similar chunks
```

Retrieval Strategies

| Strategy | How | When |
|---|---|---|
| Semantic search | Embed query, find nearest vectors | Default approach |
| Keyword search | BM25 / full-text search | Exact term matching (names, codes) |
| Hybrid | Combine semantic + keyword, merge results | Best of both worlds (recommended) |
| Reranking | Retrieve 20 chunks, then rerank to pick best 5 | Higher quality, slightly slower |
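
One common way to merge the semantic and keyword result lists for hybrid search is Reciprocal Rank Fusion (RRF). A minimal sketch over lists of chunk IDs (k=60 is the conventional default constant):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items ranked high in any list rise."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Higher rank (earlier position) contributes a larger score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["chunk_3", "chunk_1", "chunk_7"]   # from vector search
keyword = ["chunk_1", "chunk_9", "chunk_3"]    # from BM25
merged = reciprocal_rank_fusion([semantic, keyword])
# chunk_1 and chunk_3 appear in both lists, so they rank first
```

RRF only needs rank positions, not raw scores, which is why it is a popular way to combine retrievers whose scores are on different scales.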

Reranking

Initial retrieval casts a wide net. A reranker (a separate model) rescores results for relevance:

```python
# 1. Retrieve top 20 chunks (fast, broad)
candidates = vector_db.search(query, top_k=20)

# 2. Rerank to find the best 5 (slower, more precise)
reranked = reranker.rank(query, candidates, top_k=5)

# Reranker models: Cohere Rerank, cross-encoders, bge-reranker
```

Steps 7-8: Augment and Generate

Build a prompt that combines instructions, retrieved context, and the question:

```python
from anthropic import Anthropic

client = Anthropic()

def build_rag_prompt(question: str, chunks: list[str]) -> list[dict]:
    context = "\n\n---\n\n".join(chunks)
    return [
        {
            "role": "user",
            "content": f"""Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}""",
        }
    ]

messages = build_rag_prompt("How does self-attention work?", retrieved_chunks)
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=messages,
)
```

Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| Chunks too large (>1000 tokens) | Retrieves vaguely relevant walls of text | Reduce to 200-500 tokens |
| Chunks too small (<50 tokens) | Loses context, fragments meaning | Increase size, add overlap |
| No overlap | Answers missed at chunk boundaries | Add 50-100 token overlap |
| Wrong embedding model | Poor retrieval quality | Test multiple models on your data |
| No metadata filtering | Retrieves from wrong document/section | Add source, date, category metadata |
| Stale index | Answers based on outdated documents | Build an update pipeline |

Production Considerations

  • Document update pipeline — watch for file changes, re-chunk and re-embed incrementally
  • Metadata filtering — filter by date, source, category before vector search
  • Caching — cache frequent queries and their results
  • Evaluation — measure retrieval precision, answer accuracy, and faithfulness (see 18 - Evaluations & Testing AI)
  • Hybrid search — combine keyword and semantic for best results
  • Chunk citations — return source references so users can verify answers
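
Chunk citations can be as simple as numbering the context blocks and tagging each with its source, so the model (and the user) can refer to them by [n]. A hypothetical sketch (`build_cited_context` and the chunk dict shape are illustrative, not a library API):

```python
def build_cited_context(chunks: list[dict]) -> str:
    """Number each chunk and tag it with its source so answers can cite [n]."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    return "\n\n".join(blocks)


chunks = [
    {"source": "transformers.md", "text": "Transformers use self-attention."},
    {"source": "rag.md", "text": "RAG retrieves documents before generating."},
]
print(build_cited_context(chunks))
# [1] (source: transformers.md)
# Transformers use self-attention.
#
# [2] (source: rag.md)
# RAG retrieves documents before generating.
```

Pair this with a prompt instruction like "cite the context blocks you used as [n]" and return the source metadata alongside the answer so users can verify it.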

Tools & Frameworks

| Tool | What It Does |
|---|---|
| LangChain | Full RAG pipeline with many integrations |
| LlamaIndex | Purpose-built for RAG, strong indexing features |
| Haystack | Production RAG pipelines with modular components |
| Unstructured | Document parsing (PDF, DOCX, HTML) |

Previous: 16 - AI Agents | Next: 18 - Evaluations & Testing AI