Context Window

What Is the Context Window?

The context window is the model's working memory — the total amount of text (in tokens) it can see at one time. Everything in the window is visible. Everything outside it doesn't exist to the model.

Analogy: Imagine a desk. The context window is the size of that desk. All your papers (conversation, documents, code) must fit on it. Bigger desk = more documents open at once.

┌─────────────────── Context Window (e.g., 200K tokens) ──────────────────┐
│                                                                         │
│  [System Prompt]  [Message 1]  [Reply 1]  [Message 2]  [Reply 2]  ...   │
│  (hidden from     (user)       (assistant) (user)       (assistant)     │
│   user, always                                                          │
│   present)                                                              │
│                                                                         │
│  ◄────────────── Input Tokens ──────────────► ◄── Output Tokens ──►     │
│                (you pay for these)              (you pay for these)     │
└─────────────────────────────────────────────────────────────────────────┘

Context Window Sizes (2025-2026)

Model               Context Window      In Pages of Text (~500 tokens/page)
GPT-4o              128K tokens         ~256 pages
GPT-4o mini         128K tokens         ~256 pages
Claude Sonnet 4.6   200K tokens         ~400 pages
Claude Opus 4.6     200K → 1M tokens    ~400 → 2,000 pages
Claude Haiku 4.5    200K tokens         ~400 pages
Gemini 2.0 Flash    1M tokens           ~2,000 pages
Gemini 1.5 Pro      2M tokens           ~4,000 pages

1M tokens ≈ an entire codebase, or multiple full-length novels, or thousands of pages of documentation.


What Goes Into the Context Window

Every API call sends the entire conversation to the model:

API Call for Message 5:
  tokens_sent = system_prompt          (~500 tokens)
              + message_1              (~200 tokens)
              + reply_1                (~800 tokens)
              + message_2              (~150 tokens)
              + reply_2                (~600 tokens)
              + message_3              (~300 tokens)
              + reply_3                (~1000 tokens)
              + message_4              (~100 tokens)
              + reply_4                (~500 tokens)
              + message_5              (~200 tokens)
              ─────────────────────────────────────
              Total input: ~4,350 tokens
              + model generates reply_5 (~800 tokens output)

Key insight: You're paying for ALL previous messages EVERY time. Message 20 in a conversation sends messages 1-19 again.
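The resend-everything pattern can be sketched in plain Python. The token counts below are the illustrative numbers from the example above; a real application would measure them with the provider's tokenizer:

```python
def tokens_sent_per_turn(system_tokens, turn_tokens):
    """Cumulative input size of each API call: every call resends the
    system prompt plus the entire conversation so far."""
    totals, history = [], 0
    for t in turn_tokens:
        history += t                    # each turn (user msg or reply) joins the resent history
        totals.append(system_tokens + history)
    return totals

# Token counts from the example above (user messages and replies interleaved).
turns = [200, 800, 150, 600, 300, 1000, 100, 500, 200]
print(tokens_sent_per_turn(500, turns)[-1])  # -> 4350, matching the total above
```

The quadratic growth is visible in the returned list: each entry is larger than the last, even though the user only typed one new message.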


Input Tokens vs Output Tokens

           Input Tokens                                  Output Tokens
What       Everything you send (system prompt +          What the model generates
           history + your message)
Limit      Up to the full context window                 Separate limit (typically 4K-32K)
Cost       Cheaper (e.g., $3/1M for Sonnet)              More expensive (e.g., $15/1M for Sonnet)
Speed      Processed quickly (parallel)                  Generated slowly (sequential,
                                                         ~50-100 tokens/sec)

Max output tokens is separate from context window:

  • Claude: default 4,096, can set up to 64,000 (Sonnet) or 32,000 (Opus)
  • GPT-4o: up to 16,384 output tokens
  • You can request longer outputs via the max_tokens parameter
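A minimal sketch of picking a safe max_tokens value, using the output limits quoted above (the model names and limits here are illustrative; check your provider's documentation for current values):

```python
# Illustrative per-model output limits, taken from the figures in the text.
MAX_OUTPUT = {"claude-sonnet": 64_000, "claude-opus": 32_000, "gpt-4o": 16_384}

def clamp_max_tokens(model, requested, default=4096):
    """Choose a max_tokens value: the requested amount (or the default),
    capped at the model's output limit. Note this limit is separate from
    the context window."""
    limit = MAX_OUTPUT[model]
    return min(requested or default, limit)

print(clamp_max_tokens("gpt-4o", 50_000))       # capped at 16384
print(clamp_max_tokens("claude-sonnet", None))  # falls back to the 4096 default
```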

What Happens When You Exceed the Window

In APIs

  • Error: API returns an error saying input exceeds maximum context length
  • You must: Truncate conversation history, summarize, or remove old messages
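One common fix is a sliding-window truncation. This sketch keeps the system prompt and drops the oldest turns until the history fits; the ~4 chars/token heuristic is an assumption, and a real application would count with the model's tokenizer:

```python
def truncate_history(messages, budget, count_tokens=lambda m: len(m) // 4):
    """Drop the oldest messages until the history fits the token budget.
    The first message (the system prompt) is always kept."""
    system, rest = messages[0], list(messages[1:])
    while rest and count_tokens(system) + sum(map(count_tokens, rest)) > budget:
        rest.pop(0)                     # silently drop the oldest turn
    return [system] + rest

msgs = ["sys" * 40, "old" * 400, "mid" * 200, "new" * 100]
kept = truncate_history(msgs, budget=300)   # the oldest turn gets dropped
```

This is essentially what chat interfaces do automatically, which is why the model appears to "forget" early messages.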

In Chat Interfaces (Claude.ai, ChatGPT)

  • Automatic truncation: Oldest messages silently dropped
  • You notice: Model "forgets" things from earlier in the conversation
  • Claude Code: Automatic compression of older messages when approaching limits

The "Lost in the Middle" Problem

Research (Liu et al., 2023, "Lost in the Middle") shows that models pay less attention to the middle of long contexts:

Attention Level:
  ██████████  Beginning of context (HIGH attention)
  ████        Middle of context (LOWER attention)
  ████████    End of context (HIGH attention)

Practical impact: If you paste 50 files and ask about one, the model performs better if the relevant file is near the beginning or end, not buried in the middle.

Mitigation:

  • Put the most important context at the START or END
  • Explicitly reference what the model should focus on
  • Use RAG to only include relevant chunks instead of everything
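The first mitigation can be sketched as a simple reordering: put the two most relevant chunks at the start and end of the context, where attention is strongest, and bury the rest in the middle. The relevance scores here are assumed inputs (e.g., from a retriever):

```python
def order_for_attention(chunks, scores):
    """Place the top-scored chunk first and the second-best last,
    with everything else in the (lower-attention) middle."""
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda p: -p[0])]
    if len(ranked) < 3:
        return ranked
    return [ranked[0]] + ranked[2:] + [ranked[1]]

docs = ["notes.txt", "bug_report.md", "readme.md", "changelog.md"]
rel  = [0.2, 0.9, 0.7, 0.1]
print(order_for_attention(docs, rel))  # bug_report.md first, readme.md last
```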

Context Window Management Strategies

1. Don't Stuff Everything In

BAD:  Paste entire 10,000-line codebase + "fix the bug"
GOOD: Paste the relevant 50 lines + error message + "fix this null pointer"

2. Summarize Conversation History

For long conversations, periodically ask the model to summarize, then start a new conversation with that summary as context.
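A sketch of that compaction step. The `summarize` callable is hypothetical: in practice it would be an LLM call that condenses the old turns into a short paragraph:

```python
def compact_conversation(messages, summarize, keep_recent=4):
    """Replace older turns with a model-written summary, keeping the
    system prompt and the last few messages verbatim."""
    system, history = messages[0], messages[1:]
    if len(history) <= keep_recent:
        return messages
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = "Summary of earlier conversation: " + summarize(old)
    return [system, summary] + recent

msgs = ["SYSTEM"] + [f"turn {i}" for i in range(10)]
# Stand-in summarizer; a real one would call the model.
compacted = compact_conversation(msgs, summarize=lambda old: f"{len(old)} turns elided")
```

The compacted history costs a few hundred tokens per turn instead of the full transcript.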

3. Use RAG Instead of Pasting

Instead of pasting all your documentation:

User question → Retrieve 5 most relevant chunks → Send only those to LLM

See 17 - RAG (Retrieval-Augmented Generation)
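A toy version of that retrieval step, ranking chunks by word overlap with the question. A production retriever would use embeddings, but the context-window benefit is the same: only a handful of chunks reach the model instead of the whole corpus:

```python
def retrieve(question, chunks, k=5):
    """Rank documentation chunks by word overlap with the question
    and return only the top k."""
    q = set(question.lower().split())
    words = lambda c: set(c.lower().replace(":", " ").split())
    scored = sorted(chunks, key=lambda c: -len(q & words(c)))
    return scored[:k]

docs = [
    "Billing: invoices are emailed monthly",
    "Auth: tokens expire after 24 hours",
    "Deploy: use the staging environment first",
]
print(retrieve("why did my auth token expire", docs, k=1))
```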

4. Hierarchical Context

System Prompt: High-level project context (always present)
Recent Messages: Last 5-10 messages (detailed)
Older Messages: Summarized (compressed)
Relevant Files: Only the ones needed for current task
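The layered scheme above can be sketched as a context builder: the system prompt and summary are always included, recent messages come next, and relevant files are added only while the token budget lasts. The ~4 chars/token count is a rough heuristic:

```python
def build_context(system, summary, recent, files, budget=8000,
                  count=lambda s: len(s) // 4):
    """Assemble layered context within a token budget: always-present
    layers first, then task-relevant files until the budget runs out."""
    parts = [system, summary] + recent
    used = sum(count(p) for p in parts)
    for f in files:
        if used + count(f) > budget:
            break                       # stop adding files once the budget is spent
        parts.append(f)
        used += count(f)
    return parts

ctx = build_context("You are a code assistant.", "Earlier: fixed login bug.",
                    ["user: now the logout path 500s"],
                    ["logout.py " * 500], budget=200)  # file too big for budget
```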

5. Strategic System Prompts

System prompts persist across EVERY message. A 2,000-token system prompt costs 2,000 input tokens on every single turn. Keep them focused.


Context Window and Cost

Example: 30-Message Conversation with Claude Sonnet

Assume avg 300 tokens per message (user + assistant):

Message 1:  300 tokens processed   → $0.0009
Message 5:  1,500 tokens processed → $0.0045
Message 10: 3,000 tokens processed → $0.009
Message 20: 6,000 tokens processed → $0.018
Message 30: 9,000 tokens processed → $0.027
                                      ─────────
Total input: 300 × (1+2+...+30) = 139,500 tokens → ~$0.42
+ Output cost (~300 tokens × 30 = 9,000 tokens):   ~$0.14
= Total conversation: ~$0.55

Now imagine a system prompt of 5,000 tokens:

  • Adds 5,000 × 30 = 150,000 extra input tokens
  • Extra cost: $0.45 just for the system prompt!
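The calculation above, as a small function. Prices are the Sonnet-class rates quoted earlier ($3/1M input, $15/1M output) and are illustrative:

```python
def conversation_cost(n_turns, tokens_per_turn=300,
                      in_price=3.0, out_price=15.0, system_tokens=0):
    """Total dollar cost of an n-turn conversation in which every turn
    resends the whole history. Prices are $ per million tokens."""
    input_tokens = sum(system_tokens + tokens_per_turn * n
                       for n in range(1, n_turns + 1))
    output_tokens = tokens_per_turn * n_turns
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(round(conversation_cost(30), 2))
print(round(conversation_cost(30, system_tokens=5000)
            - conversation_cost(30), 2))  # -> 0.45, the extra cost of a 5K system prompt
```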

Extended Context (1M+ Tokens)

What 1M Tokens Looks Like

  • ~2,000 pages of text (at ~500 tokens/page)
  • roughly 8-10 average-length novels
  • ~50,000 lines of code
  • An entire medium-sized codebase

When to Use Large Context

  • Analyzing entire codebases
  • Processing very long documents (legal, academic)
  • Multi-file code refactoring
  • Comprehensive Q&A over large knowledge bases

When NOT to Use (Even If You Can)

  • When RAG would be more efficient (cheaper, faster)
  • When most of the context is irrelevant
  • When you need the model to focus precisely (lost-in-the-middle risk)

Previous: 02 - Tokens & Tokenization | Next: 04 - Temperature, Top-P & Sampling