Context Window
What Is the Context Window?
The context window is the model's working memory — the total amount of text (in tokens) it can see at one time. Everything in the window is visible. Everything outside it doesn't exist to the model.
Analogy: Imagine a desk. The context window is the size of that desk. All your papers (conversation, documents, code) must fit on it. Bigger desk = more documents open at once.
┌─────────────────── Context Window (e.g., 200K tokens) ──────────────────┐
│ │
│ [System Prompt] [Message 1] [Reply 1] [Message 2] [Reply 2] ... │
│ (hidden from (user) (assistant) (user) (assistant) │
│ user, always │
│ present) │
│ │
│ ◄────────────── Input Tokens ──────────────► ◄── Output Tokens ──► │
│ (you pay for these) (you pay for these) │
└─────────────────────────────────────────────────────────────────────────┘
Context Window Sizes (2025-2026)
| Model | Context Window | In Pages of Text (~500 tokens/page) |
|---|---|---|
| GPT-4o | 128K tokens | ~256 pages |
| GPT-4o mini | 128K tokens | ~256 pages |
| Claude Sonnet 4.6 | 200K tokens | ~400 pages |
| Claude Opus 4.6 | 200K → 1M tokens | ~400 → 2,000 pages |
| Claude Haiku 4.5 | 200K tokens | ~400 pages |
| Gemini 2.0 Flash | 1M tokens | ~2,000 pages |
| Gemini 1.5 Pro | 2M tokens | ~4,000 pages |
1M tokens ≈ an entire codebase, or multiple full-length novels, or thousands of pages of documentation.
What Goes Into the Context Window
Every API call sends the entire conversation to the model:
API Call for Message 5:
tokens_sent = system_prompt (~500 tokens)
+ message_1 (~200 tokens)
+ reply_1 (~800 tokens)
+ message_2 (~150 tokens)
+ reply_2 (~600 tokens)
+ message_3 (~300 tokens)
+ reply_3 (~1000 tokens)
+ message_4 (~100 tokens)
+ reply_4 (~500 tokens)
+ message_5 (~200 tokens)
─────────────────────────────────────
Total input: ~4,350 tokens
+ model generates reply_5 (~800 tokens output)
Key insight: You're paying for ALL previous messages EVERY time. Message 20 in a conversation sends messages 1-19 again.
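The accumulation above can be sketched in a few lines of Python. The token counts are the rough estimates from the breakdown; real counts come from the provider's tokenizer.

```python
# Rough token estimates for each turn, as in the breakdown above.
SYSTEM_PROMPT = 500
turns = [200, 800, 150, 600, 300, 1000, 100, 500]  # messages 1-4 and replies 1-4
new_message = 200  # message 5

# Every API call resends the system prompt plus the full history.
input_tokens = SYSTEM_PROMPT + sum(turns) + new_message
print(input_tokens)  # 4350
```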
Input Tokens vs Output Tokens
| | Input Tokens | Output Tokens |
|---|---|---|
| What | Everything you send (system prompt + history + your message) | What the model generates |
| Limit | Up to full context window | Separate limit (typically 4K-32K) |
| Cost | Cheaper (e.g., $3/1M for Sonnet) | More expensive (e.g., $15/1M for Sonnet) |
| Speed | Processed quickly (parallel) | Generated slowly (sequential, ~50-100 tokens/sec) |
Max output tokens is separate from context window:
- Claude: default 4,096, can set up to 64,000 (Sonnet) or 32,000 (Opus)
- GPT-4o: up to 16,384 output tokens
- You can request longer outputs via the `max_tokens` parameter
What Happens When You Exceed the Window
In APIs
- Error: API returns an error saying input exceeds maximum context length
- You must: Truncate conversation history, summarize, or remove old messages
In Chat Interfaces (Claude.ai, ChatGPT)
- Automatic truncation: Oldest messages silently dropped
- You notice: Model "forgets" things from earlier in the conversation
- Claude Code: Automatic compression of older messages when approaching limits
The "Lost in the Middle" Problem
Research shows models pay less attention to the middle of long contexts:
Attention Level:
██████████ Beginning of context (HIGH attention)
████ Middle of context (LOWER attention)
████████ End of context (HIGH attention)
Practical impact: If you paste 50 files and ask about one, the model performs better if the relevant file is near the beginning or end, not buried in the middle.
Mitigation:
- Put the most important context at the START or END
- Explicitly reference what the model should focus on
- Use RAG to only include relevant chunks instead of everything
Context Window Management Strategies
1. Don't Stuff Everything In
BAD: Paste entire 10,000-line codebase + "fix the bug"
GOOD: Paste the relevant 50 lines + error message + "fix this null pointer"
2. Summarize Conversation History
For long conversations, periodically ask the model to summarize, then start a new conversation with that summary as context.
3. Use RAG Instead of Pasting
Instead of pasting all your documentation:
User question → Retrieve 5 most relevant chunks → Send only those to LLM
See 17 - RAG (Retrieval-Augmented Generation)
4. Hierarchical Context
System Prompt: High-level project context (always present)
Recent Messages: Last 5-10 messages (detailed)
Older Messages: Summarized (compressed)
Relevant Files: Only the ones needed for current task
5. Strategic System Prompts
System prompts persist across EVERY message. A 2,000-token system prompt costs 2,000 input tokens on every single turn. Keep them focused.
Context Window and Cost
Example: 30-Message Conversation with Claude Sonnet
Assume each exchange (user + assistant) adds ~300 tokens of history, so the call for message N processes ~N × 300 input tokens:
Message 1: 300 tokens processed → $0.0009
Message 5: 1,500 tokens processed → $0.0045
Message 10: 3,000 tokens processed → $0.009
Message 20: 6,000 tokens processed → $0.018
Message 30: 9,000 tokens processed → $0.027
─────────
Total input cost across 30 messages: ~$0.42 (139,500 cumulative input tokens)
+ Output cost (~300 tokens × 30): ~$0.14
= Total conversation: ~$0.56
Now imagine a system prompt of 5,000 tokens:
- Adds 5,000 × 30 = 150,000 extra input tokens
- Extra cost: $0.45 just for the system prompt!
Extended Context (1M+ Tokens)
What 1M Tokens Looks Like
- ~2,000 pages of text (at ~500 tokens/page)
- ~8-10 average-length novels
- ~50,000 lines of code
- An entire medium-sized codebase
When to Use Large Context
- Analyzing entire codebases
- Processing very long documents (legal, academic)
- Multi-file code refactoring
- Comprehensive Q&A over large knowledge bases
When NOT to Use (Even If You Can)
- When RAG would be more efficient (cheaper, faster)
- When most of the context is irrelevant
- When you need the model to focus precisely (lost-in-the-middle risk)
Resources
- 🔗 Anthropic — Context Window Docs
- 🔗 Lost in the Middle paper (2023)
- 🎥 ByteByteGo — How ChatGPT Manages Context
- 🔗 OpenAI — Managing Tokens
Previous: 02 - Tokens & Tokenization | Next: 04 - Temperature, Top-P & Sampling