Context Window

What Is the Context Window?

The context window is the model's working memory — the total amount of text (in tokens) it can see at one time. Everything in the window is visible. Everything outside it doesn't exist to the model.

Analogy: Imagine a desk. The context window is the size of that desk. All your papers (conversation, documents, code) must fit on it. Bigger desk = more documents open at once.

┌─────────────────── Context Window (e.g., 200K tokens) ──────────────────┐
│                                                                         │
│  [System Prompt]  [Message 1]  [Reply 1]  [Message 2]  [Reply 2]  ...   │
│  (hidden from     (user)       (assistant) (user)       (assistant)     │
│   user, always                                                          │
│   present)                                                              │
│                                                                         │
│  ◄────────────── Input Tokens ──────────────► ◄── Output Tokens ──►     │
│                (you pay for these)              (you pay for these)     │
└─────────────────────────────────────────────────────────────────────────┘

Context Window Sizes (2025-2026)

Model               Context Window      In Pages of Text (~500 tokens/page)
GPT-4o              128K tokens         ~256 pages
GPT-4o mini         128K tokens         ~256 pages
Claude Sonnet 4.6   200K tokens         ~400 pages
Claude Opus 4.6     200K → 1M tokens    ~400 → 2,000 pages
Claude Haiku 4.5    200K tokens         ~400 pages
Gemini 2.0 Flash    1M tokens           ~2,000 pages
Gemini 1.5 Pro      2M tokens           ~4,000 pages

1M tokens ≈ an entire codebase, or multiple full-length novels, or thousands of pages of documentation.


What Goes Into the Context Window

Every API call sends the entire conversation to the model:

API Call for Message 5:
  tokens_sent = system_prompt          (~500 tokens)
              + message_1              (~200 tokens)
              + reply_1                (~800 tokens)
              + message_2              (~150 tokens)
              + reply_2                (~600 tokens)
              + message_3              (~300 tokens)
              + reply_3                (~1000 tokens)
              + message_4              (~100 tokens)
              + reply_4                (~500 tokens)
              + message_5              (~200 tokens)
              ─────────────────────────────────────
              Total input: ~4,350 tokens
              + model generates reply_5 (~800 tokens output)

Key insight: You're paying for ALL previous messages EVERY time. Message 20 in a conversation sends messages 1-19 again.
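The resend-everything pattern can be sketched in plain Python. The token counts below are the illustrative numbers from the example above; a real application would measure them with the provider's tokenizer:

```python
def tokens_sent_per_turn(system_tokens, turn_tokens):
    """Cumulative input size of each API call: every call resends the
    system prompt plus the entire conversation so far."""
    totals, history = [], 0
    for t in turn_tokens:
        history += t                    # each turn (user msg or reply) joins the resent history
        totals.append(system_tokens + history)
    return totals

# Token counts from the example above (user messages and replies interleaved).
turns = [200, 800, 150, 600, 300, 1000, 100, 500, 200]
print(tokens_sent_per_turn(500, turns)[-1])  # -> 4350, matching the total above
```

The quadratic growth is visible in the returned list: each entry is larger than the last, even though the user only typed one new message.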


Input Tokens vs Output Tokens

           Input Tokens                                  Output Tokens
What       Everything you send (system prompt +          What the model generates
           history + your message)
Limit      Up to the full context window                 Separate limit (typically 4K-32K)
Cost       Cheaper (e.g., $3/1M for Sonnet)              More expensive (e.g., $15/1M for Sonnet)
Speed      Processed quickly (parallel)                  Generated slowly (sequential,
                                                         ~50-100 tokens/sec)

Max output tokens is separate from context window:

  • Claude: default 4,096, can set up to 64,000 (Sonnet) or 32,000 (Opus)
  • GPT-4o: up to 16,384 output tokens
  • You can request longer outputs via the max_tokens parameter
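A minimal sketch of picking a safe max_tokens value, using the output limits quoted above (the model names and limits here are illustrative; check your provider's documentation for current values):

```python
# Illustrative per-model output limits, taken from the figures in the text.
MAX_OUTPUT = {"claude-sonnet": 64_000, "claude-opus": 32_000, "gpt-4o": 16_384}

def clamp_max_tokens(model, requested, default=4096):
    """Choose a max_tokens value: the requested amount (or the default),
    capped at the model's output limit. Note this limit is separate from
    the context window."""
    limit = MAX_OUTPUT[model]
    return min(requested or default, limit)

print(clamp_max_tokens("gpt-4o", 50_000))       # capped at 16384
print(clamp_max_tokens("claude-sonnet", None))  # falls back to the 4096 default
```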

What Happens When You Exceed the Window

In APIs

  • Error: API returns an error saying input exceeds maximum context length
  • You must: Truncate conversation history, summarize, or remove old messages
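One common fix is a sliding-window truncation. This sketch keeps the system prompt and drops the oldest turns until the history fits; the ~4 chars/token heuristic is an assumption, and a real application would count with the model's tokenizer:

```python
def truncate_history(messages, budget, count_tokens=lambda m: len(m) // 4):
    """Drop the oldest messages until the history fits the token budget.
    The first message (the system prompt) is always kept."""
    system, rest = messages[0], list(messages[1:])
    while rest and count_tokens(system) + sum(map(count_tokens, rest)) > budget:
        rest.pop(0)                     # silently drop the oldest turn
    return [system] + rest

msgs = ["sys" * 40, "old" * 400, "mid" * 200, "new" * 100]
kept = truncate_history(msgs, budget=300)   # the oldest turn gets dropped
```

This is essentially what chat interfaces do automatically, which is why the model appears to "forget" early messages.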

In Chat Interfaces (Claude.ai, ChatGPT)

  • Automatic truncation: Oldest messages silently dropped
  • You notice: Model "forgets" things from earlier in the conversation
  • Claude Code: Automatic compression of older messages when approaching limits

The "Lost in the Middle" Problem

Research (Liu et al., 2023, "Lost in the Middle") shows that models pay less attention to the middle of long contexts:

Attention Level:
  ██████████  Beginning of context (HIGH attention)
  ████        Middle of context (LOWER attention)
  ████████    End of context (HIGH attention)

Practical impact: If you paste 50 files and ask about one, the model performs better if the relevant file is near the beginning or end, not buried in the middle.

Mitigation:

  • Put the most important context at the START or END
  • Explicitly reference what the model should focus on
  • Use RAG to only include relevant chunks instead of everything
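The first mitigation can be sketched as a simple reordering: put the two most relevant chunks at the start and end of the context, where attention is strongest, and bury the rest in the middle. The relevance scores here are assumed inputs (e.g., from a retriever):

```python
def order_for_attention(chunks, scores):
    """Place the top-scored chunk first and the second-best last,
    with everything else in the (lower-attention) middle."""
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda p: -p[0])]
    if len(ranked) < 3:
        return ranked
    return [ranked[0]] + ranked[2:] + [ranked[1]]

docs = ["notes.txt", "bug_report.md", "readme.md", "changelog.md"]
rel  = [0.2, 0.9, 0.7, 0.1]
print(order_for_attention(docs, rel))  # bug_report.md first, readme.md last
```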

Context Window Management Strategies

1. Don't Stuff Everything In

BAD:  Paste entire 10,000-line codebase + "fix the bug"
GOOD: Paste the relevant 50 lines + error message + "fix this null pointer"

2. Summarize Conversation History

For long conversations, periodically ask the model to summarize, then start a new conversation with that summary as context.
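A sketch of that compaction step. The `summarize` callable is hypothetical: in practice it would be an LLM call that condenses the old turns into a short paragraph:

```python
def compact_conversation(messages, summarize, keep_recent=4):
    """Replace older turns with a model-written summary, keeping the
    system prompt and the last few messages verbatim."""
    system, history = messages[0], messages[1:]
    if len(history) <= keep_recent:
        return messages
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = "Summary of earlier conversation: " + summarize(old)
    return [system, summary] + recent

msgs = ["SYSTEM"] + [f"turn {i}" for i in range(10)]
# Stand-in summarizer; a real one would call the model.
compacted = compact_conversation(msgs, summarize=lambda old: f"{len(old)} turns elided")
```

The compacted history costs a few hundred tokens per turn instead of the full transcript.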

3. Use RAG Instead of Pasting

Instead of pasting all your documentation:

User question → Retrieve 5 most relevant chunks → Send only those to LLM

See 17 - RAG (Retrieval-Augmented Generation)
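A toy version of that retrieval step, ranking chunks by word overlap with the question. A production retriever would use embeddings, but the context-window benefit is the same: only a handful of chunks reach the model instead of the whole corpus:

```python
def retrieve(question, chunks, k=5):
    """Rank documentation chunks by word overlap with the question
    and return only the top k."""
    q = set(question.lower().split())
    words = lambda c: set(c.lower().replace(":", " ").split())
    scored = sorted(chunks, key=lambda c: -len(q & words(c)))
    return scored[:k]

docs = [
    "Billing: invoices are emailed monthly",
    "Auth: tokens expire after 24 hours",
    "Deploy: use the staging environment first",
]
print(retrieve("why did my auth token expire", docs, k=1))
```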

4. Hierarchical Context

System Prompt: High-level project context (always present)
Recent Messages: Last 5-10 messages (detailed)
Older Messages: Summarized (compressed)
Relevant Files: Only the ones needed for current task
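The layered scheme above can be sketched as a context builder: the system prompt and summary are always included, recent messages come next, and relevant files are added only while the token budget lasts. The ~4 chars/token count is a rough heuristic:

```python
def build_context(system, summary, recent, files, budget=8000,
                  count=lambda s: len(s) // 4):
    """Assemble layered context within a token budget: always-present
    layers first, then task-relevant files until the budget runs out."""
    parts = [system, summary] + recent
    used = sum(count(p) for p in parts)
    for f in files:
        if used + count(f) > budget:
            break                       # stop adding files once the budget is spent
        parts.append(f)
        used += count(f)
    return parts

ctx = build_context("You are a code assistant.", "Earlier: fixed login bug.",
                    ["user: now the logout path 500s"],
                    ["logout.py " * 500], budget=200)  # file too big for budget
```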

5. Strategic System Prompts

System prompts persist across EVERY message. A 2,000-token system prompt costs 2,000 input tokens on every single turn. Keep them focused.


Context Window and Cost

Example: 30-Message Conversation with Claude Sonnet

Assume avg 300 tokens per message (user + assistant):

Message 1:  300 tokens processed   → $0.0009
Message 5:  1,500 tokens processed → $0.0045
Message 10: 3,000 tokens processed → $0.009
Message 20: 6,000 tokens processed → $0.018
Message 30: 9,000 tokens processed → $0.027
                                      ─────────
Total input: 300 × (1+2+...+30) = 139,500 tokens → ~$0.42
+ Output cost (~300 tokens × 30 = 9,000 tokens):   ~$0.14
= Total conversation: ~$0.55

Now imagine a system prompt of 5,000 tokens:

  • Adds 5,000 × 30 = 150,000 extra input tokens
  • Extra cost: $0.45 just for the system prompt!
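The calculation above, as a small function. Prices are the Sonnet-class rates quoted earlier ($3/1M input, $15/1M output) and are illustrative:

```python
def conversation_cost(n_turns, tokens_per_turn=300,
                      in_price=3.0, out_price=15.0, system_tokens=0):
    """Total dollar cost of an n-turn conversation in which every turn
    resends the whole history. Prices are $ per million tokens."""
    input_tokens = sum(system_tokens + tokens_per_turn * n
                       for n in range(1, n_turns + 1))
    output_tokens = tokens_per_turn * n_turns
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(round(conversation_cost(30), 2))
print(round(conversation_cost(30, system_tokens=5000)
            - conversation_cost(30), 2))  # -> 0.45, the extra cost of a 5K system prompt
```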

Extended Context (1M+ Tokens)

What 1M Tokens Looks Like

  • ~2,000 pages of text (at ~500 tokens/page)
  • roughly 8-10 average-length novels
  • ~50,000 lines of code
  • An entire medium-sized codebase

When to Use Large Context

  • Analyzing entire codebases
  • Processing very long documents (legal, academic)
  • Multi-file code refactoring
  • Comprehensive Q&A over large knowledge bases

When NOT to Use (Even If You Can)

  • When RAG would be more efficient (cheaper, faster)
  • When most of the context is irrelevant
  • When you need the model to focus precisely (lost-in-the-middle risk)

Previous: 02 - Tokens & Tokenization | Next: 04 - Temperature, Top-P & Sampling