Temperature, Top-P & Sampling
How Does the Model Actually Generate Text?
The model doesn't "write" text. It generates one token at a time by:
- Looking at all tokens so far
- Computing a probability distribution over the entire vocabulary (~100K tokens)
- Sampling one token from that distribution
- Adding it to the sequence, then repeating
Input: "The best programming language is"
Model's probability distribution:
"Python" → 25%
" Python" → 20%
" JavaScript" → 12%
" Rust" → 8%
" C" → 5%
" Go" → 4%
... (100K tokens, most near 0%)
The question is: how do we pick from these probabilities? That's where temperature, top-p, and top-k come in.
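The generation loop above can be sketched in plain Python. The `next_token_distribution` function is a stand-in for a real model's forward pass (an illustrative assumption, not a real API), which would return probabilities over the entire vocabulary:

```python
import random

def next_token_distribution(tokens):
    # Stand-in for a real model forward pass: a real model would compute
    # a probability distribution over its entire ~100K-token vocabulary.
    return {" Python": 0.45, " JavaScript": 0.25, " Rust": 0.15, ".": 0.15}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Steps 1-2: look at all tokens so far, get the distribution
        dist = next_token_distribution(tokens)
        # Step 3: sample one token proportionally to its probability
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        # Step 4: append it and repeat
        tokens.append(token)
    return tokens

print(generate(["The", " best", " programming", " language", " is"]))
```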
Temperature
Controls how "spread out" or "peaked" the probability distribution is.
Temperature = 0 (Deterministic)
Probabilities as temperature approaches 0:
"Python" → 99.9% ← always pick this
" Python" → 0.1%
everything else → ~0%
- Always picks the most likely token (greedy decoding)
- Deterministic — same input = same output every time
- Best for: factual answers, code, math, structured output
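In code, temperature 0 skips sampling entirely and reduces to an argmax (a minimal sketch):

```python
def greedy_pick(probs):
    # Temperature 0: always take the single most likely token
    return max(probs, key=probs.get)

print(greedy_pick({"Python": 0.25, " Python": 0.20, " JavaScript": 0.12}))
# Always "Python", run after run
```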
Temperature = 0.5 (Balanced)
Probabilities after temperature 0.5:
"Python" → 45% ← usually picks this
" Python" → 30% ← sometimes this
" JavaScript" → 12%
" Rust" → 7%
...
- Mostly picks likely tokens but allows some variety
- Best for: general conversation, explanations
Temperature = 1.0 (Original Distribution)
Probabilities unchanged:
"Python" → 25%
" Python" → 20%
" JavaScript" → 12%
" Rust" → 8%
...
- Samples proportionally — more diverse outputs
- Best for: creative writing, brainstorming
Temperature > 1.0 (Chaotic)
Probabilities after temperature 1.5:
"Python" → 15%
" Python" → 13%
" JavaScript" → 11%
" Rust" → 9%
" the" → 5% ← random words creep in
...
- Flattens distribution — unlikely tokens become more likely
- Often produces incoherent, random text
- Rarely useful
The Math (Simple Version)
adjusted_probability[i] = original_probability[i] ^ (1/temperature)
(then normalize so everything sums to 1; in practice, implementations divide the logits by the temperature before the softmax, which is mathematically the same thing)
- Low temperature → exaggerates differences (likely tokens become MORE likely)
- High temperature → reduces differences (everything becomes similarly likely)
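The formula translates directly into Python; the toy four-token distribution below is illustrative:

```python
def apply_temperature(probs, temperature):
    # Raise each probability to 1/T, then renormalize (equivalent to
    # dividing the logits by T before the softmax).
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12, " Rust": 0.08}
for t in (0.5, 1.0, 1.5):
    top = apply_temperature(probs, t)["Python"]
    print(f"T={t}: P(Python) = {top:.2f}")
```

As T rises, the probability of the top token falls: the distribution flattens, exactly as described above.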
Top-P (Nucleus Sampling)
Instead of adjusting all probabilities, only sample from the smallest set of tokens whose cumulative probability reaches P (e.g. P = 0.9 keeps the top 90% of the probability mass).
Top-P = 0.9
Token Probability Cumulative
"Python" 25% 25%
" Python" 20% 45%
" JavaScript" 12% 57%
" Rust" 8% 65%
" C" 5% 70%
" Go" 4% 74%
" TypeScript" 3% 77%
" Java" 3% 80%
" the" 2% 82%
" Kotlin" 2% 84%
" Swift" 1.5% 85.5%
" arguably" 1.5% 87%
" C++" 1% 88%
" definitely" 1% 89%
" none" 0.5% 89.5%
" Haskell" 0.5% 90% ← CUTOFF (cumulative reached 90%)
─── Everything below is EXCLUDED ───
" PHP" 0.3% ✘
" assembly" 0.1% ✘
" banana" 0.0001% ✘ (would never happen but is technically possible)
Benefits: Removes very unlikely (potentially nonsensical) tokens, while dynamically adjusting how many tokens to consider based on the model's confidence.
- High confidence (one token has 95%) → nucleus is tiny (1-2 tokens)
- Low confidence (spread across many) → nucleus is large (many tokens)
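A minimal nucleus-sampling filter, following the cumulative-cutoff table above:

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p, then renormalize over that "nucleus".
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

# High confidence: the nucleus collapses to a single token
print(top_p_filter({"a": 0.95, "b": 0.03, "c": 0.02}, p=0.9))
# Low confidence: the nucleus stays wide
print(top_p_filter({"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}, p=0.9))
```

The two calls at the bottom demonstrate the adaptive behavior: the same `p` keeps one token in the confident case and all four in the uncertain one.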
Top-K
Simpler than Top-P: only consider the K most likely tokens.
Top-K = 5:
Only consider: "Python", " Python", " JavaScript", " Rust", " C"
Everything else: excluded
Disadvantage: Fixed K regardless of distribution. When the model is very confident, K=50 still includes garbage. When uncertain, K=5 might exclude good options.
Top-P is generally preferred because it adapts dynamically.
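By comparison, top-k is just a fixed-size slice (a minimal sketch):

```python
def top_k_filter(probs, k=5):
    # Keep only the k most likely tokens, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12,
         " Rust": 0.08, " C": 0.05, " Go": 0.04, " banana": 0.0001}
print(top_k_filter(probs, k=5))  # " Go" and " banana" are cut
```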
How They Work Together
API call:
temperature = 0.7
top_p = 0.95
top_k = 50 (some APIs)
Step 1: Apply temperature to adjust distribution
Step 2: Apply top-k to keep only top 50 tokens
Step 3: Apply top-p to keep only tokens summing to 95%
Step 4: Sample from remaining tokens
Best practice: tune either temperature or top-p and leave the other at its default; pushing both to extremes compounds their effects unpredictably.
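The four steps can be chained in one self-contained sketch (toy distribution; real implementations operate on logits over the full vocabulary):

```python
import random

def sample(probs, temperature=0.7, top_k=50, top_p=0.95):
    # Step 1: temperature reshapes the distribution (p ** (1/T))
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    # Step 2: top-k keeps only the k most likely tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Step 3: top-p keeps the smallest prefix covering p of the remaining mass
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Step 4: sample proportionally from whatever survived
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12, " Rust": 0.08}
print(sample(probs))
```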
Practical Settings
| Use Case | Temperature | Top-P | Why |
|---|---|---|---|
| Code generation | 0 - 0.2 | 0.95 | Correctness over creativity |
| Factual Q&A | 0 | 1.0 | Want the most likely answer |
| General conversation | 0.5 - 0.7 | 0.95 | Natural but focused |
| Creative writing | 0.8 - 1.0 | 0.95 | Diverse, interesting language |
| Brainstorming | 0.9 - 1.0 | 0.99 | Want novel ideas |
| JSON/structured output | 0 | 1.0 | Must be valid format |
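As plain request parameters, a few rows of the table might look like this (the `temperature` / `top_p` / `max_tokens` names follow the common API convention; check your provider's reference for the exact fields):

```python
# Hypothetical request payloads for three rows of the table above
code_generation  = {"temperature": 0.1, "top_p": 0.95, "max_tokens": 2048}
creative_writing = {"temperature": 0.9, "top_p": 0.95, "max_tokens": 2048}
structured_json  = {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 1024}

for name, params in [("code", code_generation),
                     ("creative", creative_writing),
                     ("json", structured_json)]:
    print(name, params)
```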
Extended Thinking
Modern models like Claude offer extended thinking — the model reasons internally before responding.
Without thinking:
User: "What's 127 × 84?" → Model: "10,668" (one-shot guess, might be wrong)
With thinking:
User: "What's 127 × 84?"
Model (thinking): "Let me break this down... 127 × 80 = 10,160, 127 × 4 = 508,
10,160 + 508 = 10,668"
Model (output): "10,668" (reasoned step by step, more reliable)
- Thinking tokens use the model's context but are shown separately
- Think of it as "chain-of-thought inside the model"
- Useful for math, logic, complex reasoning, planning
- Claude: enabled by default, uses up to 32K thinking tokens
- Toggle with Option+T (macOS) / Alt+T (Windows)
Other Sampling Parameters
| Parameter | What It Does | Common Values |
|---|---|---|
| max_tokens | Maximum tokens in the response | 1024, 4096, 8192 |
| stop_sequences | Stop generating when these strings appear | ["\n\n", "END"] |
| frequency_penalty | Penalize tokens that appear often (reduce repetition) | 0.0 - 1.0 |
| presence_penalty | Penalize tokens that appeared at all (encourage novelty) | 0.0 - 1.0 |
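One common formulation of the two penalties (used, for example, by the OpenAI API) subtracts from the raw logits before sampling; a sketch under that assumption:

```python
def apply_penalties(logits, generated,
                    frequency_penalty=0.5, presence_penalty=0.5):
    # frequency_penalty scales with how often a token has already appeared;
    # presence_penalty is a flat deduction for appearing at all.
    counts = {}
    for tok in generated:
        counts[tok] = counts.get(tok, 0) + 1
    adjusted = {}
    for tok, logit in logits.items():
        count = counts.get(tok, 0)
        penalty = count * frequency_penalty + (presence_penalty if count else 0.0)
        adjusted[tok] = logit - penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "dog": 1.0}
print(apply_penalties(logits, generated=["the", "the", "the", "cat"]))
# "the" is penalized most; "dog" (never generated) is untouched
```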
Resources
- 🔗 Anthropic — Sampling Parameters
- 🔗 Cohere — Temperature and Sampling
- 🎥 Jay Alammar — The Illustrated GPT-2 (Visualizing Sampling)
- 🔗 Hugging Face — How to Generate Text
Previous: 03 - Context Window | Next: 05 - Embeddings & Vector Search