Temperature, Top-P & Sampling
How Does the Model Actually Generate Text?
The model doesn't "write" text. It generates one token at a time by:
- Looking at all tokens so far
- Computing a probability distribution over the entire vocabulary (~100K tokens)
- Sampling one token from that distribution
- Adding it to the sequence, then repeating
Input: "The best programming language is"
Model's probability distribution:
"Python" → 25%
" Python" → 20%
" JavaScript" → 12%
" Rust" → 8%
" C" → 5%
" Go" → 4%
... (100K tokens, most near 0%)
The question is: how do we pick from these probabilities? That's where temperature, top-p, and top-k come in.
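The generation loop above can be sketched in plain Python. The `next_token_distribution` function is a stand-in for a real model's forward pass (an illustrative assumption, not a real API), which would return probabilities over the entire vocabulary:

```python
import random

def next_token_distribution(tokens):
    # Stand-in for a real model forward pass: a real model would compute
    # a probability distribution over its entire ~100K-token vocabulary.
    return {" Python": 0.45, " JavaScript": 0.25, " Rust": 0.15, ".": 0.15}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Steps 1-2: look at all tokens so far, get the distribution
        dist = next_token_distribution(tokens)
        # Step 3: sample one token proportionally to its probability
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        # Step 4: append it and repeat
        tokens.append(token)
    return tokens

print(generate(["The", " best", " programming", " language", " is"]))
```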
Temperature
Controls how "spread out" or "peaked" the probability distribution is.
Temperature = 0 (Deterministic)
Probabilities as temperature approaches 0:
"Python" → 99.9% ← always pick this
" Python" → 0.1%
everything else → ~0%
- Always picks the most likely token (greedy decoding)
- Deterministic — same input = same output every time
- Best for: factual answers, code, math, structured output
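In code, temperature 0 skips sampling entirely and reduces to an argmax (a minimal sketch):

```python
def greedy_pick(probs):
    # Temperature 0: always take the single most likely token
    return max(probs, key=probs.get)

print(greedy_pick({"Python": 0.25, " Python": 0.20, " JavaScript": 0.12}))
# Always "Python", run after run
```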
Temperature = 0.5 (Balanced)
Probabilities after temperature 0.5:
"Python" → 45% ← usually picks this
" Python" → 30% ← sometimes this
" JavaScript" → 12%
" Rust" → 7%
...
- Mostly picks likely tokens but allows some variety
- Best for: general conversation, explanations
Temperature = 1.0 (Original Distribution)
Probabilities unchanged:
"Python" → 25%
" Python" → 20%
" JavaScript" → 12%
" Rust" → 8%
...
- Samples proportionally — more diverse outputs
- Best for: creative writing, brainstorming
Temperature > 1.0 (Chaotic)
Probabilities after temperature 1.5:
"Python" → 15%
" Python" → 13%
" JavaScript" → 11%
" Rust" → 9%
" the" → 5% ← random words creep in
...
- Flattens distribution — unlikely tokens become more likely
- Often produces incoherent, random text
- Rarely useful
The Math (Simple Version)
adjusted_probability[i] = original_probability[i] ^ (1/temperature)
(then normalize so everything sums to 1; in practice, implementations divide the logits by the temperature before the softmax, which is mathematically the same thing)
- Low temperature → exaggerates differences (likely tokens become MORE likely)
- High temperature → reduces differences (everything becomes similarly likely)
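The formula translates directly into Python; the toy four-token distribution below is illustrative:

```python
def apply_temperature(probs, temperature):
    # Raise each probability to 1/T, then renormalize (equivalent to
    # dividing the logits by T before the softmax).
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12, " Rust": 0.08}
for t in (0.5, 1.0, 1.5):
    top = apply_temperature(probs, t)["Python"]
    print(f"T={t}: P(Python) = {top:.2f}")
```

As T rises, the probability of the top token falls: the distribution flattens, exactly as described above.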
Top-P (Nucleus Sampling)
Instead of adjusting all probabilities, only sample from the smallest set of tokens whose cumulative probability reaches P (e.g. P = 0.9 keeps the top 90% of the probability mass).
Top-P = 0.9
Token Probability Cumulative
"Python" 25% 25%
" Python" 20% 45%
" JavaScript" 12% 57%
" Rust" 8% 65%
" C" 5% 70%
" Go" 4% 74%
" TypeScript" 3% 77%
" Java" 3% 80%
" the" 2% 82%
" Kotlin" 2% 84%
" Swift" 1.5% 85.5%
" arguably" 1.5% 87%
" C++" 1% 88%
" definitely" 1% 89%
" none" 0.5% 89.5%
" Haskell" 0.5% 90% ← CUTOFF (cumulative reached 90%)
─── Everything below is EXCLUDED ───
" PHP" 0.3% ✘
" assembly" 0.1% ✘
" banana" 0.0001% ✘ (would never happen but is technically possible)
Benefits: Removes very unlikely (potentially nonsensical) tokens, while dynamically adjusting how many tokens to consider based on the model's confidence.
- High confidence (one token has 95%) → nucleus is tiny (1-2 tokens)
- Low confidence (spread across many) → nucleus is large (many tokens)
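A minimal nucleus-sampling filter, following the cumulative-cutoff table above:

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p, then renormalize over that "nucleus".
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

# High confidence: the nucleus collapses to a single token
print(top_p_filter({"a": 0.95, "b": 0.03, "c": 0.02}, p=0.9))
# Low confidence: the nucleus stays wide
print(top_p_filter({"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}, p=0.9))
```

The two calls at the bottom demonstrate the adaptive behavior: the same `p` keeps one token in the confident case and all four in the uncertain one.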
Top-K
Simpler than Top-P: only consider the K most likely tokens.
Top-K = 5:
Only consider: "Python", " Python", " JavaScript", " Rust", " C"
Everything else: excluded
Disadvantage: Fixed K regardless of distribution. When the model is very confident, K=50 still includes garbage. When uncertain, K=5 might exclude good options.
Top-P is generally preferred because it adapts dynamically.
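By comparison, top-k is just a fixed-size slice (a minimal sketch):

```python
def top_k_filter(probs, k=5):
    # Keep only the k most likely tokens, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12,
         " Rust": 0.08, " C": 0.05, " Go": 0.04, " banana": 0.0001}
print(top_k_filter(probs, k=5))  # " Go" and " banana" are cut
```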
How They Work Together
API call:
temperature = 0.7
top_p = 0.95
top_k = 50 (some APIs)
Step 1: Apply temperature to adjust distribution
Step 2: Apply top-k to keep only top 50 tokens
Step 3: Apply top-p to keep only tokens summing to 95%
Step 4: Sample from remaining tokens
Best practice: tune either temperature or top-p and leave the other at its default; pushing both to extremes compounds their effects unpredictably.
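The four steps can be chained in one self-contained sketch (toy distribution; real implementations operate on logits over the full vocabulary):

```python
import random

def sample(probs, temperature=0.7, top_k=50, top_p=0.95):
    # Step 1: temperature reshapes the distribution (p ** (1/T))
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    # Step 2: top-k keeps only the k most likely tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Step 3: top-p keeps the smallest prefix covering p of the remaining mass
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Step 4: sample proportionally from whatever survived
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

probs = {"Python": 0.25, " Python": 0.20, " JavaScript": 0.12, " Rust": 0.08}
print(sample(probs))
```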
Practical Settings
| Use Case | Temperature | Top-P | Why |
|---|---|---|---|
| Code generation | 0 - 0.2 | 0.95 | Correctness over creativity |
| Factual Q&A | 0 | 1.0 | Want the most likely answer |
| General conversation | 0.5 - 0.7 | 0.95 | Natural but focused |
| Creative writing | 0.8 - 1.0 | 0.95 | Diverse, interesting language |
| Brainstorming | 0.9 - 1.0 | 0.99 | Want novel ideas |
| JSON/structured output | 0 | 1.0 | Must be valid format |
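As plain request parameters, a few rows of the table might look like this (the `temperature` / `top_p` / `max_tokens` names follow the common API convention; check your provider's reference for the exact fields):

```python
# Hypothetical request payloads for three rows of the table above
code_generation  = {"temperature": 0.1, "top_p": 0.95, "max_tokens": 2048}
creative_writing = {"temperature": 0.9, "top_p": 0.95, "max_tokens": 2048}
structured_json  = {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 1024}

for name, params in [("code", code_generation),
                     ("creative", creative_writing),
                     ("json", structured_json)]:
    print(name, params)
```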
Extended Thinking
Modern models like Claude offer extended thinking — the model reasons internally before responding.
Without thinking:
User: "What's 127 × 84?" → Model: "10,668" (one-shot guess, might be wrong)
With thinking:
User: "What's 127 × 84?"
Model (thinking): "Let me break this down... 127 × 80 = 10,160, 127 × 4 = 508,
10,160 + 508 = 10,668"
Model (output): "10,668" (reasoned step by step, more reliable)
- Thinking tokens use the model's context but are shown separately
- Think of it as "chain-of-thought inside the model"
- Useful for math, logic, complex reasoning, planning
- Claude: enabled by default, uses up to 32K thinking tokens
- Toggle with Option+T (macOS) / Alt+T (Windows)
Other Sampling Parameters
| Parameter | What It Does | Common Values |
|---|---|---|
| max_tokens | Maximum tokens in the response | 1024, 4096, 8192 |
| stop_sequences | Stop generating when these strings appear | ["\n\n", "END"] |
| frequency_penalty | Penalize tokens that appear often (reduce repetition) | 0.0 - 1.0 |
| presence_penalty | Penalize tokens that appeared at all (encourage novelty) | 0.0 - 1.0 |
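One common formulation of the two penalties (used, for example, by the OpenAI API) subtracts from the raw logits before sampling; a sketch under that assumption:

```python
def apply_penalties(logits, generated,
                    frequency_penalty=0.5, presence_penalty=0.5):
    # frequency_penalty scales with how often a token has already appeared;
    # presence_penalty is a flat deduction for appearing at all.
    counts = {}
    for tok in generated:
        counts[tok] = counts.get(tok, 0) + 1
    adjusted = {}
    for tok, logit in logits.items():
        count = counts.get(tok, 0)
        penalty = count * frequency_penalty + (presence_penalty if count else 0.0)
        adjusted[tok] = logit - penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "dog": 1.0}
print(apply_penalties(logits, generated=["the", "the", "the", "cat"]))
# "the" is penalized most; "dog" (never generated) is untouched
```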
Resources
- 🔗 Anthropic — Sampling Parameters
- 🔗 Cohere — Temperature and Sampling
- 🎥 Jay Alammar — The Illustrated GPT-2 (Visualizing Sampling)
- 🔗 Hugging Face — How to Generate Text
Previous: 03 - Context Window | Next: 05 - Embeddings & Vector Search