Temperature, Top-P & Sampling

How Does the Model Actually Generate Text?

The model doesn't "write" text. It generates one token at a time by:

  1. Looking at all tokens so far
  2. Computing a probability distribution over the entire vocabulary (~100K tokens)
  3. Sampling one token from that distribution
  4. Adding it to the sequence, then repeating

Input: "The best programming language is"

Model's probability distribution:
  "Python"      → 25%
  " Python"     → 20%
  " JavaScript" → 12%
  " Rust"       → 8%
  " C"          → 5%
  " Go"         → 4%
  ... (100K tokens, most near 0%; note that "Python" and " Python", with a leading space, are distinct tokens)

The question is: how do we pick from these probabilities? That's where temperature, top-p, and top-k come in.
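The four-step loop above can be sketched in a few lines of Python. The `next_token_distribution` function below is a toy stand-in for the model (a real model scores ~100K tokens based on everything generated so far); its name and the hard-coded probabilities are made up for illustration.

```python
import random

# Toy stand-in for the model: returns a probability distribution
# over a tiny "vocabulary" instead of ~100K tokens.
def next_token_distribution(tokens):
    return {" Python": 0.45, " JavaScript": 0.25, " Rust": 0.20, " Go": 0.10}

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)       # steps 1-2: distribution over vocab
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])  # step 3: sample
        # step 4: the appended token is now part of the input for the next round
    return tokens
```

`random.choices` draws proportionally to the weights; the sampling strategies below all work by reshaping those weights before the draw.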


Temperature

Controls how "spread out" or "peaked" the probability distribution is.

Temperature = 0 (Deterministic)

Probabilities as temperature approaches 0 (T = 0 is treated as greedy decoding):
  "Python"      → 99.9%  ← always pick this
  " Python"     → 0.1%
  everything else → ~0%
  • Always picks the most likely token
  • Deterministic — same input = same output every time
  • Best for: factual answers, code, math, structured output

Temperature = 0.5 (Balanced)

Probabilities after temperature 0.5:
  "Python"      → 45%   ← usually picks this
  " Python"     → 30%   ← sometimes this
  " JavaScript" → 12%
  " Rust"       → 7%
  ...
  • Mostly picks likely tokens but allows some variety
  • Best for: general conversation, explanations

Temperature = 1.0 (Original Distribution)

Probabilities unchanged:
  "Python"      → 25%
  " Python"     → 20%
  " JavaScript" → 12%
  " Rust"       → 8%
  ...
  • Samples proportionally — more diverse outputs
  • Best for: creative writing, brainstorming

Temperature > 1.0 (Chaotic)

Probabilities after temperature 1.5:
  "Python"      → 15%
  " Python"     → 13%
  " JavaScript" → 11%
  " Rust"       → 9%
  " the"        → 5%   ← random words creep in
  ...
  • Flattens distribution — unlikely tokens become more likely
  • Often produces incoherent, random text
  • Rarely useful

The Math (Simple Version)

adjusted_probability[i] = original_probability[i] ^ (1/temperature)
(then normalize so everything sums to 1)

In practice, temperature divides the model's raw scores (logits) before the softmax; the formula above describes the equivalent effect on the probabilities.
  • Low temperature → exaggerates differences (likely tokens become MORE likely)
  • High temperature → reduces differences (everything becomes similarly likely)
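The formula can be checked directly in Python. This sketch uses the top of the example distribution (restricted to four tokens for simplicity) to show the sharpening and flattening effects:

```python
def apply_temperature(probs, temperature):
    # Equivalent to softmax(logits / T): raise each probability to
    # the power 1/T, then renormalize so everything sums to 1.
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.25, 0.20, 0.12, 0.08]        # top of the example distribution
low = apply_temperature(probs, 0.5)     # sharper: top token gains mass
high = apply_temperature(probs, 1.5)    # flatter: gaps between tokens shrink
```

Running this shows the top token's share growing at low temperature and shrinking toward uniform at high temperature, matching the tables above.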

Top-P (Nucleus Sampling)

Instead of adjusting all probabilities, only consider the smallest set of top tokens whose cumulative probability reaches P (e.g. P = 0.9 keeps the top 90% of probability mass).

Top-P = 0.9

Token           Probability  Cumulative
"Python"        25%          25%
" Python"       20%          45%
" JavaScript"   12%          57%
" Rust"         8%           65%
" C"            5%           70%
" Go"           4%           74%
" TypeScript"   3%           77%
" Java"         3%           80%
" the"          2%           82%
" Kotlin"       2%           84%
" Swift"        1.5%         85.5%
" arguably"     1.5%         87%
" C++"          1%           88%
" definitely"   1%           89%
" none"         0.5%         89.5%
" Haskell"      0.5%         90%    ← CUTOFF (cumulative reached 90%)
─── Everything below is EXCLUDED ───
" PHP"          0.3%         ✘
" assembly"     0.1%         ✘
" banana"       0.0001%      ✘ (would never happen but is technically possible)

Benefits: Removes very unlikely (potentially nonsensical) tokens, while dynamically adjusting how many tokens to consider based on the model's confidence.

  • High confidence (one token has 95%) → nucleus is tiny (1-2 tokens)
  • Low confidence (spread across many) → nucleus is large (many tokens)
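The cutoff logic is simple to sketch in Python. The toy distribution here is made up for illustration; a real implementation applies the same cutoff to the full sorted vocabulary:

```python
def top_p_filter(dist, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p (the "nucleus"), then renormalize over that set.
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

dist = {"Python": 0.50, " JavaScript": 0.25, " Rust": 0.15, " banana": 0.10}
filtered = top_p_filter(dist, 0.9)   # " banana" falls outside the nucleus
```

Because the loop stops as soon as the cumulative mass reaches p, a confident distribution yields a tiny nucleus and an uncertain one yields a large nucleus, exactly as described above.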

Top-K

Simpler than Top-P: only consider the K most likely tokens.

Top-K = 5:
  Only consider: "Python", " Python", " JavaScript", " Rust", " C"
  Everything else: excluded

Disadvantage: Fixed K regardless of distribution. When the model is very confident, K=50 still includes garbage. When uncertain, K=5 might exclude good options.

Top-P is generally preferred because it adapts dynamically.
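Top-k is even shorter to sketch in Python (toy distribution, made up for illustration):

```python
def top_k_filter(dist, k=5):
    # Keep only the k highest-probability tokens and renormalize.
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

dist = {"Python": 0.40, " JavaScript": 0.30, " Rust": 0.20, " banana": 0.10}
filtered = top_k_filter(dist, k=2)   # keeps only the top 2, whatever they are
```

Note the fixed cutoff: k tokens survive regardless of how the probability mass is distributed, which is exactly the disadvantage described above.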


How They Work Together

API call:
  temperature = 0.7
  top_p = 0.95
  top_k = 50  (some APIs)

Step 1: Apply temperature to adjust distribution
Step 2: Apply top-k to keep only top 50 tokens
Step 3: Apply top-p to keep only tokens summing to 95%
Step 4: Sample from remaining tokens

Best practice: tune either temperature or top-p, not both; pushing both to extreme values at once makes the output distribution hard to reason about.
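The four steps can be sketched together in Python. The exact order and renormalization details vary between implementations, so treat this as one plausible version rather than any specific API's behavior:

```python
import random

def sample_next(dist, temperature=0.7, top_k=50, top_p=0.95, seed=None):
    # Step 1: temperature -- sharpen or flatten the distribution
    scaled = {t: p ** (1.0 / temperature) for t, p in dist.items()}
    total = sum(scaled.values())
    scaled = {t: p / total for t, p in scaled.items()}
    # Step 2: top-k -- keep only the k most likely tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept_mass = sum(p for _, p in ranked)
    # Step 3: top-p -- trim to the nucleus within what top-k kept
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p * kept_mass:
            break
    # Step 4: sample from the surviving tokens
    tokens, weights = zip(*nucleus)
    return random.Random(seed).choices(tokens, weights=weights)[0]
```

With the example distribution from the top of this section, each call returns one of the surviving high-probability tokens.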


Practical Settings

Use Case                Temperature   Top-P   Why
Code generation         0 - 0.2       0.95    Correctness over creativity
Factual Q&A             0             1.0     Want the most likely answer
General conversation    0.5 - 0.7     0.95    Natural but focused
Creative writing        0.8 - 1.0     0.95    Diverse, interesting language
Brainstorming           0.9 - 1.0     0.99    Want novel ideas
JSON/structured output  0             1.0     Must be valid format

Extended Thinking

Modern models like Claude offer extended thinking — the model reasons internally before responding.

Without thinking:
  User: "What's 127 × 84?" → Model: "10,668" (one-shot guess, might be wrong)

With thinking:
  User: "What's 127 × 84?"
  Model (thinking): "Let me break this down... 127 × 80 = 10,160, 127 × 4 = 508, 
                      10,160 + 508 = 10,668"
  Model (output): "10,668" (reasoned step by step, more reliable)
  • Thinking tokens use the model's context but are shown separately
  • Think of it as "chain-of-thought inside the model"
  • Useful for math, logic, complex reasoning, planning
  • Claude: enabled by default, uses up to 32K thinking tokens
  • Toggle with Option+T (macOS) / Alt+T (Windows)

Other Sampling Parameters

Parameter           What It Does                                              Common Values
max_tokens          Maximum tokens in the response                            1024, 4096, 8192
stop_sequences      Stop generating when these strings appear                 ["\n\n", "END"]
frequency_penalty   Penalize tokens that appear often (reduce repetition)     0.0 - 1.0
presence_penalty    Penalize tokens that appeared at all (encourage novelty)  0.0 - 1.0
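The two penalties are commonly described as adjustments subtracted from a token's raw logit before sampling: the frequency penalty scales with how many times the token has already appeared, while the presence penalty is a flat deduction for any token that has appeared at all. A hedged sketch of that formula (the helper name is made up):

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.5):
    # For each token: logit - frequency_penalty * count - presence_penalty * (count > 0)
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        n = counts.get(token, 0)
        adjusted[token] = logit - frequency_penalty * n - presence_penalty * (1 if n > 0 else 0)
    return adjusted
```

A token repeated twice loses two frequency-penalty units plus one presence-penalty unit, while unused tokens are untouched, which is why these settings reduce repetition and encourage novelty respectively.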

Previous: 03 - Context Window | Next: 05 - Embeddings & Vector Search