
Temperature, top-p, top-k: three knobs people keep confusing

Three sampling parameters with overlapping effects. Only two of them earn their place in production.

A model does not pick the next token deterministically. At each step it computes a probability distribution over the whole vocabulary and samples one. Temperature, top-p, and top-k are three knobs that change how that sampling works. They control how random the output is, in three different ways. In practice you usually set temperature, sometimes set top-p, and almost never touch top-k. Knowing why is the difference between a confident answer in an interview and “I think they all do roughly the same thing.”

What the model is doing at each step

flowchart LR
    P["Prompt so far"]:::p --> M[/"Model"/]:::m
    M --> D[("Probability over<br/>~50,000 tokens<br/>e.g. 'cat': 0.41, 'dog': 0.30...")]:::dist
    D --> S{"Sampler<br/>(temperature, top-p, top-k)"}:::s
    S --> T[("Pick one token")]:::t

    classDef p fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef m fill:#fed7aa,stroke:#c2410c,color:#7c2d12
    classDef dist fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef s fill:#e9d5ff,stroke:#7e22ce,color:#581c87
    classDef t fill:#dcfce7,stroke:#15803d,color:#14532d

The model produces a distribution. The sampler picks. All three knobs work on what the sampler sees and how it picks.

Temperature: how sharp the distribution is

Temperature reshapes the probability distribution before sampling. Low temperature makes high-probability tokens even more dominant. High temperature flattens everything out.

T = 0.0:   'cat': 1.00, 'dog': 0.00, 'fish': 0.00, ...   -> always 'cat'
T = 0.7:   'cat': 0.55, 'dog': 0.25, 'fish': 0.10, ...   -> usually 'cat'
T = 1.5:   'cat': 0.30, 'dog': 0.22, 'fish': 0.18, ...   -> often 'dog' or 'fish'

Temperature 0 is as close to deterministic as the API offers. Same prompt, same model, same seed (if exposed), and you usually get the same output. “Usually” because providers still ship small non-determinism even at 0 due to batching and floating-point edge cases.

Temperature 0.7 is a common default for “natural sounding” responses. Temperature 1.0 is the model’s untouched distribution.
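The reshaping can be sketched in a few lines: divide each logit by the temperature, then apply softmax. This is a minimal illustration, not any provider's implementation. The logits are made up, and treating T = 0 as "take the most likely token" is an assumption about how greedy decoding is usually handled.

```python
import math

def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a dict of token -> probability."""
    if temperature <= 0:
        # Assumption: T = 0 means greedy decoding, all mass on the top token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: z / temperature for tok, z in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(z - m) for tok, z in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits, chosen only for illustration.
logits = {"cat": 2.0, "dog": 1.5, "fish": 1.0, "bird": 0.5}

for t in (0.0, 0.7, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 2) for tok, p in probs.items()})
```

As the temperature drops, the share of the top token grows. As it rises, the probabilities move closer together.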

The honest map:

| Temperature | Use case |
|---|---|
| 0.0 | Classification, extraction, code generation, structured outputs. Anything where there is one right answer. |
| 0.3-0.7 | Conversational chat, explanations, summarisation. |
| 0.7-1.0 | Creative writing, brainstorming, generating multiple ideas. |
| > 1.0 | Rare in production. Sometimes used to force diversity in batch generation. |

Top-p (nucleus sampling): cap the cumulative probability

Top-p truncates the distribution to the smallest set of tokens whose combined probability is at least p. Everything outside that nucleus is ignored before sampling.

T = 1.0 (no temp change), distribution sorted:
  'cat':   0.40
  'dog':   0.25
  'fish':  0.15   <- top_p = 0.8 cuts here (0.40 + 0.25 + 0.15 = 0.80)
  'bird':  0.10
  'cow':   0.05
  ...

With top_p=0.8, the sampler picks among cat / dog / fish only. The long tail of unlikely tokens is gone.

Top-p prevents the model from picking very low-probability tokens, the kind that produce odd word choices or random topic shifts. The lower the p, the more conservative. p=0.9 to 1.0 is typical. p=0.5 is aggressive and rarely needed.
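A minimal sketch of the nucleus cut, using the distribution from the example above. The "other" entry is an assumption added so the probabilities sum to 1. Renormalising the survivors is the usual behaviour, but treat it as an assumption about any specific provider.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalise the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p - 1e-9:  # small tolerance for float rounding
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# "other" is a stand-in for the rest of the tail (an assumption).
probs = {"cat": 0.40, "dog": 0.25, "fish": 0.15,
         "bird": 0.10, "cow": 0.05, "other": 0.05}

nucleus = top_p_filter(probs, 0.8)
print(list(nucleus))  # ['cat', 'dog', 'fish']
```

The size of the nucleus is not fixed. On a peaked distribution, top_p=0.8 may keep one or two tokens. On a flat one it may keep dozens.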

Top-k: keep the K most likely tokens

Top-k truncates the distribution to the K most likely tokens, regardless of their probabilities.

top_k=3:   keep 'cat', 'dog', 'fish'.   Drop the rest.
top_k=50:  keep the top 50.

This is the bluntest of the three knobs. It does not care whether top_k=10 covers 99% of the mass or 30% of it. Most providers expose temperature and top-p but only some expose top-k (Anthropic does, OpenAI’s chat completions does not).

In practice top-k is rarely useful. Top-p does the same job adaptively.
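To see why top-k is blunt, here is a small sketch that applies the same k to a peaked and a flat distribution and reports how much probability mass survives. Both distributions and the helper names are invented for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {tok: prob / total for tok, prob in ranked}

def kept_mass(probs, k):
    """Share of the original probability mass that survives a top-k cut."""
    return sum(sorted(probs.values(), reverse=True)[:k])

# Hypothetical distributions: one peaked, one nearly flat.
peaked = {"cat": 0.90, "dog": 0.05, "fish": 0.03, "bird": 0.01, "cow": 0.01}
flat = {"cat": 0.22, "dog": 0.21, "fish": 0.20, "bird": 0.19, "cow": 0.18}

print(round(kept_mass(peaked, 3), 2))  # 0.98
print(round(kept_mass(flat, 3), 2))    # 0.63
```

The same top_k=3 keeps almost everything in one case and drops over a third of the mass in the other. Top-p adapts the cut to the shape of the distribution, while top-k does not.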

How they interact

flowchart LR
    DIST[("Raw distribution<br/>50k tokens")]:::a --> TEMP[Temperature<br/>reshapes]:::tx --> TPK[Top-k<br/>keep N]:::tx --> TPP[Top-p<br/>keep cumulative p]:::tx --> SAMP[Sample one]:::ok

    classDef a fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef tx fill:#fef3c7,stroke:#a16207,color:#713f12
    classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d

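Order matters. A common implementation order, and the one assumed here, is: temperature reshapes first, then top-k filters, then top-p filters, then the sampler picks. Some inference stacks order or combine these steps differently, so check your provider's documentation. Setting all three to low values over-constrains the sampler, because their effects compound.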
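The pipeline can be sketched as one function. This assumes the temperature, then top-k, then top-p order, and it assumes top-p is applied to the renormalised survivors of top-k. Real inference stacks may differ on both points. The function name, parameters, and logits are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Sample one token: temperature -> top-k -> top-p -> draw."""
    # Assumption: T = 0 is handled as greedy decoding.
    if temperature <= 0:
        return max(logits, key=logits.get)

    # 1. Temperature: softmax over logits / T (max subtracted for stability).
    m = max(logits.values())
    weights = {tok: math.exp((z - m) / temperature) for tok, z in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)

    # 2. Top-k: keep the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # 3. Top-p: renormalise, then keep the smallest prefix reaching top_p.
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p - 1e-9:
            break

    # 4. Draw one token in proportion to the surviving probabilities.
    tokens = [tok for tok, _ in kept]
    probs = [p for _, p in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits; a seeded generator keeps the demo repeatable.
logits = {"cat": 2.0, "dog": 1.5, "fish": 1.0, "bird": 0.5}
rng = random.Random(0)
print([sample_next_token(logits, temperature=0.7, top_p=0.9, rng=rng) for _ in range(5)])
```

Because each stage only sees what the previous one left, tight values stack: a low temperature followed by a small top_k and a low top_p can leave a single candidate.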

A clean default for production:

  • temperature: pick deliberately, document why.
  • top_p: leave at provider default (usually 1.0) unless you have a specific problem.
  • top_k: do not set unless you know exactly why.

“Deterministic” is mostly a lie

People assume temperature=0 means “same input, same output, always.” It does not, quite.

Providers still ship sources of randomness even at 0:

  • Batching. Other users’ calls in the same batch can change tie-breaking.
  • Hardware variation. GPU non-determinism in matrix multiplies.
  • Quiet model updates. Today’s claude-3-7 might not be the same weights as yesterday’s, if a minor revision shipped.

Pin the model version where the provider supports it (a dated snapshot such as claude-3-7-sonnet-20250219 rather than the moving alias claude-3-7-sonnet-latest). Even then, run the same prompt twice and expect roughly 95% identical output as a rule of thumb, not 100%. This is why you cannot test LLM systems with assert output == "expected". It is also Stage 5’s whole point.
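One way to test around this is to assert on properties of the output instead of the exact string. The sketch below assumes a hypothetical classification task where the model returns JSON with a "label" field. The label set, the helper name, and both sample outputs are invented for illustration.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical label set

def check_classification(raw_output):
    """Return the normalised label if the output meets the contract,
    otherwise raise ValueError."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = str(data.get("label", "")).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label

# Two hypothetical runs of the same prompt: different text, same meaning.
run_a = '{"label": "positive", "reason": "The review praises the product."}'
run_b = '{"label": "Positive", "reason": "The customer is clearly happy."}'

assert run_a != run_b  # an exact-match test would fail here
assert check_classification(run_a) == check_classification(run_b) == "positive"
```

The check passes for both runs because it tests the contract (valid JSON, a label from the allowed set) and ignores wording that is free to vary.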

When to actually change these

| Symptom | Knob | Direction |
|---|---|---|
| Output is too random, picks weird words | temperature | down |
| Output is repetitive, never tries new phrases | temperature | up |
| Output occasionally goes off-topic | top_p | down |
| You need structured output and small variation breaks it | temperature | 0 |
| You want N different outputs from the same prompt | temperature | up + n>1 |

For most production tasks, you set temperature to either 0 (when you want repeatability) or 0.7 (when you want sensible variation), leave top_p alone, and ignore top_k. Most remaining “quality issues” come from prompt design, not sampling parameters.

Common mistakes

  • Tuning all three at once. You cannot tell which one helped. Change one at a time.
  • Setting temperature=0 and assuming the result is reproducible. It is usually close, but not guaranteed identical from run to run.
  • Cranking top_p very low to “stop hallucinations.” It does not. Hallucinations are about content, not sampling. Better prompt or retrieval.
  • Using top_k because you saw it in a tutorial. Top-p does the same job adaptively and is more widely exposed across providers.
  • Ignoring sampling for classification tasks. Use temperature=0 so the label is stable.

Quick recap

  • The model produces a distribution; the sampler picks. Temperature, top-p, top-k change what the sampler sees.
  • Temperature reshapes how sharp the distribution is. 0 for deterministic-ish, 0.7 for natural, 1.0 for varied.
  • Top-p truncates to the smallest nucleus whose probabilities sum to at least p. Good safety belt against weird tail tokens.
  • Top-k truncates to the top K tokens. Bluntest tool. Usually unnecessary if you have top-p.
  • “Temperature 0 is deterministic” is mostly true and not quite reliable. Plan for tests that tolerate small variation.
  • In production: set temperature deliberately, leave top-p and top-k alone unless you have a real reason.

This concept sits in Stage 1 (Foundations: working with LLMs) of the AI Engineering Roadmap.
