On the Role of Temperature Sampling in Test-Time Scaling
- URL: http://arxiv.org/abs/2510.02611v1
- Date: Thu, 02 Oct 2025 23:09:56 GMT
- Title: On the Role of Temperature Sampling in Test-Time Scaling
- Authors: Yuheng Wu, Azalia Mirhoseini, Thierry Tambe
- Abstract summary: We show that at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Averaged over Qwen3 and five representative reasoning benchmarks, temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training.
- Score: 5.758728541863352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different sampling temperatures solve different subsets of problems, implying that single-temperature scaling explores only part of a model's potential. We therefore propose scaling along the temperature dimension, which enlarges the reasoning boundary of LLMs. Averaged over Qwen3 (0.6B, 1.7B, 4B, 8B) and five representative reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training. We further provide a comprehensive analysis of this phenomenon and design a multi-temperature voting method that reduces the overhead of temperature scaling. Overall, our findings suggest that TTS is more powerful than previously thought, and that temperature scaling offers a simple and effective way to unlock the latent potential of base models.
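The abstract's core recipe, drawing reasoning traces at several temperatures and pooling them before selecting an answer, can be sketched as below. This is a minimal illustration, not the paper's exact method: `toy_generate` is a hypothetical stand-in for a model call, and majority voting over final answers is one possible selection rule.

```python
from collections import Counter

def sample_traces(generate, prompt, temperature, k):
    """Draw k answers at one temperature; `generate` is a hypothetical
    model call with signature (prompt, temperature) -> answer string."""
    return [generate(prompt, temperature) for _ in range(k)]

def multi_temperature_vote(generate, prompt, temperatures, k_per_temp):
    """Pool samples across temperatures, then majority-vote on the
    final answers -- a sketch of scaling along the temperature axis."""
    answers = []
    for t in temperatures:
        answers.extend(sample_traces(generate, prompt, t, k_per_temp))
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in model: confidently wrong at low temperature, but
# reaches the correct answer once sampling is hot enough.
def toy_generate(prompt, temperature):
    return "42" if temperature >= 0.8 else "41"

print(multi_temperature_vote(toy_generate, "Q", [0.2, 0.8, 1.0], 4))  # -> 42
```

The toy model makes the paper's motivation concrete: sampling only at temperature 0.2 can never recover the answer that temperatures 0.8 and 1.0 reach, so spreading the same budget across temperatures enlarges the set of solvable problems.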
Related papers
- The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling [22.839278056856433]
We show that increasing the temperature increases the uncertainty in the model in a very general sense. For LLMs, we challenge the common claim that increasing temperature increases diversity. We introduce two new characterisations of temperature scaling.
arXiv Detail & Related papers (2026-02-16T15:54:52Z) - Improving Diversity in Language Models: When Temperature Fails, Change the Loss [81.73385878967899]
We propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling.
arXiv Detail & Related papers (2025-08-13T09:37:53Z) - Exploring the Impact of Temperature on Large Language Models: Hot or Cold? [9.70280446429164]
We evaluate the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities. Our findings reveal skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection. We propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt.
arXiv Detail & Related papers (2025-06-08T21:36:26Z) - Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods [1.3892342684177872]
We propose a robust sampling strategy that eliminates the need for temperature calibration. MCT provides more robust uncertainty estimates across a wide range of temperatures. MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process.
arXiv Detail & Related papers (2025-02-25T17:33:20Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [108.07030347318624]
We show that scaling with longer Chain of Thoughts (CoTs) can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains. We propose a Thinking-Optimal Scaling strategy to teach models to adopt different reasoning efforts for deep thinking. Our self-improvement models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Optimizing Temperature for Language Models with Multi-Sample Inference [47.14991144052361]
This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different large language models. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. We propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines.
arXiv Detail & Related papers (2025-02-07T19:35:25Z) - Adaptive Decoding via Latent Preference Optimization [55.70602730588745]
We introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time.
Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures.
arXiv Detail & Related papers (2024-11-14T18:31:39Z) - Long Horizon Temperature Scaling [90.03310732189543]
Long Horizon Temperature Scaling (LHTS) is a novel approach for sampling from temperature-scaled joint distributions.
We derive a temperature-dependent LHTS objective, and show that finetuning a model on a range of temperatures produces a single model capable of generation with a controllable long horizon temperature parameter.
arXiv Detail & Related papers (2023-02-07T18:59:32Z) - Uhlmann Fidelity and Fidelity Susceptibility for Integrable Spin Chains at Finite Temperature: Exact Results [68.8204255655161]
We show that the proper inclusion of the odd parity subspace leads to the enhancement of maximal fidelity susceptibility in the intermediate range of temperatures.
The correct low-temperature behavior is captured by an approximation involving the two lowest many-body energy eigenstates.
arXiv Detail & Related papers (2021-05-11T14:08:02Z)
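Several of the papers above treat temperature as a rescaling of the softmax logits. A minimal numeric check of the claim that higher temperature increases model uncertainty, assuming the standard `softmax(logits / T)` formulation (the toy logits below are illustrative, not from any of the papers):

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]
low_t = entropy(softmax(logits, 0.5))   # cold: sharper distribution
high_t = entropy(softmax(logits, 2.0))  # hot: flatter distribution
print(low_t < high_t)  # higher temperature -> higher entropy
```

For any non-uniform logits, the entropy of the temperature-scaled softmax grows with T, which is the precise sense in which hotter sampling is "more uncertain".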
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.