Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
- URL: http://arxiv.org/abs/2502.18080v1
- Date: Tue, 25 Feb 2025 10:48:05 GMT
- Title: Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
- Authors: Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei,
- Abstract summary: Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.<n>We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
- Score: 113.49074603075032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current researches continue to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
Related papers
- ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning [1.170732359523702]
Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks.
Long reasoning traces can facilitate a more thorough exploration of solution paths for complex problems.
We introduce ShorterBetter, a simple yet effective reinforcement learning methed that enables reasoning language models to discover their own optimal CoT lengths.
arXiv Detail & Related papers (2025-04-30T07:04:19Z) - Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models.
We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models.
For reasoning models, majority voting proves to be a robust inference strategy, generally competitive or outperforming other more sophisticated ITC methods.
arXiv Detail & Related papers (2025-04-18T19:32:55Z) - A Survey of Scaling in Large Language Model Reasoning [62.92861523305361]
We provide a comprehensive examination of scaling in large Language models (LLMs) reasoning.
We analyze scaling in reasoning steps that improves multi-step inference and logical consistency.
We discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement.
arXiv Detail & Related papers (2025-04-02T23:51:27Z) - Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [57.17826305464394]
o1-like models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs)
We introduce the DeltaBench, including the generated long CoTs from different o1-like models for different reasoning tasks.
Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models.
arXiv Detail & Related papers (2025-02-26T17:59:27Z) - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities.<n>We find that longer CoTs of these o1-like models do not consistently enhance accuracy.<n>We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)<n>However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?<n>In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [70.44265766483633]
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space.<n>Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time.<n>We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically.
arXiv Detail & Related papers (2025-02-07T18:55:02Z) - Trading Inference-Time Compute for Adversarial Robustness [27.514612815314084]
We conduct experiments on the impact of increasing inference-time compute in reasoning models on their robustness to adversarial attacks.<n>We find that across a variety of attacks, increased inference-time compute leads to improved robustness.
arXiv Detail & Related papers (2025-01-31T01:20:44Z) - Optimizing Chain-of-Thought Reasoning: Tackling Arranging Bottleneck via Plan Augmentation [34.042565099565934]
We propose a plan-based training and reasoning method that guides models to generate arranging steps through abstract plans.
Results show that compared to fine-tuning directly with CoT data, our approach achieves a better performance on alleviating arranging bottleneck.
arXiv Detail & Related papers (2024-10-22T08:38:50Z) - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.