Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models
- URL: http://arxiv.org/abs/2505.20522v2
- Date: Sat, 07 Jun 2025 22:18:32 GMT
- Title: Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models
- Authors: Jian Wang, Boyan Zhu, Chak Tou Leong, Yongqi Li, Wenjie Li
- Abstract summary: Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. As we push these scaling boundaries, understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM).
- Score: 7.2703757624760526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
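The paper's actual TTSPM derivation is its contribution; as a rough intuition for the parallel-scaling case only, here is a minimal sketch. It assumes best-of-N sampling with an i.i.d. per-sample success probability p, which is our simplification rather than the paper's exact model, and the function names and threshold values are illustrative, not taken from the paper.

```python
import math

def parallel_accuracy(p: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples is correct."""
    return 1.0 - (1.0 - p) ** n

def saturation_point(p: float, eps: float) -> int:
    """Smallest N at which the marginal gain of one more sample,
    p * (1 - p) ** N, drops below eps (diminishing returns)."""
    # Solve p * (1 - p)**N < eps for N; valid for 0 < p < 1 and eps < p.
    return max(1, math.ceil(math.log(eps / p) / math.log(1.0 - p)))

if __name__ == "__main__":
    p, eps = 0.30, 0.005  # illustrative values, not from the paper
    n_star = saturation_point(p, eps)
    print(f"saturation at N* = {n_star}, "
          f"accuracy ~ {parallel_accuracy(p, n_star):.3f}")
```

With these toy numbers the marginal gain falls below eps at N* = 12, where accuracy is already about 0.986, so extra samples past that budget buy almost nothing; the paper derives such thresholds for both parallel and sequential scaling under a shared bound structure.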
Related papers
- Probabilistic Optimality for Inference-time Scaling [11.92228840747636]
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). We propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.). Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling (a toy illustration of this kind of bound appears after this list).
arXiv Detail & Related papers (2025-06-27T16:44:11Z)
- A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search [15.387256204743407]
Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. Inference costs now represent a significant and growing component of the overall resource burden. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learned skill graph.
arXiv Detail & Related papers (2025-06-10T14:47:48Z)
- Kinetics: Rethinking Test-Time Scaling Laws [18.325591438335007]
The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a size threshold than on smaller ones. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples.
arXiv Detail & Related papers (2025-06-05T17:59:24Z)
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
- Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting. We show that as sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a method grounded in probability theory to quickly and accurately predict scaling performance and select the best strategy under large sampling times.
arXiv Detail & Related papers (2025-05-16T08:28:57Z)
- A Survey of Scaling in Large Language Model Reasoning [62.92861523305361]
We provide a comprehensive examination of scaling in large language model (LLM) reasoning. We analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement.
arXiv Detail & Related papers (2025-04-02T23:51:27Z)
- Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [33.011660907969706]
Inference-time scaling can enhance the reasoning capabilities of large language models. We investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks.
arXiv Detail & Related papers (2025-03-31T23:40:28Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
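As flagged in the "Probabilistic Optimality for Inference-time Scaling" entry above, here is a toy illustration of the kind of sample lower bound that framework describes, under its stated i.i.d. assumption. The function name and the numbers are our own illustration, not the paper's notation or results.

```python
import math

def min_samples(p: float, target: float) -> int:
    """Lower bound on N such that 1 - (1 - p)**N >= target,
    i.e. N >= log(1 - target) / log(1 - p)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Illustrative numbers: per-sample accuracy 0.30, target 0.95 -> N = 9.
print(min_samples(p=0.30, target=0.95))
```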