Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
- URL: http://arxiv.org/abs/2509.04474v1
- Date: Sat, 30 Aug 2025 01:54:55 GMT
- Title: Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
- Authors: Shengyin Sun, Yiming Li, Xing Li, Yingzhao Lian, Weizhe Lin, Hui-Ling Zhen, Zhiyuan Yang, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma
- Abstract summary: Test-time scaling is a powerful paradigm for enhancing the reasoning capabilities of large language models. Test-time scaling is inherently inefficient due to the generation of redundant and repetitive reasoning traces. We introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating test-time scaling.
- Score: 38.27469349005585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
Related papers
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts [19.518525241726916]
Encode-Think-Decode (ETD) is a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model.
arXiv Detail & Related papers (2025-10-08T15:58:35Z)
- Probabilistic Optimality for Inference-time Scaling [8.126757296203957]
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). We propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed. We develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses.
arXiv Detail & Related papers (2025-06-27T16:44:11Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models [7.2703757624760526]
Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. As we push these scaling boundaries, understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling plateau of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM).
arXiv Detail & Related papers (2025-05-26T20:58:45Z)
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
- Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting. We show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times.
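The majority-voting setting studied here reduces, mechanically, to counting the final answers extracted from N independent reasoning traces. A minimal sketch (illustrative only; in practice `answers` would be parsed from N sampled LLM responses):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N sampled responses.
    Under majority-voting test-time scaling, growing N concentrates the
    vote on the model's modal answer for the prompt, which is why a
    simple strategy with a better modal answer eventually wins."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer
```

Because only the modal answer matters in the large-N limit, a strategy's initial (small-N) advantage can vanish as sampling scales, matching the paper's observation about complicated prompting falling behind Chain-of-Thought.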
arXiv Detail & Related papers (2025-05-16T08:28:57Z)
- Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling [39.57154199908565]
Self-Enhanced Test-Time Scaling (SETS) is a simple yet effective approach that overcomes limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of Large Language Models, unifying sampling, verification, and correction within a single framework. Our results show SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
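A generic loop combining parallel sampling with sequential self-correction, in the spirit of what SETS unifies, might be sketched as follows. This is purely illustrative, not the paper's algorithm; `generate`, `verify`, and `correct` stand in for LLM calls:

```python
def sample_then_correct(generate, verify, correct, n_samples=4, max_rounds=3):
    """Parallel stage: draw n candidate solutions.
    Sequential stage: iteratively revise candidates that fail
    self-verification, returning the first verified answer."""
    candidates = [generate() for _ in range(n_samples)]
    for _ in range(max_rounds):
        for i, cand in enumerate(candidates):
            if verify(cand):
                return cand
            candidates[i] = correct(cand)
    # No candidate passed verification within the budget.
    return candidates[0]
```

The design choice is the one the abstract highlights: sampling widens the search (parallel), while verification-gated correction deepens it (sequential), within a single compute budget.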
arXiv Detail & Related papers (2025-01-31T17:03:16Z)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.