$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
- URL: http://arxiv.org/abs/2506.15733v1
- Date: Sun, 15 Jun 2025 05:50:05 GMT
- Title: $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
- Authors: Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun
- Abstract summary: $\texttt{SPECS}$ is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPS), often overlooking latency constraints. To address this gap, we propose $\texttt{SPECS}$, a latency-aware test-time scaling method inspired by speculative decoding. $\texttt{SPECS}$~uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on MATH500, AMC23 and OlympiadBench datasets show that $\texttt{SPECS}$~matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1\%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective with increasing beam width.
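The draft-verify-defer loop the abstract describes can be sketched as follows. This is an illustrative mock-up, not the paper's implementation: `draft_sample`, `target_score`, and `reward_score` are hypothetical callables standing in for the small draft model, the large target model, and the reward model, and the exact weighting rule is an assumption.

```python
import math
import random

def specs_step(prompt, draft_sample, target_score, reward_score,
               n_candidates=4, defer_threshold=0.5, beta=1.0):
    """One SPECS-style step: draft candidates, soft-verify, maybe defer.

    draft_sample(prompt) -> candidate text from the small draft model.
    target_score(prompt, c) -> log-probability-like score from the target model.
    reward_score(prompt, c) -> scalar reward from the reward model.
    Returns the accepted candidate, or None to signal deferral to the
    target model for this step.
    """
    candidates = [draft_sample(prompt) for _ in range(n_candidates)]
    # Reward-guided soft verification: weight each draft candidate by a
    # combination of the target model's score and the reward model's score.
    weights = [math.exp(target_score(prompt, c) + beta * reward_score(prompt, c))
               for c in candidates]
    # Sample a candidate in proportion to its verified weight.
    chosen = random.choices(candidates, weights=weights, k=1)[0]
    # Reward-based deferral: if even the chosen candidate scores poorly,
    # hand this step back to the larger target model.
    if reward_score(prompt, chosen) < defer_threshold:
        return None
    return chosen
```

In a full pipeline this step would run per reasoning segment, with deferral (`None`) triggering a regeneration from the target model before continuing.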
Related papers
- Kinetics: Rethinking Test-Time Scaling Laws [18.325591438335007]
Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than on smaller ones. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples.
arXiv Detail & Related papers (2025-06-05T17:59:24Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling [19.673388630963807]
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs). How to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. We propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias.
arXiv Detail & Related papers (2025-05-30T09:05:25Z)
- Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting. We show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times.
arXiv Detail & Related papers (2025-05-16T08:28:57Z)
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
- When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [90.5036809670993]
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task. We evaluate GenRM against Self-Consistency (SC) for most practical inference budgets across diverse models and datasets.
arXiv Detail & Related papers (2025-04-01T17:41:57Z)
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z)
- $φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation [22.607133083903125]
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. We frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Experiments show $φ$-Decoding outperforms strong baselines in both performance and efficiency.
arXiv Detail & Related papers (2025-03-17T15:38:33Z)
- Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding [64.2888389315149]
Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N sampling serves as a common scaling technique, broadening the search space for finding better solutions. We propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings.
arXiv Detail & Related papers (2025-03-03T11:21:01Z)
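Several of the entries above build on or compare against plain Best-of-N sampling. As a point of reference, a minimal sketch of that baseline, where `sample` and `reward` are hypothetical callables standing in for the language model and the reward model:

```python
def best_of_n(prompt, sample, reward, n=8):
    """Best-of-N sampling: draw n candidate responses independently,
    score each with a reward model, and keep the highest-scoring one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```

Methods such as ST-BoN aim to cut the cost of the `n` full generations by truncating unpromising samples early, while GenRM-style work replaces the scalar `reward` with a generative verifier.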
This list is automatically generated from the titles and abstracts of the papers in this site.