First Finish Search: Efficient Test-Time Scaling in Large Language Models
- URL: http://arxiv.org/abs/2505.18149v1
- Date: Fri, 23 May 2025 17:57:43 GMT
- Title: First Finish Search: Efficient Test-Time Scaling in Large Language Models
- Authors: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty,
- Abstract summary: First Finish Search (FFS) is a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes.<n>FFS achieves $82.23%$ accuracy on the AIME datasets, a $15%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance.
- Score: 20.62274005080048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
Related papers
- DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching [54.98126916293868]
Large Reasoning Models (LRMs) produce excessively long chain-of-thought traces that degrade accuracy.<n>We propose a model-agnostic decoding framework that sketches the reasoning space by branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path.<n>This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision.
arXiv Detail & Related papers (2025-11-01T17:41:28Z) - Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification [40.75612723453356]
In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them.<n>We study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation.<n>We conduct experiments with flagship open-source models and extend them to their Heavy'' variants through TTS.
arXiv Detail & Related papers (2025-10-07T17:09:23Z) - Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm [8.405729585427226]
Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs)<n>We propose the Explore-Execute Chain ($E2C$), a structured reasoning framework that decouples reasoning into two distinct phases.
arXiv Detail & Related papers (2025-09-28T15:48:40Z) - $\ exttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$textttSPECS$ is a latency-aware test-time scaling method inspired by speculative decoding.<n>Our results show that $textttSPECS$matches or surpasses beam search accuracy while reducing latency by up to $sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations.<n>We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - Thinker: Learning to Think Fast and Slow [20.900923033065734]
We introduce a simple modification to the QA task that includes four stages: Fast Thinking, Verification, Slow Thinking, and Summarization.<n>Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B.
arXiv Detail & Related papers (2025-05-27T12:22:46Z) - Fast Quiet-STaR: Thinking Without Thought Tokens [51.79231070632772]
Fast Quiet STaR is a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost.<n>Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens.<n>Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy.
arXiv Detail & Related papers (2025-05-23T11:14:12Z) - Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z) - Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement [22.801244105119025]
We propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation.<n>We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs.<n>Experiments on four reasoning benchmarks, MATH500, AMC, AIME24 and OlympiadBench, demonstrate that TS is highly effective compared to s1's budget forcing approach.
arXiv Detail & Related papers (2025-05-12T18:04:39Z) - Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking [16.441081996257576]
We propose a simple yet effective test-time scaling approach Multi-round Thinking.<n>This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds.<n>Experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements.
arXiv Detail & Related papers (2025-03-25T17:19:38Z) - START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM.<n> START is capable of performing complex computations, self-checking, exploring diverse methods, and self-ging.<n>It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z) - s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance.<n>We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z) - Boosting Logical Reasoning in Large Language Models through a New
Framework: The Graph of Thought [7.356034193515096]
Our paper unveils a pioneering prompting technique, dubbed textitGraph of Thoughts (GoT).
Our method outperformed GPT-4, achieving accuracy improvements of $89.7%$, $86%$, and $56%$ for each respective task.
When juxtaposed with the state-of-the-art prompting method, textitTree of Thought (ToT), our approach registered an average accuracy boost of $23%$, $24%$, and $15%$.
arXiv Detail & Related papers (2023-08-16T18:13:27Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimize task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one P query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.