Trust but Verify! A Survey on Verification Design for Test-time Scaling
- URL: http://arxiv.org/abs/2508.16665v3
- Date: Tue, 09 Sep 2025 12:54:42 GMT
- Title: Trust but Verify! A Survey on Verification Design for Test-time Scaling
- Authors: V Venktesh, Mandeep Rathee, Avishek Anand
- Abstract summary: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. Verifiers serve as reward models that score candidate outputs from the decoding process. Verifiers can be prompt-based, or fine-tuned as discriminative or generative models.
- Score: 8.428618801719198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS, such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that score the candidate outputs from the decoding process, diligently exploring the vast solution space to select the best outcome. This paradigm has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. The verifiers can be prompt-based, or fine-tuned as discriminative or generative models, to verify process paths, outcomes, or both. Despite their widespread adoption, there is no detailed collection, clear categorization, and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types, and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
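The verifier-guided decoding described in the abstract can be sketched as a minimal best-of-N loop: sample several candidates, score each with a verifier, keep the best. Here `generate` and `verifier_score` are hypothetical stand-ins for an LLM sampler and a trained verifier (reward model), not APIs from any of the surveyed systems.

```python
# Minimal best-of-N sketch of verifier-guided test-time scaling.
# `generate` and `verifier_score` are hypothetical stand-ins for an
# LLM sampler and a trained verifier (reward model), respectively.
import random


def generate(prompt: str, n: int, seed: int = 0) -> list[str]:
    # Stand-in for sampling n candidate outputs from an LLM.
    rng = random.Random(seed)
    return [f"{prompt} -> candidate {i} ({rng.random():.2f})" for i in range(n)]


def verifier_score(prompt: str, candidate: str) -> float:
    # Stand-in for a verifier / reward model scoring one candidate.
    return float(len(candidate) % 7)


def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = generate(prompt, n)
    # The verifier scores every candidate; the highest-scoring one is kept.
    return max(candidates, key=lambda c: verifier_score(prompt, c))


print(best_of_n("2 + 2 = ?"))
```

Spending more compute here simply means raising `n`: a larger candidate pool gives the verifier more of the solution space to select from, which is the parameter-free scaling knob the abstract refers to.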
Related papers
- interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors [47.363850513075356]
We present a test-time verification framework, interwhen, that ensures that the output of a reasoning model is valid with respect to a given set of verifiers. Verified reasoning is an important goal in high-stakes scenarios such as deploying agents in the physical world.
arXiv Detail & Related papers (2026-02-05T08:35:01Z)
- Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation [13.057539100440634]
How to efficiently utilize and scale up computational resources during test time remains underexplored. A key point in applying test-time scaling to DLRS lies in effectively generating diverse yet meaningful outputs. Test-time scaling can be seamlessly accelerated with the increase in parallel servers when deployed online.
arXiv Detail & Related papers (2025-12-08T15:41:10Z)
- Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering [51.7496756448709]
Language models (LMs) perform well on coding benchmarks but struggle with real-world software engineering tasks. Existing approaches rely on supervised fine-tuning with high-quality data, which is expensive to curate at scale. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process.
arXiv Detail & Related papers (2025-05-29T16:15:36Z)
- Value-Guided Search for Efficient Chain-of-Thought Reasoning [49.971608979012366]
We propose a simple and efficient method for value model training on long-context reasoning traces. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods.
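The final weighted majority vote mentioned in this summary can be illustrated in isolation: each sampled reasoning trace ends in an answer, and the value model's score weights that answer's vote. `value_score` below is a hypothetical stand-in for the trained token-level value model, not the paper's actual model.

```python
# Sketch of a weighted majority vote over sampled reasoning traces.
# `value_score` is a hypothetical stand-in for a trained value model.
from collections import defaultdict


def value_score(trace: str) -> float:
    # Stand-in: a real value model would score the full reasoning trace.
    return 1.0 + 0.1 * len(trace)


def weighted_majority_vote(samples: list[tuple[str, str]]) -> str:
    # Each sample is a (reasoning trace, final answer) pair; votes for
    # the same answer accumulate, weighted by the value estimate.
    votes: dict[str, float] = defaultdict(float)
    for trace, answer in samples:
        votes[answer] += value_score(trace)
    return max(votes, key=votes.get)


samples = [
    ("short trace", "42"),
    ("a much longer reasoning trace", "41"),
    ("another trace", "42"),
]
print(weighted_majority_vote(samples))  # "42": its two weighted votes outweigh "41"
```

Unlike plain majority voting, a high-value trace can outvote several low-value ones, which is what makes the value model's guidance matter at the aggregation step as well as during search.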
arXiv Detail & Related papers (2025-05-23T01:05:07Z)
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [108.07030347318624]
We show that scaling with longer Chains of Thought (CoTs) can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains. We propose a Thinking-Optimal Scaling strategy to teach models to adopt different reasoning efforts for deep thinking. Our self-improvement models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z)
- Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity. Our ablation studies reveal that previously overlooked strategies can significantly enhance performance. We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.