SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
- URL: http://arxiv.org/abs/2501.19306v4
- Date: Sat, 30 Aug 2025 05:12:37 GMT
- Title: SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
- Authors: Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, Sercan Ö Arık,
- Abstract summary: Self-Enhanced Test-Time Scaling (SETS) is a simple yet effective approach that overcomes limitations by strategically combining parallel and sequential techniques.<n>SETS exploits the inherent self-verification and self- computation capabilities of Large Language Models, unifying sampling, verification, and correction within a single framework.<n>Our results show SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
- Score: 39.57154199908565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs.<n>We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning.<n>Most test-time scaling (TTS) algorithms rely on autoregressive decoding.<n>We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - Efficient Test-Time Scaling for Small Vision-Language Models [14.654047034885288]
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models.<n>Existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models.<n>We propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision.
arXiv Detail & Related papers (2025-10-03T23:49:06Z) - Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time [12.659582318581606]
Current approaches, including Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs), enhance semantic understanding but suffer from high computational overhead and limited cross-domain transferability when fine-tuned.<n>We propose a novel framework employing Test-Time Scaling (TTS) that leverages vision-language alignment capabilities through Guidance-based methods for direct similarity scoring.<n>Our approach eliminates two-stage processing by employing structured prompts that generate length-controllable scoring outputs.
arXiv Detail & Related papers (2025-09-02T09:25:13Z) - Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling [38.27469349005585]
Test-time scaling is a powerful paradigm for enhancing the reasoning capabilities of large language models.<n>Test-time scaling is inherently inefficient due to the generation of redundant and repetitive reasoning traces.<n>We introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating test-time scaling.
arXiv Detail & Related papers (2025-08-30T01:54:55Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts.<n>It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.<n>MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios.<n>We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z) - DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling [20.605487145370752]
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation.<n>Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints.<n>We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework.
arXiv Detail & Related papers (2025-06-19T05:40:54Z) - Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space.<n>LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024.<n>Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z) - Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier [13.980380294971093]
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency.<n>We introduce FlexiVe, a novel generative verifier that balances flexibly computational resources between rapid, reliable fast thinking and meticulous slow thinking.<n>Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench.
arXiv Detail & Related papers (2025-05-17T11:41:44Z) - SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation [29.441669360316418]
Test-Time adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference.<n> augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application.<n>We propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA) which can take advantage of data augmentations without increasing the computational burden.
arXiv Detail & Related papers (2025-05-07T02:58:37Z) - T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models [9.674458633565111]
We investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time scaling.
We propose Tool-integrated self-verification (T1), which delegates-heavy verification steps to external tools, such as a code interpreter.
Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance.
arXiv Detail & Related papers (2025-04-07T04:01:17Z) - Adaptive Rectification Sampling for Test-Time Compute Scaling [5.085583751997239]
We propose Adaptive Rectification Sampling (AR-Sampling) to guide large language models to self-correction.
Our approach enables the models to rethink in more fine-grained level, improving the accuracy of solutions.
arXiv Detail & Related papers (2025-04-02T02:57:52Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Iterative Deepening Sampling for Large Language Models [27.807695570974644]
Training models to achieve effective self-correction and self-correction remains a significant challenge.
We propose a novel iterative sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
arXiv Detail & Related papers (2025-02-08T04:39:51Z) - S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks.<n>We propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters.<n>Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL.
arXiv Detail & Related papers (2025-01-22T20:00:41Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC)
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z) - Active Testing of Large Language Model via Multi-Stage Sampling [17.89896012553348]
AcTracer is an active testing framework tailored for large language models (LLMs)
It strategically selects a small subset of test data to achieve a nearly optimal performance estimation.
Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-08-07T06:17:48Z) - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.