SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
- URL: http://arxiv.org/abs/2501.19306v3
- Date: Fri, 23 May 2025 04:26:30 GMT
- Title: SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
- Authors: Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, Sercan Ö Arık,
- Abstract summary: This paper introduces Self-Enhanced Test-Time Scaling (SETS), a new approach that overcomes the limitations of existing methods by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of Large Language Models, unifying sampling, verification, and correction within a single framework.
- Score: 44.11609084435251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing parallel scaling methods, such as repeated sampling or reward model scoring, often suffer from premature convergence and high costs due to task-specific reward model training, while sequential methods like SELF-REFINE cannot effectively leverage increased compute. This paper introduces Self-Enhanced Test-Time Scaling (SETS), a new approach that overcomes these limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This innovative design facilitates efficient and scalable test-time computation for enhanced performance on complex tasks. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
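To make the sampling-verification-correction loop concrete, below is a minimal Python sketch of how such a procedure could be wired together. It assumes a generic `llm(prompt) -> str` callable; the prompt templates, correction budget, and majority-vote aggregation shown here are illustrative placeholders, not the exact procedure specified in the paper.

```python
from collections import Counter
from typing import Callable


def sets_inference(llm: Callable[[str], str], task: str,
                   num_samples: int = 8, max_correction_rounds: int = 3) -> str:
    """Sketch of a SETS-style loop: parallel sampling, then per-sample
    self-verification and self-correction, then aggregation of the results."""
    refined_answers = []
    for _ in range(num_samples):  # parallel branch: independent samples
        answer = llm(f"Solve the following task:\n{task}")
        for _ in range(max_correction_rounds):  # sequential branch: verify, then correct
            verdict = llm(
                f"Task:\n{task}\nProposed answer:\n{answer}\n"
                "Check the answer step by step and reply CORRECT or INCORRECT "
                "with a brief critique."
            )
            if verdict.strip().upper().startswith("CORRECT"):
                break  # self-verification accepts this sample
            answer = llm(
                f"Task:\n{task}\nPrevious answer:\n{answer}\n"
                f"Critique:\n{verdict}\nProvide a corrected answer."
            )
        refined_answers.append(answer)
    # Aggregate the refined samples; majority voting is a stand-in for the
    # paper's aggregation rule.
    return Counter(refined_answers).most_common(1)[0][0]
```

The sketch only conveys how the parallel budget (number of samples) and the sequential budget (correction rounds per sample) compose; the paper defines the actual prompts, stopping rules, and aggregation.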
Related papers
- Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts.
It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.
MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z)
- Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test-Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios.
We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z)
- DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling [20.605487145370752]
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation.
Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints.
We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework.
arXiv Detail & Related papers (2025-06-19T05:40:54Z)
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space.
LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024.
Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
- Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier [13.980380294971093]
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency.
We introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking.
Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench.
arXiv Detail & Related papers (2025-05-17T11:41:44Z)
- SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation [29.441669360316418]
Test-Time Adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference.
Augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application.
We propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA), which can take advantage of data augmentations without increasing the computational burden.
arXiv Detail & Related papers (2025-05-07T02:58:37Z)
- T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models [9.674458633565111]
We investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time scaling.
We propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter.
Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance.
arXiv Detail & Related papers (2025-04-07T04:01:17Z)
- Adaptive Rectification Sampling for Test-Time Compute Scaling [5.085583751997239]
We propose Adaptive Rectification Sampling (AR-Sampling) to guide large language models to self-correct.
Our approach enables the models to rethink at a more fine-grained level, improving the accuracy of solutions.
arXiv Detail & Related papers (2025-04-02T02:57:52Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models.
Our framework incorporates two complementary strategies: internal TTC and external TTC.
We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Iterative Deepening Sampling for Large Language Models [27.807695570974644]
Training models to achieve effective self-correction remains a significant challenge.
We propose a novel iterative sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
arXiv Detail & Related papers (2025-02-08T04:39:51Z)
- S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks.
We propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters.
Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- Active Testing of Large Language Model via Multi-Stage Sampling [17.89896012553348]
AcTracer is an active testing framework tailored for large language models (LLMs).
It strategically selects a small subset of test data to achieve a nearly optimal performance estimation.
Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-08-07T06:17:48Z)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)