Related papers: SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling

Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z)
Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios. We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z)
DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling [20.605487145370752]
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework.
arXiv Detail & Related papers (2025-06-19T05:40:54Z)
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier [13.980380294971093]
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. We introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench.
arXiv Detail & Related papers (2025-05-17T11:41:44Z)
SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation [29.441669360316418]
Test-time adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference. Augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application. We propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA), which can take advantage of data augmentations without increasing the computational burden.
arXiv Detail & Related papers (2025-05-07T02:58:37Z)
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models [9.674458633565111]
We investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time scaling. We propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance.
arXiv Detail & Related papers (2025-04-07T04:01:17Z)
Adaptive Rectification Sampling for Test-Time Compute Scaling [5.085583751997239]
We propose Adaptive Rectification Sampling (AR-Sampling) to guide large language models toward self-correction. Our approach enables the models to rethink at a more fine-grained level, improving the accuracy of solutions.
arXiv Detail & Related papers (2025-04-02T02:57:52Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
Iterative Deepening Sampling for Large Language Models [27.807695570974644]
Training models to achieve effective self-evaluation and self-correction remains a significant challenge. We propose a novel iterative sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
arXiv Detail & Related papers (2025-02-08T04:39:51Z)
S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning (CL) with foundation models has emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks. We propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters. Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC). We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions. This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
Active Testing of Large Language Model via Multi-Stage Sampling [17.89896012553348]
AcTracer is an active testing framework tailored for large language models (LLMs). It strategically selects a small subset of test data to achieve a nearly optimal performance estimation. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-08-07T06:17:48Z)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data. We propose an active sample selection criterion to identify reliable and non-redundant samples. We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.