Related papers: Tool Verification for Test-Time Reinforcement Learning

URL: http://arxiv.org/abs/2603.02203v1
Date: Mon, 02 Mar 2026 18:57:52 GMT
Title: Tool Verification for Test-Time Reinforcement Learning
Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
Abstract summary: Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models. We present T3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation.
Score: 70.09740926883818
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
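The verification-aware voting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `verification_aware_vote`, the `verified_weight` parameter, its default value of 2.0, and the binary reward scheme are all assumptions made for illustration. The sketch only shows how upweighting tool-verified rollouts can change the pseudo-label that plain majority voting would pick.

```python
from collections import defaultdict


def verification_aware_vote(answers, verified, verified_weight=2.0):
    """Pick a pseudo-label by weighted voting over rollout answers.

    answers:         final answer string of each rollout
    verified:        True where an external tool (e.g. code execution)
                     confirmed the rollout's answer
    verified_weight: hypothetical vote weight for verified rollouts;
                     unverified rollouts count as 1.0
    """
    if not answers or len(answers) != len(verified):
        raise ValueError("answers and verified must be non-empty and equal length")

    scores = defaultdict(float)
    for answer, ok in zip(answers, verified):
        scores[answer] += verified_weight if ok else 1.0

    # Highest weighted score wins; ties go to the answer seen first.
    pseudo_label = max(scores, key=scores.get)

    # Assumed reward rule: 1.0 if a rollout agrees with the pseudo-label.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Three unverified rollouts agree on "12"; two verified rollouts say "15".
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

label, rewards = verification_aware_vote(answers, verified)
plain_label, _ = verification_aware_vote(answers, verified, verified_weight=1.0)
print(label, plain_label, rewards)
```

With `verified_weight=1.0` the function reduces to plain majority voting, so the unverified consensus "12" wins. With the assumed weight of 2.0, the two verified rollouts outvote it and "15" becomes the pseudo-label.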

Related papers

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility". We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z)
Test-time Recursive Thinking: Self-Improvement without External Feedback [120.80790108733942]
Test-time Recursive Thinking (TRT) is an iterative self-improvement framework. Open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
arXiv Detail & Related papers (2026-02-03T04:37:37Z)
Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
Aletheia: What Makes RLVR For Code Verifiers Tick? [51.371034079170435]
Verifiers trained via Reinforcement Learning from Verifiable Rewards (RLVR) are a prominent fixture of the Large Language Model (LLM) post-training pipeline. Code verifiers remain valuable toward judging model outputs in scenarios where execution feedback is hard to obtain. We examine components of the RLVR-based verifier training recipe widely credited for its success.
arXiv Detail & Related papers (2026-01-17T22:30:45Z)
SWE-RM: Execution-free Feedback For Software Engineering Agents [61.86380395896069]
Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. We introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
arXiv Detail & Related papers (2025-12-26T08:26:18Z)
MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback [0.0]
Contemporary AI systems achieve extraordinary performance yet remain opaque and non-verifiable. We introduce MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics. The contribution is infrastructural: a working prototype of ledger-attested learning that enables auditability at scale.
arXiv Detail & Related papers (2025-12-22T19:27:55Z)
ReVeal: Self-Evolving Code Agents via Reliable Self-Verification [11.875519107421312]
We introduce ReVeal, a reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
arXiv Detail & Related papers (2025-06-13T03:41:04Z)
Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection [6.471199527741301]
We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training. We use a learned verifier to score a pool of generated responses and select only from high ranking pseudo-labeled examples for fine-tuned adaptation. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence.
arXiv Detail & Related papers (2025-05-26T03:54:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.