Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- URL: http://arxiv.org/abs/2505.04842v1
- Date: Wed, 07 May 2025 22:41:26 GMT
- Title: Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
- Abstract summary: We propose RL$^V$ that augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier. RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling.
- Score: 57.95157497749428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prevalent reinforcement learning~(RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. In this work, we propose RL$^V$ that augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20\% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2-1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
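As a concrete illustration of how a jointly trained generative verifier can be used for test-time scaling, the Python sketch below shows verifier-weighted best-of-N selection over parallel samples. It is a minimal sketch under stated assumptions, not the paper's implementation: generate, verify_score, and extract_answer are hypothetical stand-ins for the reasoner/verifier LLM and answer parsing, and the paper's exact prompting and aggregation may differ.

    # Minimal sketch of verifier-weighted best-of-N selection at test time.
    # The callables below are hypothetical stand-ins, not the paper's API:
    #   generate(prompt, n)        -> list of n sampled solution strings
    #   verify_score(prompt, sol)  -> verifier score in [0, 1], e.g. P("Yes")
    #   extract_answer(sol)        -> final answer parsed from a solution
    from collections import defaultdict

    def weighted_best_of_n(prompt, generate, verify_score, extract_answer, n=8):
        solutions = generate(prompt, n)        # parallel samples from the reasoner
        votes = defaultdict(float)
        for sol in solutions:
            # Verifier-weighted voting: each solution contributes its score
            # to the bucket of the final answer it reaches.
            votes[extract_answer(sol)] += verify_score(prompt, sol)
        return max(votes, key=votes.get)       # answer with the highest total score

The same scores could instead be used for plain best-of-N (return the single highest-scored solution); weighted voting is shown here because it combines verification with parallel-sampling consensus.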
Related papers
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to $2\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [60.67176246634741]
We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem. We show that state-of-the-art models do not minimize regret, but one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward during RL.
arXiv Detail & Related papers (2025-03-10T17:40:43Z) - $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training [60.01594991938747]
$Q\sharp$ is a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function (the standard closed form of this guidance is recalled after this list). Our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees.
arXiv Detail & Related papers (2025-02-27T21:43:00Z) - Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based value estimates (a minimal sketch of such an estimate appears after this list). Our method consistently outperforms PPO and other baselines on the MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$ [12.111848705677142]
We propose RL$^3$, a hybrid approach that incorporates action-values, learned per task via traditional RL, into the inputs to meta-RL. We show that RL$^3$ earns greater cumulative reward in the long term than RL$^2$ while drastically reducing meta-training time, and generalizes better to out-of-distribution tasks.
arXiv Detail & Related papers (2023-06-28T04:16:16Z) - LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning [78.2286146954051]
LCRL implements model-free Reinforcement Learning (RL) algorithms over unknown Markov Decision Processes (MDPs).
We present case studies to demonstrate the applicability, ease of use, scalability, and performance of LCRL.
arXiv Detail & Related papers (2022-09-21T13:21:00Z)
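For the $Q\sharp$ entry above, the guidance it describes corresponds to the standard closed form for the optimal policy of KL-regularized RL, stated here in generic notation rather than that paper's:

$\pi^*(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\, \exp\big(Q^*(s, a) / \beta\big)$

where $\pi_{\mathrm{ref}}$ is the reference policy, $Q^*$ the optimal regularized action-value function, and $\beta$ the KL-regularization coefficient; guiding generation with $Q^*$ amounts to reweighting the reference policy's next-token distribution.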
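For the VinePPO entry, the sketch below illustrates an unbiased Monte Carlo value estimate at an intermediate reasoning step by resetting to a partial solution and averaging outcome rewards over fresh completions. It is a minimal sketch: rollout and reward are hypothetical stand-ins for the policy LLM and the outcome check, and the paper's chunking and sample counts may differ.

    # Minimal sketch of a Monte Carlo value estimate for a partial solution,
    # in the spirit of VinePPO. `rollout(prompt, prefix)` samples a completion
    # from the current policy; `reward(prompt, solution)` returns the outcome
    # reward (e.g. 1.0 if the final answer is correct, else 0.0).
    def mc_value(prompt, partial_solution, rollout, reward, k=4):
        # Average the outcome reward of k independent completions that
        # continue from the same partial solution.
        returns = [
            reward(prompt, partial_solution + rollout(prompt, partial_solution))
            for _ in range(k)
        ]
        return sum(returns) / k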