RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
- URL: http://arxiv.org/abs/2505.15034v1
- Date: Wed, 21 May 2025 02:43:15 GMT
- Title: RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
- Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
- Abstract summary: Tango is a novel framework that uses reinforcement learning to concurrently train both an LLM generator and a verifier. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models.
- Score: 26.95555634754465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
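To make the interleaved training scheme above concrete, the sketch below shows one possible shape of the loop: the generator is updated with rewards produced by the generative verifier, while the verifier itself is updated only with outcome-level verification-correctness rewards. This is a toy sketch under stated assumptions; every class, prompt, and reward rule here is a placeholder for exposition rather than the authors' implementation (the real code is in the repository linked above).

```python
"""Minimal, illustrative sketch of Tango-style interleaved RL training of an
LLM generator and a generative, process-level verifier. All class names,
methods, prompts, and the toy reward logic are assumptions made for
exposition; the authors' actual implementation is at
https://github.com/kaiwenzha/rl-tango."""

import random
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    question: str
    answer: str  # only outcome-level ground truth is assumed


class ToyPolicy:
    """Stand-in for an RL-trained LLM; `update` abstracts one RL step (e.g. PPO/GRPO)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] response to: {prompt}"

    def update(self, samples: List[str], rewards: List[float]) -> None:
        # Placeholder for a policy-gradient update on (sample, reward) pairs.
        pass


def outcome_correct(problem: Problem, solution: str) -> bool:
    """Rule-based final-answer check: the only ground-truth signal assumed."""
    return problem.answer in solution


def verdict_is_positive(verdict: str) -> bool:
    """Toy parser for the verifier's generated judgment."""
    return "correct" in verdict.lower()


def train_tango(generator: ToyPolicy, verifier: ToyPolicy,
                problems: List[Problem], rounds: int = 3, batch_size: int = 4) -> None:
    for _ in range(rounds):
        batch = random.sample(problems, k=min(batch_size, len(problems)))
        solutions = [generator.generate(p.question) for p in batch]

        # Generator phase: the generative verifier judges the solution's
        # reasoning process, and those judgments become the generator's rewards.
        process_rewards = [
            float(verdict_is_positive(verifier.generate(f"Check each step of: {s}")))
            for s in solutions
        ]
        generator.update(solutions, process_rewards)

        # Verifier phase: the verifier is rewarded only for outcome-level
        # verification correctness (its verdict agreeing with the final-answer
        # check), with no process-level annotations.
        verdicts = [verifier.generate(f"Give a verdict for: {s}") for s in solutions]
        verifier_rewards = [
            float(verdict_is_positive(v) == outcome_correct(p, s))
            for v, p, s in zip(verdicts, batch, solutions)
        ]
        verifier.update(verdicts, verifier_rewards)


if __name__ == "__main__":
    data = [Problem("1+1=?", "2"), Problem("2*3=?", "6"),
            Problem("5-2=?", "3"), Problem("7+0=?", "7")]
    train_tango(ToyPolicy("generator"), ToyPolicy("verifier"), data)
```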
Related papers
- DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models. We investigate their denoising processes and reinforcement learning methods. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z) - ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification [6.983144806500892]
ReVeal is a multi-turn reinforcement learning framework that interleaves code generation with explicit self-verification and tool-based evaluation. It fosters the co-evolution of a model's generation and verification capabilities through RL training, expanding the reasoning boundaries of the base model. It also enables test-time scaling into deeper inference regimes, with code consistently evolving as the number of turns increases during inference.
arXiv Detail & Related papers (2025-06-13T03:41:04Z) - VerIF: Verification Engineering for Reinforcement Learning in Instruction Following [55.60192044049083]
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs). We propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks.
arXiv Detail & Related papers (2025-06-11T17:10:36Z) - SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Model (LLM) training with limited initial data. The proposed SeRL yields results superior to its counterparts and achieves performance on par with that obtained from high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z) - Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving [27.133677615587555]
Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial.
arXiv Detail & Related papers (2025-05-12T17:23:34Z) - OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs). We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - Process Supervision-Guided Policy Optimization for Code Generation [15.943210767010045]
Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation. We propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement.
arXiv Detail & Related papers (2024-10-23T07:22:33Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs. We observe improvements of 28% → 44.6% on MATH, and 37.9% → 53.5% on MMLU abstract algebra. (A minimal next-token-prediction scoring sketch in this spirit appears after this list.)
arXiv Detail & Related papers (2024-08-27T17:57:45Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO provides Fine-Grained Optimization by masking unexecuted code segments so that only executed code contributes to the optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to generate realistic enough output to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
arXiv Detail & Related papers (2023-12-05T09:44:45Z) - RLTF: Reinforcement Learning from Unit Test Feedback [17.35361167578498]
Reinforcement Learning from Unit Test Feedback (RLTF) is a novel online RL framework with multi-granularity unit test feedback for refining code LLMs.
Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code.
arXiv Detail & Related papers (2023-07-10T05:18:18Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
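Several entries above (Tango's verifier itself, and GenRM's reward modeling as next-token prediction) rely on the idea that a generative LLM can act as a verifier whose score is read directly from its next-token distribution. Below is a minimal, hypothetical scoring sketch of that idea; the model name, prompt template, and single-token handling are assumptions, not the released implementation of any paper listed here.

```python
# Hypothetical sketch of generative verification scored via next-token
# prediction: prompt a causal LM with (question, candidate solution) and read
# the verification score off the probability of a "Yes" continuation. The
# model name, prompt template, and single-token handling of " Yes" are
# illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def genrm_score(question: str, solution: str) -> float:
    """Return P("Yes" | verification prompt) as a scalar verification score."""
    prompt = (f"Question: {question}\n"
              f"Solution: {solution}\n"
              f"Is the solution correct? Answer Yes or No.\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # last position
    probs = torch.softmax(next_token_logits, dim=-1)
    # Take the first sub-token of " Yes"; real implementations may aggregate.
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()


if __name__ == "__main__":
    print(genrm_score("What is 2 + 3?", "2 + 3 = 5"))
```

In an RL setup like those above, such a scalar would serve as (part of) the reward signal for the generator's sampled solutions.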