Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
- URL: http://arxiv.org/abs/2508.14029v2
- Date: Wed, 20 Aug 2025 01:21:25 GMT
- Title: Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
- Authors: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
- Abstract summary: We propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training. This strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR.
- Score: 102.05010188302428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
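As a reading aid, here is a minimal sketch of how the SvS loop described in the abstract might be organized. The names (`generate`, `extract_answer`, `rlvr_update`, `Problem`) are hypothetical stand-ins, not the paper's API; the only detail taken from the abstract is that variational problems are synthesized from the policy's correct solutions while their reference answers stay identical to the originals.

```python
"""Illustrative sketch (not the paper's implementation) of an SvS-style loop:
correct rollouts on a problem seed a rephrased "variational" problem that keeps
the same reference answer, and the synthesized problems join the RLVR pool."""
from dataclasses import dataclass
import random
from typing import Callable, List

@dataclass
class Problem:
    statement: str
    reference_answer: str

def svs_step(generate: Callable[[str], str],
             extract_answer: Callable[[str], str],
             rlvr_update: Callable[[List[Problem]], None],
             problems: List[Problem],
             k_rollouts: int = 8) -> List[Problem]:
    """One hypothetical training step; `generate`, `extract_answer`, and
    `rlvr_update` stand in for the policy, the verifier, and the RL update."""
    synthesized: List[Problem] = []
    for prob in problems:
        rollouts = [generate(prob.statement) for _ in range(k_rollouts)]
        correct = [r for r in rollouts
                   if extract_answer(r) == prob.reference_answer]  # verifiable reward
        if correct:
            # Condition on a correct solution to write a variational problem
            # whose reference answer is kept identical to the original.
            prompt = ("Rewrite this problem so it reads differently but has the "
                      f"same answer.\nProblem: {prob.statement}\n"
                      f"Correct solution: {random.choice(correct)}")
            synthesized.append(Problem(statement=generate(prompt),
                                       reference_answer=prob.reference_answer))
    # Refreshing the pool with synthesized variants is what the abstract credits
    # with maintaining policy entropy and sustaining Pass@k improvements.
    rlvr_update(problems + synthesized)
    return synthesized
```

How synthesized problems are filtered, weighted, and scheduled is where the paper's actual method will differ from this sketch; the invariant kept here is only that reference answers are never changed.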
Related papers
- Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training [63.34044358216334]
ACTOR-CURATOR is a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines.
arXiv Detail & Related papers (2026-02-24T04:19:48Z)
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
- GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards [13.369116707284121]
Divergence-in-Behavior Attack (DIBA) is the first membership inference framework specifically designed for Reinforcement Learning with Verifiable Rewards. We show that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. This is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that training data exposure can be reliably inferred through behavioral traces.
arXiv Detail & Related papers (2025-11-18T01:51:34Z)
- Revisiting Entropy in Reinforcement Learning for Large Reasoning Models [54.96908589622163]
We investigate the entropy dynamics of large language models trained with Reinforcement Learning with Verifiable Rewards (RLVR). Our findings reveal that the number of off-policy updates, the diversity of training data, and the clipping thresholds in the optimization objective are critical factors influencing the entropy of LLMs trained with RLVR.
arXiv Detail & Related papers (2025-11-08T12:50:41Z)
- Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts: those with zero-variance rewards that provide no training signal (a minimal sketch of this condition appears after this list). We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
arXiv Detail & Related papers (2025-11-06T20:40:27Z)
- RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization [65.23034604711489]
We introduce RLoop, a self-improving framework for training large reasoning models. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
arXiv Detail & Related papers (2025-11-06T11:27:16Z)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR [25.56828724912418]
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates. We propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling.
arXiv Detail & Related papers (2025-09-02T17:22:46Z)
- ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism [11.234942110783077]
We introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning. Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68% relative improvement in the Pass@1 metric.
arXiv Detail & Related papers (2025-08-15T09:49:14Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks [49.0793012627959]
We present VAPO, a novel framework tailored for reasoning models within the value-based paradigm. VAPO attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points.
arXiv Detail & Related papers (2025-04-07T14:21:11Z)
- Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training [43.97565851415018]
Robust Reinforcement Learning (RL) focuses on improving performance under model errors or adversarial attacks.
Most of the existing literature models Robust Adversarial RL (RARL) as a zero-sum simultaneous game with Nash equilibrium as the solution concept.
We introduce a novel hierarchical formulation of robust RL - a general-sum Stackelberg game model called RRL-Stack.
arXiv Detail & Related papers (2022-02-19T03:44:05Z)
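The "residual prompt" notion in the Explore Data Left Behind entry above can be made concrete with a small sketch: under group-based RLVR updates (GRPO-style), a prompt whose sampled rollouts all receive the same verifiable reward has zero reward variance, so group-normalized advantages vanish and the prompt contributes no gradient signal. The helper below is illustrative only; the names and the 0/1 reward convention are assumptions, not that paper's code.

```python
"""Hedged sketch: detect prompts whose rollout rewards have zero variance and
therefore provide no training signal in group-normalized RLVR updates."""
from statistics import pstdev
from typing import Dict, List

def find_residual_prompts(rewards_per_prompt: Dict[str, List[float]]) -> List[str]:
    """Return prompts whose rollout rewards are all identical (all 0s or all 1s)."""
    residual = []
    for prompt, rewards in rewards_per_prompt.items():
        if pstdev(rewards) == 0.0:  # identical rewards => zero-centered advantages
            residual.append(prompt)
    return residual

# Example: with 4 rollouts per prompt and 0/1 verifiable rewards,
# "p1" and "p3" are residual (no training signal), "p2" is not.
rollout_rewards = {"p1": [1, 1, 1, 1], "p2": [0, 1, 0, 1], "p3": [0, 0, 0, 0]}
print(find_residual_prompts(rollout_rewards))  # -> ['p1', 'p3']
```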