Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
- URL: http://arxiv.org/abs/2512.01374v3
- Date: Wed, 03 Dec 2025 15:17:19 GMT
- Title: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
- Authors: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin,
- Abstract summary: We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
- Score: 61.361819972410046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
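For orientation, the importance sampling correction and clipping that the abstract names can be sketched as a generic PPO-style token-level surrogate loss. This is a minimal illustration of those two techniques, not the paper's exact objective; the function name and the per-token advantage inputs are assumptions.

```python
import math

def clipped_pg_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-level surrogate with importance-sampling correction and clipping.

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs under the (possibly stale) behavior policy
    advantages: per-token advantage estimates (e.g. a broadcast sequence reward)
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                 # importance weight per token
        unclipped = ratio * adv
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic minimum limits how far a stale ratio can push the update.
        total += min(unclipped, clipped_ratio * adv)
    return -total / len(advantages)
```

When the rollout is exactly on-policy (`logp_new == logp_old`), every ratio is 1 and the loss reduces to the plain REINFORCE surrogate; the clipping only bites as policy staleness grows, which matches the abstract's point that the surrogate is valid only when training-inference discrepancy and staleness are small.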
Related papers
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning [75.73135757250806]
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting.
arXiv Detail & Related papers (2026-02-25T03:43:34Z)
- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training [18.849117699859622]
Training stability is a central challenge in reinforcement learning for large language models. We propose Variational sEquence-level Soft Policy Optimization (VESPO). Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution.
arXiv Detail & Related papers (2026-02-11T09:48:08Z)
- Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [29.56905427210088]
gradient-ARM is a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. We show that gradient-ARM achieves state-of-the-art performance among baselines on benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
arXiv Detail & Related papers (2026-02-02T00:50:53Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target. We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
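The IQR-based trajectory filtering described in this summary can be illustrated with a generic sketch. The standard Q1 − k·IQR lower fence used here is an assumption; the paper's exact pruning criterion may differ.

```python
def iqr_prune_low_entropy(entropies, k=1.5):
    """Return indices of trajectories whose entropy is NOT a low outlier.

    Uses the standard interquartile-range fence: drop values below Q1 - k*IQR.
    """
    xs = sorted(entropies)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    lower = q1 - k * (q3 - q1)
    return [i for i, e in enumerate(entropies) if e >= lower]
```

For example, in a batch with entropies `[0.9, 1.0, 1.1, 1.2, 0.01]`, only the near-deterministic fifth trajectory falls below the fence and is pruned, which is the failure pattern (entropy collapse) such filtering targets.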
arXiv Detail & Related papers (2025-12-15T08:07:23Z)
- RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization [65.23034604711489]
We introduce RLoop, a self-improving framework for training large reasoning models. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
arXiv Detail & Related papers (2025-11-06T11:27:16Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
- Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories. We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
- CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models [11.295986905174635]
Rule-based reinforcement learning (RL) has improved the reasoning capability of language models (LMs) with rule-based rewards. Existing RL methods often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. We propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs.
arXiv Detail & Related papers (2025-05-18T17:44:53Z)
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [2.743898388459522]
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps.
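Selecting a learning rate by policy performance, as this summary describes, can be framed as a multi-armed bandit over candidate rates. The following ε-greedy sketch is purely illustrative: the arm set, reward signal, and incremental-mean update are assumptions, not LRRL's actual algorithm.

```python
import random

class LRBandit:
    """Epsilon-greedy bandit that picks a learning rate by recent policy return."""

    def __init__(self, lrs, epsilon=0.1):
        self.lrs = lrs                     # candidate learning rates (the arms)
        self.epsilon = epsilon             # exploration probability
        self.counts = [0] * len(lrs)
        self.values = [0.0] * len(lrs)     # running mean reward per arm

    def select(self):
        """Return the index of the arm to use for the next training phase."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.lrs))
        return max(range(len(self.lrs)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        """Credit the observed policy return to the arm that was used."""
        self.counts[arm] += 1
        # Incremental mean: V <- V + (r - V) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In use, the trainer would call `select()` before each evaluation window, train with `lrs[arm]`, then call `update(arm, mean_return)`; arms whose learning rate yields higher returns are chosen more often as training dynamics shift.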
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.