Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
- URL: http://arxiv.org/abs/2510.01161v2
- Date: Tue, 28 Oct 2025 03:28:48 GMT
- Title: Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
- Authors: Haizhong Zheng, Jiawei Zhao, Beidi Chen
- Abstract summary: We show that stale data can be as informative as on-policy data if exploited properly. We introduce M2PO, which constrains the second moment of importance weights to suppress only extreme outliers. M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
- Score: 34.57113614859523
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
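The second-moment constraint described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `second_moment_mask`, the choice to measure the squared log importance ratio, the greedy "drop the largest first" rule, and the default threshold are all assumptions made for illustration only.

```python
import numpy as np

def second_moment_mask(logp_new, logp_old, threshold=0.04):
    """Return a boolean keep-mask over tokens (hypothetical helper).

    Assumption: the "second moment" is taken as the batch mean of the
    squared log importance ratio, and the threshold is a placeholder value.
    Tokens with the largest squared log-ratio are masked first, until the
    mean over the kept tokens is at or below the threshold.
    """
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    sq = log_ratio ** 2

    # Sort ascending so that the k smallest values form the candidate kept set.
    order = np.argsort(sq)
    # Mean of the k smallest squared log-ratios, for k = 1..n.
    # This sequence is non-decreasing, so it can be searched directly.
    cum_mean = np.cumsum(sq[order]) / np.arange(1, sq.size + 1)
    # Largest k whose mean still satisfies the constraint.
    n_keep = int(np.searchsorted(cum_mean, threshold, side="right"))

    keep = np.zeros(sq.size, dtype=bool)
    keep[order[:n_keep]] = True
    return keep

# Toy example: four nearly on-policy tokens and one extreme outlier.
logp_old = np.zeros(5)
logp_new = np.array([0.01, -0.02, 0.0, 0.05, 2.0])
mask = second_moment_mask(logp_new, logp_old)
```

In this toy batch only the outlier token is masked, while the four well-behaved tokens still contribute to the update. That mirrors the stated goal of suppressing extreme outliers without discarding informative tokens, in contrast to a fixed per-token clipping range.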
Related papers
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens [38.425692691443764]
Existing Reinforcement Learning (RL) fine-tuning methods rely heavily on entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. We propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement.
arXiv Detail & Related papers (2026-02-17T14:46:48Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse. $R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z)
- Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data [89.96277093034547]
We introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. We show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training.
arXiv Detail & Related papers (2025-12-29T12:35:51Z)
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning [33.899779762210976]
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem. Existing methods mitigate this issue with KL penalties or clipping, which passively restrict updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training.
arXiv Detail & Related papers (2025-09-18T17:02:30Z)
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
- Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model [56.92219181993453]
We propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix) to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady improvement.
arXiv Detail & Related papers (2025-07-09T14:29:45Z)
- ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
arXiv Detail & Related papers (2025-05-26T12:23:26Z)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation [66.36912000442608]
NoisyRollout is a simple yet effective data augmentation method. It mixes training trajectories from both clean and moderately distorted images. It achieves state-of-the-art performance among open-source RL-tuned models.
arXiv Detail & Related papers (2025-04-17T16:10:13Z)
- MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL [20.22674077197914]
Recent work has explored updating neural networks with large numbers of gradient steps for every new sample. High update-to-data ratios introduce instability to the training process. Our method, Model-Augmented Data for TD Learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training.
arXiv Detail & Related papers (2024-10-11T15:13:17Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.