Related papers: Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

URL: http://arxiv.org/abs/2509.24203v1
Date: Mon, 29 Sep 2025 02:34:54 GMT
Title: Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Authors: Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding,
Abstract summary: Off-policy reinforcement learning for large language models (LLMs) is attracting growing interest.<n>We present a first-principles derivation for grouprelative REINFORCE without assuming a specific training data distribution.<n>This perspective yields two general principles for adapting REINFORCE to off-policy settings.
Score: 64.71326476563213
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.

Related papers

RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization [40.41228010377401]
We propose Rephrasing Policy Optimization (RePO) to reconcile off-policy knowledge with the stability of on-policy RL.<n>RePO rephrases off-policy knowledge into trajectories that conform to its own stylistic and parametric distribution.<n> Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines.
arXiv Detail & Related papers (2026-02-11T13:02:40Z)
Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments [3.905774454930983]
Deep reinforcement learning achieves remarkable performance but lacks interpretability.<n> SILVER framework explains RL policy via Shapley-based regression but remains restricted to low-dimensional, binary-action domains.<n>We propose SILVER with RL-guided labeling, an enhanced variant that extends SILVER to multi-action and high-dimensional environments.
arXiv Detail & Related papers (2025-10-22T05:04:43Z)
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting [40.80967570661867]
Adapting language models to new tasks via post-training carries the risk of degrading existing capabilities.<n>We compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL)<n>RL leads to less forgetting than SFT while achieving comparable or higher target task performance.
arXiv Detail & Related papers (2025-10-21T17:59:41Z)
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting [71.64063986651819]
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs)<n>Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data.<n>We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
arXiv Detail & Related papers (2025-08-15T11:20:03Z)
Dynamics of Resource Allocation in O-RANs: An In-depth Exploration of On-Policy and Off-Policy Deep Reinforcement Learning for Real-Time Applications [0.6752538702870792]
This paper investigates the application of two DRL models, on-policy and off-policy, in the field of resource allocation for Open Radio Access Networks (O-RAN)<n>Motivated by the original work of Nessrine Hammami and Kim Khoa Nguyen, this study is a replication to validate and prove the findings.
arXiv Detail & Related papers (2024-11-17T17:46:40Z)
REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.<n>In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.<n>We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities. We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks. We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
MOPO: Model-based Offline Policy Optimization [183.6449600580806]
offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. We show that an existing model-based RL algorithm already produces significant gains in the offline setting. We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.