LLMs Can Learn to Reason Via Off-Policy RL
- URL: http://arxiv.org/abs/2602.19362v2
- Date: Fri, 27 Feb 2026 20:02:52 GMT
- Title: LLMs Can Learn to Reason Via Off-Policy RL
- Authors: Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun,
- Abstract summary: Reinforcement learning approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO.<n>We propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL)<n>OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
- Score: 17.2941334301927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break this assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS), or by more closely aligning the training and inference policies by explicitly modifying the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench, while using 3x fewer generations during training. We further empirically demonstrate that models trained via OAPL have improved test time scaling under the Pass@k metric. OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
Related papers
- Align and Filter: Improving Performance in Asynchronous On-Policy RL [27.989398323927393]
We identify the sources of policy lag caused by distributed learning and high update frequency.<n>We propose textittotal Variation-based Advantage aligned Constrained policy Optimization as a practical approach to mitigate policy lag.
arXiv Detail & Related papers (2026-03-02T01:52:34Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models [67.41779761651924]
SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level.<n>It consistently outperforms standard on-policy training and existing off-policy extensions.
arXiv Detail & Related papers (2026-01-29T09:56:15Z) - Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends [64.71326476563213]
Off-policy reinforcement learning for large language models (LLMs) is attracting growing interest.<n>We present a first-principles derivation for grouprelative REINFORCE without assuming a specific training data distribution.<n>This perspective yields two general principles for adapting REINFORCE to off-policy settings.
arXiv Detail & Related papers (2025-09-29T02:34:54Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.<n>OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.<n>Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Soft Policy Optimization: Online Off-Policy RL for Sequence Models [42.95110169230739]
Post-training of language models is almost exclusively done using on-policy methods such as PPO.<n>SPO is a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories.
arXiv Detail & Related papers (2025-03-07T14:23:40Z) - Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL)<n>We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z) - Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [43.77763433288893]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.<n>We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient.<n>We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.