SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models
- URL: http://arxiv.org/abs/2601.21476v1
- Date: Thu, 29 Jan 2026 09:56:15 GMT
- Title: SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models
- Authors: Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong
- Abstract summary: SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level. It consistently outperforms standard on-policy training and existing off-policy extensions.
- Score: 67.41779761651924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, such as Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the $\textbf{S}$ingle-sample Mix-p$\textbf{O}$licy $\textbf{U}$nified $\textbf{P}$aradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to a sequence prefix sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Further analysis clarifies how this fine-grained, single-sample mix-policy training improves both exploration and final performance in LLM RL.
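To make the abstract's mechanism concrete, here is a minimal sketch of what a token-level mix-policy loss could look like. It is an illustration under stated assumptions, not the paper's implementation: the PPO-style clipping, the scalar group-relative advantage, and all names are assumptions layered on the description above (off-policy prefix, on-policy continuation, token-level importance ratios).

```python
# Minimal sketch (not the authors' code) of a token-level mix-policy loss:
# the first tokens of a sample come from a historical policy, the rest from
# the current one, and per-token importance ratios correct the mismatch.
import torch

def soup_style_loss(logp_new, logp_gen, advantage, clip_eps=0.2):
    """logp_new: [T] per-token log-probs under the current policy;
    logp_gen: [T] log-probs under whichever policy generated each token;
    advantage: scalar (e.g., a group-relative advantage as in GRPO)."""
    # Token-level importance ratio pi_theta(a_t|s_t) / pi_gen(a_t|s_t).
    # On-policy continuation tokens have logp_gen == logp_new, so ratio = 1;
    # off-policy prefix tokens are reweighted toward the current policy.
    ratio = torch.exp(logp_new - logp_gen.detach())
    # PPO-style clipping (an assumption here) keeps stale prefix tokens
    # from destabilizing the update.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy usage: a 10-token sequence whose first 4 tokens were drawn from an
# older checkpoint; the remaining 6 tokens are on-policy (ratio exactly 1).
torch.manual_seed(0)
logp_new = -torch.rand(10)
logp_gen = logp_new.clone()
logp_gen[:4] -= 0.3 * torch.rand(4)   # historical-policy prefix
print(soup_style_loss(logp_new, logp_gen, advantage=torch.tensor(1.0)))
```

With a positive advantage, the loss pushes the current policy toward the sampled tokens; the ratios temper the contribution of the stale prefix while leaving the on-policy continuation untouched.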
Related papers
- LLMs Can Learn to Reason Via Off-Policy RL [17.2941334301927]
Reinforcement learning approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO, which require fresh samples from the current policy.
We propose a novel off-policy RL algorithm that lifts this requirement: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL).
OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
arXiv Detail & Related papers (2026-02-22T22:12:51Z)
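The lag-tolerant setup described in the OAPL entry above can be sketched generically. The snippet assumes a plain importance-weighted policy-gradient loss and a generation policy refreshed only every `lag` optimizer steps; the summary does not give OAPL's actual advantage-based objective, so this is a stand-in, not its implementation.

```python
# Generic sketch of training against a lagged inference policy. The
# importance-weighted REINFORCE-style loss is a stand-in, not OAPL's
# actual advantage-based objective.
import copy
import torch

class LaggedSampler:
    """Holds a frozen copy of the policy for generation, refreshed only
    every `lag` optimizer steps, so updates proceed off-policy in between."""
    def __init__(self, policy, lag=400):
        self.lag, self.steps = lag, 0
        self.frozen = copy.deepcopy(policy).eval()

    def step(self, policy):
        self.steps += 1
        if self.steps % self.lag == 0:
            self.frozen.load_state_dict(policy.state_dict())

def off_policy_pg_loss(logp_new, logp_lagged, advantage):
    # The ratio corrects for the gap between the training policy and the
    # (possibly hundreds of steps stale) policy that produced the sample.
    ratio = torch.exp(logp_new - logp_lagged.detach())
    return -(ratio * advantage).mean()

policy = torch.nn.Linear(16, 16)          # placeholder policy network
sampler = LaggedSampler(policy, lag=400)  # generation uses sampler.frozen
```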
- Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks.
But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy.
We propose a stabilization method for REINFORCE/GRPO-style algorithms that scales the learning rate based on effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z)
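The effective-sample-size mechanism in the entry above is concrete enough to sketch. The ESS formula below is the standard one for importance weights; multiplying the learning rate by ESS/N is one illustrative reading of "scales learning rate based on effective sample size", not necessarily the paper's exact rule.

```python
# Sketch: when a batch is highly off-policy, its importance weights
# concentrate on a few samples, the effective sample size (ESS) drops,
# and the update is damped accordingly.
import torch

def effective_sample_size(weights):
    # ESS = (sum w)^2 / sum w^2; equals len(weights) for uniform weights.
    return weights.sum() ** 2 / (weights ** 2).sum()

def scaled_lr(base_lr, logp_new, logp_behavior):
    w = torch.exp(logp_new - logp_behavior)      # per-sample importance weights
    frac = effective_sample_size(w) / w.numel()  # in (0, 1]
    return base_lr * frac.item()

# Uniform weights keep the full LR; skewed weights shrink it.
print(scaled_lr(1e-5, torch.zeros(4), torch.zeros(4)))                  # -> 1e-05
print(scaled_lr(1e-5, torch.tensor([0., 0., 0., 3.]), torch.zeros(4)))  # smaller
```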
- ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm [2.6813717321945103]
We propose a new PPO variant that combines the stability guarantee of conservative on-policy iteration with more efficient off-policy data utilization.
Compared with PPO and other state-of-the-art variants, ExO-PPO demonstrates improved performance with balanced sample efficiency and stability across varied tasks.
arXiv Detail & Related papers (2026-02-10T12:29:57Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard RL algorithm for fine-tuning Large Language Models (LLMs).
We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled divergence-based constraint.
DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
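As a sketch of what substituting a divergence constraint for the clip (the direction the DPPO entry above describes) could look like, the snippet below adds a per-token KL penalty to an unclipped policy-gradient loss. The k3 KL estimator and the fixed coefficient `beta` are assumptions; the paper's specific divergence and constraint form may differ.

```python
# Sketch: swap PPO's ratio clipping for an explicit divergence penalty.
# The k3 KL estimator and the fixed coefficient `beta` are assumptions.
import torch

def divergence_penalized_loss(logp_new, logp_old, advantage, beta=0.1):
    # Unclipped token-level policy gradient with importance ratios.
    ratio = torch.exp(logp_new - logp_old.detach())
    pg = -(ratio * advantage).mean()
    # k3 estimator of KL(pi_old || pi_new) on tokens sampled from pi_old:
    # (r - 1) - log r, which is non-negative for every token.
    kl = (ratio - 1.0 - torch.log(ratio)).mean()
    return pg + beta * kl

# Identical policies incur zero KL penalty; diverging ones are penalized.
lp = -torch.rand(10)
print(divergence_penalized_loss(lp, lp.clone(), advantage=torch.tensor(1.0)))
```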
- Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts.
We analyze how the sampling policy's coverage evolves throughout on-policy training.
We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.
OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.
Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
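The variance-reducing role of a reward baseline, central to the OPO entry above, fits in a few lines. Since the summary does not spell out the form of the optimal baseline, this sketch substitutes the group-mean reward, the common choice in critic-free methods; only the general baselined policy-gradient structure should be taken from it.

```python
# Baselined REINFORCE over a group of rollouts for one prompt. The
# group-mean baseline is a stand-in; OPO derives an optimal baseline
# whose exact form the summary above does not give.
import torch

def baselined_pg_loss(logp_sums, rewards):
    """logp_sums: [G] summed token log-probs of G on-policy rollouts;
    rewards: [G] scalar rewards for those rollouts."""
    advantage = rewards - rewards.mean()           # centered returns
    return -(advantage.detach() * logp_sums).mean()

# Rollouts with above-average reward get their log-probability pushed up.
print(baselined_pg_loss(logp_sums=-torch.rand(4),
                        rewards=torch.tensor([1., 0., 1., 0.])))
```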
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models.
It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive.
We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
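The decoupling that the PODS entry above describes can be sketched directly: generate many rollouts cheaply, then run the expensive update on only k of them. Keeping the highest- and lowest-reward rollouts to preserve reward contrast is one plausible selection rule; the summary does not specify the paper's criterion.

```python
# Sketch of PODS-style decoupling: generate many rollouts, train on k.
# The keep-the-extremes selection rule below is an assumption.
import torch

def downsample_rollouts(rollouts, rewards, k):
    """Keep the (k - k//2) lowest- and (k//2) highest-reward rollouts."""
    order = torch.argsort(rewards)                 # ascending by reward
    keep = torch.cat([order[: k - k // 2],
                      order[order.numel() - k // 2 :]])
    return [rollouts[i] for i in keep.tolist()], rewards[keep]

rollouts = [f"rollout_{i}" for i in range(8)]      # stand-ins for sequences
rewards = torch.tensor([0.1, 0.9, 0.4, 0.8, 0.0, 0.7, 0.2, 1.0])
subset, kept_rewards = downsample_rollouts(rollouts, rewards, k=4)
print(subset)   # the two lowest- and two highest-reward rollouts
```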
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling [3.5253513747455303]
We introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy.
arXiv Detail & Related papers (2023-11-14T16:37:28Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization (MUSBO).
Our framework discovers novel and high-quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z)