Soft Sequence Policy Optimization
- URL: http://arxiv.org/abs/2602.19327v2
- Date: Thu, 26 Feb 2026 08:54:51 GMT
- Title: Soft Sequence Policy Optimization
- Authors: Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko,
- Abstract summary: We introduce Soft Sequence Policy Optimization (SSPO) as an off-policy reinforcement learning objective.<n>SSPO incorporates soft gating functions over token-level probability ratios within sequence-level importance weights.<n>We show that SSPO improves training stability and performance in mathematical reasoning tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift toward sequence-level importance sampling weights that better align with the sequence-level rewards used in many tasks, and (ii) alternatives to PPO-style clipping that aim to avoid the associated loss of training signal and entropy collapse. We introduce Soft Sequence Policy Optimization, an off-policy reinforcement learning objective that incorporates soft gating functions over token-level probability ratios within sequence-level importance weights. We provide theoretical motivation for SSPO and investigate practical modifications to improve optimization behavior. Empirically, we show that SSPO improves training stability and performance in mathematical reasoning tasks.
Related papers
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models.<n> token-level correction often leads to unstable training dynamics when the degree of off-policyness is large.<n>We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO)
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs)<n>We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck.<n>We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Soft Adaptive Policy Optimization [67.61886077470528]
Reinforcement learning plays an increasingly important role in enhancing the reasoning capabilities of large language models.<n>Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping.<n>We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate.
arXiv Detail & Related papers (2025-11-25T14:25:19Z) - GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks.<n>It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates.<n>It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
arXiv Detail & Related papers (2025-10-25T14:51:17Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase.<n>We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro.<n>Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
arXiv Detail & Related papers (2025-10-01T09:11:27Z) - Improving DAPO from a Mixed-Policy Perspective [6.237966553429477]
This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm.<n>We first propose a method that incorporates a pre-trained, stable guiding policy to provide off-policy experience.<n>We then extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies.
arXiv Detail & Related papers (2025-07-17T09:12:09Z) - A dynamical clipping approach with task feedback for Proximal Policy Optimization [29.855219523565786]
There is no theoretical proof that the optimal PPO clipping bound remains consistent throughout the entire training process.
Past studies have aimed to dynamically adjust PPO clipping bound to enhance PPO's performance.
We propose Preference based Proximal Policy Optimization (Pb-PPO) to better reflect the preference (maximizing Return) of reinforcement learning tasks.
arXiv Detail & Related papers (2023-12-12T06:35:56Z) - Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to both the PPO objective and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, textsfPARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed textsfPARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - Coordinated Proximal Policy Optimization [28.780862892562308]
Coordinated Proximal Policy Optimization (CoPPO) is an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting.
We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective.
We then interpret that such an objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high variance issue during the concurrent update of agent policies.
arXiv Detail & Related papers (2021-11-07T11:14:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.