Group Sequence Policy Optimization
- URL: http://arxiv.org/abs/2507.18071v2
- Date: Mon, 28 Jul 2025 11:11:33 GMT
- Title: Group Sequence Policy Optimization
- Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
- Abstract summary: Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant reinforcement learning algorithm. GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
- Score: 55.40088895148603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
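As a rough illustration of the sequence-level formulation described in the abstract, the sketch below computes a clipped, group-relative objective from sequence likelihood ratios. It assumes a length-normalized sequence ratio (the geometric mean of the token-level ratios) and GRPO-style group-normalized advantages; the function and argument names (gspo_loss, clip_eps, etc.) are illustrative and not the paper's reference implementation.

```python
import torch


def gspo_loss(logp_new, logp_old, rewards, lengths, clip_eps=0.2):
    """Sequence-level clipped policy loss for one group of G responses.

    logp_new, logp_old: (G,) summed token log-probabilities of each response
        under the current policy and the old (behavior) policy; logp_old is
        assumed to be detached / computed under torch.no_grad().
    rewards: (G,) scalar rewards for the G responses to the same query.
    lengths: (G,) response lengths, used to length-normalize the ratio.
    """
    # Group-normalized advantage, GRPO-style.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level importance ratio from sequence likelihood,
    # length-normalized (geometric mean of the token-level ratios).
    ratio = torch.exp((logp_new - logp_old) / lengths)

    # Clip the whole sequence's ratio rather than individual tokens, and
    # take the pessimistic (minimum) objective as in PPO-style clipping.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio and the clipping operate on whole responses rather than individual tokens, an entire response is either kept or excluded from the update, which the paper associates with more stable training, including for MoE models.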
Related papers
- Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward [10.640867597958863]
We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefixes via a Shared-Prefix Forward strategy. By restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO.
arXiv Detail & Related papers (2025-06-05T09:13:37Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal Reward Baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models outperform the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO [68.44918104224818]
Autoregressive image generation presents unique challenges distinct from Chain-of-Thought (CoT) reasoning. This study provides the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation. Our findings reveal that GRPO and DPO exhibit distinct advantages and, crucially, that reward models with stronger intrinsic generalization capabilities potentially enhance the generalization of the applied RL algorithms.
arXiv Detail & Related papers (2025-05-22T17:59:49Z)
- Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning [36.00719049772089]
We propose the Trust Region Preference Approximation (TRPA) algorithm. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability.
arXiv Detail & Related papers (2025-04-06T15:48:26Z)
- Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning [0.0]
Entropy-Guided Sequence Weighting (EGSW) is a novel approach that enhances the exploration-exploitation tradeoff. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates.
arXiv Detail & Related papers (2025-03-28T14:07:51Z)
- RL-finetuning LLMs from on- and off-policy data with a single algorithm [53.70731390624718]
We introduce a novel reinforcement learning algorithm (AGRO) for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies the notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via the sample-based policy gradient and provide theoretical guarantees on their convergence.
arXiv Detail & Related papers (2025-03-25T12:52:38Z)
- PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis [9.617652261815671]
We introduce two sequence optimization strategies: the rule-based static optimization and the score-based dynamic optimization. Based on the dynamic optimization structure, we propose a unified Prompt-based Generative Sequence Optimization network (named PGSO). Experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.
arXiv Detail & Related papers (2024-12-01T10:49:55Z)
- Orthogonally Initiated Particle Swarm Optimization with Advanced Mutation for Real-Parameter Optimization [0.04096453902709291]
This article introduces an enhanced particle swarm optimization (PSO) algorithm, termed Orthogonal PSO with Mutation (OPSO-m).
It proposes an array-based learning approach to cultivate an improved initial swarm for PSO, significantly boosting the adaptability of swarm-based optimization algorithms.
The article further presents archive-based self-adaptive learning strategies, dividing the population into regular and elite subgroups.
arXiv Detail & Related papers (2024-05-21T07:16:20Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance similar to or stronger than PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
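The REBEL entry above frames policy optimization as regressing relative rewards. A minimal sketch of such a pairwise regression objective, assuming two sampled responses per prompt and a step-size hyperparameter eta (the names are illustrative and not taken from the paper's code), could look like:

```python
import torch


def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta=1.0):
    """Regress relative rewards onto relative log-probability ratios.

    logp_*: summed token log-probabilities of two responses (a, b) to the
        same prompt under the current policy (new) and the previous-iteration
        policy (old, detached).
    reward_*: scalar rewards of the two responses.
    """
    # Difference of the two responses' log-probability ratios.
    delta = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    # Least-squares regression of the scaled ratio gap onto the reward gap.
    return ((delta / eta - (reward_a - reward_b)) ** 2).mean()
```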
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.