Fugu-MT paper translation (abstract): Soft Adaptive Policy Optimization

Paper overview: Soft Adaptive Policy Optimization

arxiv url: http://arxiv.org/abs/2511.20347v2
Date: Mon, 01 Dec 2025 12:02:46 GMT
Status: Translation complete
Last updated in system: 2025-12-02 15:37:38.331331
Title: Soft Adaptive Policy Optimization
Title (reference translation): Soft Adaptive Policy Optimization
Authors: Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin
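Abstract summary: Reinforcement learning plays an increasingly important role in enhancing the reasoning capabilities of large language models. Existing group-based policy optimization methods such as GSPO and GRPO alleviate the high variance of token-level importance ratios via hard clipping. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate.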
Reference score (independently computed attention score): 67.61886077470528
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
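Abstract (reference translation): Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs). Token-level importance ratios often exhibit high variance, a phenomenon that is exacerbated in Mixture-of-Experts models and leads to unstable updates. Existing group-based policy optimization methods such as GSPO and GRPO alleviate this problem via hard clipping, which makes it difficult to achieve both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy tokens, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Experimental results on mathematical reasoning benchmarks show that SAPO improves training stability and Pass@1 performance. Furthermore, we use SAPO to train the Qwen3-VL model series and show that it yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.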
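
The contrast the abstract draws between hard clipping and a smooth, temperature-controlled gate can be illustrated with a minimal sketch. The abstract does not give SAPO's actual gate function, so the sigmoid-shaped gate below is an assumption chosen only for illustration, as are the names `hard_clip_weight` and `soft_gate_weight`, the clip range `eps`, and the temperature `tau`. The hard-clipping function follows the standard PPO-style clipped surrogate, where a token's gradient is either kept in full or dropped entirely.

```python
import math


def hard_clip_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient weight of one token under PPO-style hard clipping.

    The surrogate min(r * A, clip(r, 1 - eps, 1 + eps) * A) passes the
    gradient through unchanged (weight 1) or removes it (weight 0).
    eps = 0.2 is an illustrative value, not taken from the paper.
    """
    if advantage >= 0:
        return 1.0 if ratio <= 1.0 + eps else 0.0
    return 1.0 if ratio >= 1.0 - eps else 0.0


def soft_gate_weight(ratio: float, tau: float = 10.0) -> float:
    """Hypothetical smooth gate (an assumption, not the paper's formula).

    Uses 4 * s * (1 - s) with s = sigmoid(tau * (ratio - 1)), which equals 1
    when the token is on-policy (ratio = 1) and decays smoothly toward 0 as
    the ratio drifts away. A larger tau gives a narrower gate.
    """
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)


if __name__ == "__main__":
    # Compare both weights for a positive-advantage token at several ratios.
    for r in (0.8, 1.0, 1.1, 1.3, 1.8):
        print(r, hard_clip_weight(r, advantage=1.0), round(soft_gate_weight(r), 4))
```

In this sketch a token with a ratio just past the clip boundary loses its whole gradient under hard clipping, while the soft gate only shrinks it. That is the qualitative behaviour the abstract attributes to SAPO, though the real method may differ in form and in how it treats positive and negative advantages.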


Related papers