Fugu-MT Paper Translation (Abstract): Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Paper summary: Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

arxiv url: http://arxiv.org/abs/2509.24203v1
Date: Mon, 29 Sep 2025 02:34:54 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.697393
Title: Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Title (reference translation): Group-Relative REINFORCE is an off-policy algorithm that unravels the mysteries surrounding GRPO and its friends
Authors: Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
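Abstract summary: Off-policy reinforcement learning for large language models (LLMs) is attracting attention. This paper presents a first-principles derivation of group-relative REINFORCE that does not assume a specific training data distribution. This perspective yields two general principles for adapting REINFORCE to off-policy settings.
Reference score (independently computed attention metric): 64.71326476563213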
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
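Abstract (reference translation): Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovation in RL methodology. Classic REINFORCE and its modern variant GRPO (Group Relative Policy Optimization) are usually regarded as on-policy algorithms with limited tolerance for off-policy data. This work derives group-relative REINFORCE from first principles without assuming a specific training data distribution, and shows that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates and actively shaping the data distribution. The analysis demystifies myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms, Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE), as regularized forms of the REINFORCE loss, and provides theoretical justification for seemingly heuristic data-weighting strategies. These findings lead to actionable insights that are validated by extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.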
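
To make the term "group-relative REINFORCE" concrete, here is a minimal sketch of a surrogate loss of that kind. It assumes the common formulation in which each sampled response's reward is baselined by the mean reward of its group. The function names, the sequence-level log-probabilities, and the mean baseline without standard-deviation normalization are all illustrative assumptions. None of them is taken from the paper or its released code.

```python
def group_relative_advantages(rewards):
    """Baseline each reward by the mean reward of its group (assumed form)."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


def group_relative_reinforce_loss(logprobs, rewards):
    """Surrogate loss: -(1/K) * sum_i (r_i - mean(r)) * log pi(y_i | x).

    logprobs: log-probabilities of the K responses under the current policy.
    rewards:  scalar rewards of the same K responses.
    In an autodiff framework the advantages would be treated as constants.
    """
    advantages = group_relative_advantages(rewards)
    weighted = [a * lp for a, lp in zip(advantages, logprobs)]
    return -sum(weighted) / len(rewards)


# Hypothetical group of K = 4 responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
logprobs = [-2.0, -1.0, -3.0, -4.0]

advantages = group_relative_advantages(rewards)
loss = group_relative_reinforce_loss(logprobs, rewards)
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
print(loss)        # 0.25
```

The sketch only evaluates the loss value. The regularization and data-shaping principles mentioned in the abstract are not implemented here.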

Related papers