Fugu-MT Paper Translation (Summary): FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Paper summary: FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.20256v1
Date: Mon, 18 May 2026 12:48:36 GMT
Status: translation complete
Last updated in system: 2026-05-21 19:19:56.241827
Title: FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
Authors: Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu
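Abstract summary: We propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, the model performs Feedback-Guided Exploration Enhancement based on the feedback provided by the environment. Under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines.
Reference score (attention score computed independently by this site): 16.200486964371713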
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopts a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment (EPA) and Exploration-oriented Capability Cultivation (ECC). Extensive experiments demonstrate that EPA and ECC can mutually reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning. Specifically, under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training.
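The stall described in the abstract can be illustrated with the group-normalized advantage commonly associated with GRPO. The sketch below is not taken from the paper: the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the binary rewards are all assumptions chosen for illustration. It shows only that when every rollout for a prompt receives the same reward, every advantage is zero and the update carries no direction.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: eps and the population standard deviation are
    illustrative choices, not details confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt within the policy's reach: mixed outcomes give a usable signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt beyond the policy's reach: every rollout fails, so every
# advantage is zero and the policy-gradient term for this prompt vanishes.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

In this toy setting, only the mixed-outcome group pushes the policy in any direction, which is the situation the abstract says feedback-guided exploration is meant to avoid.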


Related papers