Fugu-MT Paper Translation (Abstract): OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Paper overview: OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

arxiv url: http://arxiv.org/abs/2604.04142v1
Date: Sun, 05 Apr 2026 15:00:29 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.943166
Title: OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
Authors: Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng, Shibo He, Chao Li
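Abstract summary: This paper proposes OP-GRPO, an off-policy GRPO framework tailored to flow-matching models. It actively selects high-quality trajectories and adaptively adds them to a replay buffer for reuse in subsequent training iterations.
Reference score (independently computed attention measure): 14.396100082949005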
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
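Illustrative sketch (not from the paper): the abstract's second and third points, a sequence-level importance ratio combined with clipping and with truncation of late denoising steps, can be made concrete with a small Python example. Everything here is an assumption made for illustration: the function names, the choice of the plain product of per-step ratios as the sequence-level ratio, the clip range of 0.2, and the toy log-probabilities. The abstract does not specify the authors' exact formulation, so this should be read as a generic clipped off-policy surrogate, not as the OP-GRPO implementation.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def sequence_ratio(logp_new, logp_behavior, keep_steps):
    """One importance ratio per trajectory (assumed form: product of step ratios).

    Only the first `keep_steps` denoising steps are used, so late steps
    are truncated and cannot dominate the ratio.
    """
    pairs = zip(logp_new[:keep_steps], logp_behavior[:keep_steps])
    return math.exp(sum(new - old for new, old in pairs))


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped objective applied to a single sequence-level ratio."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Toy trajectory with three denoising steps; the last step has a large
# log-probability gap between the current policy and the behavior policy.
logp_new = [-1.0, -1.0, -5.0]
logp_behavior = [-1.0, -1.2, -9.0]

full_ratio = sequence_ratio(logp_new, logp_behavior, keep_steps=3)
truncated_ratio = sequence_ratio(logp_new, logp_behavior, keep_steps=2)
advantages = group_advantages([1.0, 2.0, 3.0])
```

In this toy setup the untruncated ratio is dominated by the single late step, while the truncated ratio stays close to 1 and falls in a range where the clip is meaningful. That is the qualitative behavior the abstract attributes to late denoising steps; the numbers themselves are invented.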
