Fugu-MT Paper Translation (Abstract): GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Paper abstract: GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

arxiv url: http://arxiv.org/abs/2510.22319v2
Date: Thu, 30 Oct 2025 09:33:15 GMT
Status: Translation complete
Last updated in system: 2025-10-31 13:50:54.714956
Title: GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
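Abstract summary: GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio so that PPO clipping properly constrains harmful updates, and it substantially mitigates implicit over-optimization without relying on heavy KL regularization.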
Reference score (independently computed attention metric): 63.33669214116784
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
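Illustrative sketch (not from the paper): the clipping failure described in the abstract can be reproduced on synthetic numbers. The Python below applies the standard PPO-style clipped surrogate to log importance ratios whose mean lies below 0, so the ratio's mean is below 1, and counts how often the clip actually removes the gradient. The function names, the Gaussian log-ratio model, the constants (-0.15, 0.1, eps=0.2) and the standardize-and-rescale step in `normalize_log_ratios` are all assumptions made for illustration; the abstract does not give GRPO-Guard's actual normalization formula, and gradient reweighting is not covered here.

```python
import math
import random
import statistics


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO-style clipped objective for a single sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def gradient_is_clipped(ratio, advantage, eps=0.2):
    """True when clipping zeroes the gradient with respect to the ratio."""
    if advantage > 0:
        return ratio > 1.0 + eps
    if advantage < 0:
        return ratio < 1.0 - eps
    return False


def clipped_fraction(log_ratios, advantage, eps=0.2):
    """Share of samples whose update is stopped by the clip."""
    hits = [gradient_is_clipped(math.exp(lr), advantage, eps) for lr in log_ratios]
    return sum(hits) / len(hits)


def normalize_log_ratios(log_ratios, target_std=0.1):
    """Hypothetical normalization: center log-ratios at 0 and fix their spread.

    This is an assumption for illustration, not the paper's formula.
    """
    mean = statistics.fmean(log_ratios)
    std = statistics.pstdev(log_ratios)
    return [(lr - mean) / std * target_std for lr in log_ratios]


rng = random.Random(0)
# Synthetic, made-up log-ratios with a negative mean (left-shifted ratios).
shifted = [rng.gauss(-0.15, 0.1) for _ in range(10_000)]
balanced = normalize_log_ratios(shifted)

pos_before = clipped_fraction(shifted, advantage=+1.0)
neg_before = clipped_fraction(shifted, advantage=-1.0)
pos_after = clipped_fraction(balanced, advantage=+1.0)
neg_after = clipped_fraction(balanced, advantage=-1.0)

print(f"positive-advantage clipped: {pos_before:.4f} -> {pos_after:.4f}")
print(f"negative-advantage clipped: {neg_before:.4f} -> {neg_after:.4f}")
```

With left-shifted ratios, positive-advantage samples almost never exceed 1 + eps, so their updates go unconstrained, while negative-advantage samples are clipped often. Re-centering the log-ratios brings the upper clip back into play, which is the qualitative effect the abstract attributes to ratio normalization.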


Related papers