Fugu-MT Paper Translation (Abstract): Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Paper overview: Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

arxiv url: http://arxiv.org/abs/2605.10937v1
Date: Mon, 11 May 2026 17:59:25 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:51.062846
Title: Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Title (reference translation): Power Reinforcement Post-Training of Text-to-Image Models via Super-Linear Advantage Shaping
Authors: Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu
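Abstract summary: Post-training methods for text-to-image (T2I) models are prone to reward hacking. Super-Linear Advantage Shaping (SLAS) reshapes the local policy space. SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks.
Reference score (independently computed attention score): 66.25536973294726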
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
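Abstract (reference translation): In recent years, post-training methods based on reinforcement learning, in particular Group Relative Policy Optimization (GRPO), have emerged as a robust paradigm for further improving text-to-image (T2I) models. These methods, however, tend to suffer from reward hacking: the model exploits biases in an imperfect reward function instead of achieving genuine performance gains. This work shows that normalization can cause miscalibration, and that directly removing the prompt-level standard deviation term gives an optimal policy ascent direction that is linear in the advantage yet still limits how well genuine signal can be separated from noise. To mitigate these issues, the authors propose Super-Linear Advantage Shaping (SLAS), which revisits the functional update from the perspective of information geometry. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. The design relaxes constraints along high-advantage directions to amplify informative updates and tightens them in low-advantage regions to suppress illusory gradients. Batch-level normalization is additionally applied to stabilize training under varying reward scales. Extensive evaluations show that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it delivers faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and greater robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in the generated images.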
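
The abstract contrasts three ways of turning rewards into advantages: per-prompt standard deviation normalization, plain centering (linear in the advantage), and a super-linear shaping followed by batch-level normalization. The following is a minimal sketch of that contrast, not the authors' implementation. The abstract does not give the SLAS formula, so the signed power function `power_shape`, its exponent `p`, the helper names, and the toy reward values are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, use_std=True, eps=1e-8):
    """Group-relative advantages for the samples drawn for one prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered  # linear in the raw advantage
    sigma = pstdev(rewards)  # prompt-level standard deviation
    return [c / (sigma + eps) for c in centered]


def power_shape(advantages, p=2.0):
    """ASSUMED shaping (not from the paper): sign(a) * |a|**p with p > 1."""
    return [math.copysign(abs(a) ** p, a) for a in advantages]


def batch_normalize(groups, eps=1e-8):
    """Divide every shaped advantage by one batch-level standard deviation."""
    flat = [a for g in groups for a in g]
    sigma = pstdev(flat)
    return [[a / (sigma + eps) for a in g] for g in groups]


# Toy rewards: prompt 0 is almost pure noise, prompt 1 is clearly separated.
rewards = [[0.50, 0.51, 0.49, 0.50], [0.10, 0.90, 0.20, 0.80]]

per_prompt = [group_advantages(r, use_std=True) for r in rewards]
centered = [group_advantages(r, use_std=False) for r in rewards]
shaped = batch_normalize([power_shape(c, p=2.0) for c in centered])

for name, groups in [("per-prompt std", per_prompt),
                     ("centered only", centered),
                     ("power + batch norm", shaped)]:
    print(name, [round(max(g), 4) for g in groups])
```

In this toy batch, per-prompt normalization gives the noisy prompt a larger peak advantage than the informative one, which is the kind of miscalibration the abstract describes. Centering alone restores the ordering (a ratio of about 40 here), and the assumed exponent `p=2.0` widens it further (about 1600) while the single batch-level scale keeps that ratio intact.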

Related papers