Fugu-MT paper translation (abstract): Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

Paper overview: Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

arxiv url: http://arxiv.org/abs/2606.01636v1
Date: Mon, 01 Jun 2026 03:41:04 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.903615
Title: Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition
Authors: Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang
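Abstract summary: Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. The authors propose Pave-GRPO, which reformulates the GRPO objective through principled average velocity decomposition.
Reference score (independently computed attention score): 43.9250042009344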
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
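The abstract's idea of decomposing one coarse transition into an equivalent ensemble of finer sub-trajectories can be illustrated with a small numerical sketch. This is not the authors' implementation: the toy ODE dx/dt = -x, the equal-width sub-intervals, and all function names below are assumptions chosen only to show that the average velocity over a coarse interval equals the length-weighted sum of average velocities over its sub-intervals.

```python
import math


def trajectory(x0: float, t: float) -> float:
    """Closed-form solution of the toy ODE dx/dt = -x.

    Hypothetical stand-in for a flow model's sampling trajectory.
    """
    return x0 * math.exp(-t)


def average_velocity(x0: float, t_start: float, t_end: float) -> float:
    """Average velocity over [t_start, t_end]: displacement divided by elapsed time."""
    return (trajectory(x0, t_end) - trajectory(x0, t_start)) / (t_end - t_start)


def decompose(x0: float, t_start: float, t_end: float, num_sub: int):
    """Split [t_start, t_end] into num_sub equal sub-intervals (an assumption).

    Returns the relative length of each sub-interval and its average velocity.
    """
    span = t_end - t_start
    edges = [t_start + span * i / num_sub for i in range(num_sub + 1)]
    pairs = list(zip(edges, edges[1:]))
    weights = [(b - a) / span for a, b in pairs]
    sub_velocities = [average_velocity(x0, a, b) for a, b in pairs]
    return weights, sub_velocities


# One coarse transition versus its four-part decomposition.
coarse = average_velocity(2.0, 0.0, 1.0)
weights, sub_velocities = decompose(2.0, 0.0, 1.0, 4)
ensemble = sum(w * u for w, u in zip(weights, sub_velocities))
```

In this sketch `coarse` and `ensemble` agree up to floating-point error, because the sub-interval displacements telescope. How Pave-GRPO selects the intermediate timesteps and attaches rewards to them is not specified in the abstract and is not modeled here.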
