Fugu-MT paper translation (abstract): V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Paper overview: V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

arxiv url: http://arxiv.org/abs/2604.23380v1
Date: Sat, 25 Apr 2026 17:03:21 GMT
Status: translation complete
Last updated in system: 2026-04-28 17:12:07.305454
Title: V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy
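Abstract summary: This paper proposes Variational GRPO (V-GRPO), which integrates ELBO-based surrogates with the Group Relative Policy Optimization algorithm. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, with a 2x speedup over MixGRPO and a 3x speedup over DiffusionNFT.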
Reference score (attention score computed by this site): 90.69263509098948
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
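The abstract names two ingredients: group-relative advantages from GRPO, and an ELBO-based surrogate used in place of the intractable log-likelihood. The following is a minimal Python sketch of how those two pieces could fit together, not the authors' implementation. The function names, the `eps` constant, and the plain advantage-weighted objective are assumptions made for illustration; ratio clipping, KL regularization, and the variance-reduction and step-control techniques the abstract mentions are omitted because the abstract does not specify them.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(elbo_estimates, advantages):
    """Advantage-weighted objective with an ELBO estimate standing in for
    the log-likelihood of each sample (hypothetical simplification).

    Minimizing this loss raises the ELBO of above-average samples and
    lowers it for below-average ones.
    """
    n = len(advantages)
    return -sum(a * e for a, e in zip(advantages, elbo_estimates)) / n


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]   # made-up rewards for one group
    elbos = [-1.0, -2.0, -3.0]  # made-up ELBO estimates for the same samples
    adv = group_relative_advantages(rewards)
    print(adv)
    print(surrogate_loss(elbos, adv))
```

In a real training loop the ELBO estimates would come from the denoising model's training loss evaluated on sampled images, and the loss would be differentiated with respect to the model parameters; both are outside the scope of this sketch.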

