Fugu-MT Paper Translation (Abstract): UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Paper overview: UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

arxiv url: http://arxiv.org/abs/2604.18518v2
Date: Tue, 21 Apr 2026 03:05:09 GMT
Status: Translation complete
Last updated in system: 2026-04-22 14:04:47.953166
Title: UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
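Title (reference translation): UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Authors: Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
Abstract summary: We propose UDM-GRPO, the first framework to integrate RL with UDM. The method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution.
Reference score (independently computed attention measure): 35.98585605462306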
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
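Abstract (reference translation): The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling, but its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and only marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. In addition, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of the method. Code is available at https://github.com/Yovecent/UDM-GRPO.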
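The two insights in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_prob` parameterization of the noise level, and the example reward values are all assumptions made for illustration. It shows (a) the group-relative advantage that GRPO-style methods compute by normalizing rewards within a group of samples for one prompt, and (b) rebuilding noisy intermediate states from a final clean token sequence with a uniform forward process, where each token is either kept or resampled uniformly from the vocabulary.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_forward_noise(clean_tokens, keep_prob, vocab_size, rng):
    """Assumed uniform forward process: keep each token with probability
    keep_prob, otherwise resample it uniformly from the vocabulary."""
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in clean_tokens
    ]


def reconstruct_trajectory(clean_tokens, keep_probs, vocab_size, seed=0):
    """Noise the final clean sample at each level in keep_probs.

    Returns a list of (keep_prob, noisy_tokens) pairs. The noise schedule
    is hypothetical; the paper's actual schedule is not given in the abstract.
    """
    rng = random.Random(seed)
    return [
        (p, uniform_forward_noise(clean_tokens, p, vocab_size, rng))
        for p in keep_probs
    ]


# Hypothetical rewards for four samples generated from the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Hypothetical final clean sample (token ids) and noise levels.
clean = [3, 1, 4, 1, 5]
trajectory = reconstruct_trajectory(clean, [0.25, 0.5, 1.0], vocab_size=8)
```

In this sketch the advantage is attached to the clean sample as a whole, and the noisy states are derived from that sample instead of being taken from the sampling path. How the per-step policy loss is then formed is not described in the abstract and is left out here.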


Related papers