Fugu-MT Paper Translation (Summary): Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Paper summary: Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

arxiv url: http://arxiv.org/abs/2605.16842v1
Date: Sat, 16 May 2026 06:59:54 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.196353
Title: Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
Title (reference translation): Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
Authors: Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu
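Abstract summary: The paper shows how to optimize diffusion multi-modal large language models (dMLLMs) through reinforcement learning (RL). The approach features a Sketch-Then-Paint training scheme that organizes updates into three stages: global, structure, and refinement. Experiments with two popular dMLLM backbones, MMaDA and Lumina-DiMOO, show substantial gains on the GenEval and DPG benchmarks.
Reference score (attention measure computed independently by the site): 52.40742159500277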
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
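Abstract (reference translation): Diffusion multi-modal large language models (dMLLMs) are powerful for image generation, but optimizing them with reinforcement learning (RL) remains a major challenge. The main difficulty is that a single image can be produced through many different unmasking sequences, which often makes the importance ratios intractable to compute. In addition, existing methods tend to ignore the hierarchical generation process of dMLLMs, in which early tokens define the global layout and later tokens focus on local details. By assigning a uniform reward to every token, these methods fail to reflect each token's actual contribution to the final image. To address these problems, the authors propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. The approach features a Sketch-Then-Paint training scheme that organizes updates into three stages: global, structure, and refinement. A prompt-conditioned estimator computes the importance ratios starting from a fully masked state. A Hierarchical Credit Assignment mechanism prioritizes key structural tokens to ensure accurate reward propagation. Experiments with two popular dMLLM backbones, MMaDA and Lumina-DiMOO, show that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations on six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.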
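The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of stage-weighted credit assignment, not the authors' implementation. It assumes the standard GRPO group-relative advantage (each sample's reward standardized within its group) and then scales that advantage per token by the stage in which the token was unmasked. The stage boundaries (equal thirds of the unmasking schedule), the weight values in `STAGE_WEIGHTS`, and the names `stage_of` and `token_advantages` are all hypothetical and chosen purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract only says that key structural tokens
# are prioritized; these numbers are NOT taken from the paper.
STAGE_WEIGHTS = {"global": 1.0, "structure": 1.5, "refinement": 0.5}


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stage_of(step, total_steps):
    """Assumed rule: split the unmasking schedule into three equal parts."""
    frac = step / total_steps
    if frac < 1 / 3:
        return "global"
    if frac < 2 / 3:
        return "structure"
    return "refinement"


def token_advantages(rewards, unmask_steps, total_steps):
    """Scale each sample's advantage by the weight of each token's stage.

    rewards:      one scalar reward per sampled image in the group
    unmask_steps: per sample, the step at which each token was unmasked
    """
    advantages = group_relative_advantages(rewards)
    return [
        [adv * STAGE_WEIGHTS[stage_of(step, total_steps)] for step in steps]
        for adv, steps in zip(advantages, unmask_steps)
    ]


if __name__ == "__main__":
    # Two sampled images for one prompt, three tokens each, nine unmasking steps.
    result = token_advantages([1.0, 0.0], [[0, 4, 8], [1, 5, 7]], total_steps=9)
    for row in result:
        print([round(value, 3) for value in row])
```

With a uniform scheme every token in a sample would receive the same advantage; here tokens from the assumed "structure" stage receive a larger share of the update, which is the kind of non-uniform credit the abstract describes.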

Related papers