Fugu-MT Paper Translation (Summary): MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Paper overview: MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

arxiv url: http://arxiv.org/abs/2604.06966v1
Date: Wed, 08 Apr 2026 11:30:35 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.497192
Title: MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
Authors: Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, Feng Zhao
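Abstract summary: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. Extending RL to hybrid AR-diffusion frameworks, however, remains difficult because of interleaved inference and noisy log-probability estimation. This work studies masked autoregressive models (MAR) and shows that the diffusion head plays a critical role in training dynamics.
Reference score (in-house attention score): 24.618644100413018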
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.
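The abstract does not give the exact training objective, so what follows is only a minimal Python sketch, under stated assumptions, of how multi-trajectory expectation (MTE) with top-k% uncertainty gating could plug into a GRPO-style update. It assumes that each AR token's log-probability is estimated once per sampled diffusion trajectory, that token-wise uncertainty is the standard deviation of those estimates across K trajectories, and that tokens outside the top-k% keep a single-trajectory estimate. The names `group_advantages`, `mte_token_logprobs`, and `surrogate_loss` are hypothetical and are not taken from the MAR-GRPO code release.

```python
import math
import statistics
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward by its group's mean and std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mte_token_logprobs(
    traj_logprobs: List[List[float]],  # [K trajectories][T tokens]
    top_k_frac: float = 0.2,
) -> Tuple[List[float], List[bool]]:
    """Hypothetical MTE sketch with uncertainty-gated token selection.

    traj_logprobs[j][t] is the log-probability of AR token t estimated from
    diffusion trajectory j. The tokens whose estimates disagree most across
    trajectories (highest std) receive the trajectory-averaged value; all
    other tokens keep the single-trajectory estimate from trajectory 0.
    """
    num_tokens = len(traj_logprobs[0])
    per_token = list(zip(*traj_logprobs))  # transpose to [T][K]
    means = [statistics.fmean(vals) for vals in per_token]
    stds = [statistics.pstdev(vals) for vals in per_token]

    # Pick the top-k% most uncertain tokens (always at least one).
    n_select = max(1, math.ceil(top_k_frac * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda t: stds[t], reverse=True)
    selected = set(ranked[:n_select])

    mask = [t in selected for t in range(num_tokens)]
    logprobs = [means[t] if mask[t] else traj_logprobs[0][t] for t in range(num_tokens)]
    return logprobs, mask


def surrogate_loss(advantage: float, token_logprobs: List[float]) -> float:
    """Policy-gradient surrogate for one sample: -A * mean token log-prob."""
    return -advantage * statistics.fmean(token_logprobs)


# Three diffusion trajectories (K=3) over four AR tokens (T=4).
trajs = [
    [-1.0, -2.0, -0.5, -3.0],
    [-1.0, -2.4, -0.5, -1.0],
    [-1.0, -1.6, -0.5, -2.0],
]
lp, mask = mte_token_logprobs(trajs, top_k_frac=0.25)
print(mask)  # only the last token, whose estimates vary most, is averaged
print(lp)
```

If the per-trajectory estimates are i.i.d., averaging K of them divides the variance of a token's log-probability estimate by K, which is the intuition behind MTE. Restricting the average to the most uncertain tokens mirrors the abstract's stated aim of avoiding over-smoothing.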
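The abstract does not say how a token's alignment with the final generated content is measured. As a minimal sketch, assume (hypothetically) that each AR token has a feature vector from the step at which it was predicted and a matching vector at the same position in the final generated content, and that alignment is their cosine similarity compared against a threshold. `consistency_mask`, `masked_mean`, and the 0.5 threshold are illustrative choices, not the paper's definitions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two vectors (eps guards against zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)


def consistency_mask(
    token_preds: List[Sequence[float]],   # hypothetical: per-token prediction when emitted
    final_tokens: List[Sequence[float]],  # hypothetical: same positions in the final content
    threshold: float = 0.5,
) -> List[bool]:
    """Keep tokens whose prediction agrees with the final content (cosine >= threshold)."""
    return [cosine(p, f) >= threshold for p, f in zip(token_preds, final_tokens)]


def masked_mean(values: List[float], mask: List[bool]) -> float:
    """Average only over kept tokens; returns 0.0 if every token is filtered out."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


preds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]
mask = consistency_mask(preds, final, threshold=0.5)
# The second token contradicts the final content, so it is dropped from the average.
loss_term = masked_mean([-1.0, -5.0, -3.0], mask)
print(mask, loss_term)
```

Filtering happens before the log-probabilities are averaged, so tokens judged inconsistent simply contribute no gradient in this sketch.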

List of related papers