Fugu-MT Paper Translation (Summary): Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Paper summary: Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

arxiv url: http://arxiv.org/abs/2606.14792v1
Date: Thu, 11 Jun 2026 07:33:46 GMT
Status: Translation complete
System update date: 2026-06-16 16:21:32.160636
Title: Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model
Authors: Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji
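Abstract summary: Multimodal discrete diffusion models are an effective alternative to AR models for reinforcement learning in interleaved reasoning. Joint reward assignment uses a shared reward signal across modalities, which introduces cross-modal interference during RL updates. The authors propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments.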
Reference score (attention score computed independently by this site): 70.56994065819471
License: http://creativecommons.org/licenses/by/4.0/
Abstract: RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
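
The abstract does not spell out how the two reward assignment schemes are computed, so the following is only a minimal sketch of the contrast it describes, not the authors' implementation. It assumes GRPO-style group-relative advantages (reward minus group mean, divided by the group's population standard deviation), one scalar text reward and one scalar vision reward per rollout, and a per-token modality label. The function names, the dictionary keys, and the choice to sum the two rewards in the joint case are all hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its rollout group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_token_advantages(rollouts):
    """Joint assignment (assumed form): one shared scalar per rollout,
    broadcast to every token regardless of modality."""
    shared = [r["text_reward"] + r["vision_reward"] for r in rollouts]
    adv = group_advantages(shared)
    return [[a] * len(r["modalities"]) for a, r in zip(adv, rollouts)]


def factorized_token_advantages(rollouts):
    """Factorized assignment (assumed form): text tokens receive the
    text-reward advantage, image tokens receive the vision-reward advantage."""
    text_adv = group_advantages([r["text_reward"] for r in rollouts])
    vision_adv = group_advantages([r["vision_reward"] for r in rollouts])
    return [
        [t if m == "text" else v for m in r["modalities"]]
        for t, v, r in zip(text_adv, vision_adv, rollouts)
    ]


# Hypothetical group of two rollouts, each with two text and two image tokens.
# Rollout 0 has good text but a bad visual edit; rollout 1 is the reverse.
rollouts = [
    {"modalities": ["text", "text", "image", "image"],
     "text_reward": 1.0, "vision_reward": 0.0},
    {"modalities": ["text", "text", "image", "image"],
     "text_reward": 0.0, "vision_reward": 1.0},
]

joint = joint_token_advantages(rollouts)
factorized = factorized_token_advantages(rollouts)
```

In this toy group both rollouts have the same summed reward, so the joint advantages are all zero and no token receives a learning signal. The factorized version gives the text tokens of rollout 0 a positive advantage and its image tokens a negative one, which is one way to read the cross-modal interference that the abstract attributes to a shared reward.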

