Fugu-MT Paper Translation (Abstract): Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Paper overview: Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

arxiv url: http://arxiv.org/abs/2603.09538v1
Date: Tue, 10 Mar 2026 11:49:20 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.260198
Title: Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
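Abstract summary: The authors propose a reinforcement learning-based post-training strategy to unlock multimodal interleaved generation in existing unified models. The approach jointly models text and image generation within a single decoding trajectory and optimizes it with novel hybrid rewards. Experiments on MMIE and InterleavedBench demonstrate that it significantly improves the quality and coherence of multimodal interleaved generation.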
Reference score (independently computed attention score): 35.14373974143734
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
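As a rough illustration of the optimization step described in the abstract, the sketch below combines the three named reward components into one scalar per sampled trajectory and then applies the group-relative normalization that GRPO is known for. Everything beyond the component names is an assumption: the abstract does not say how the rewards are combined, so a plain weighted sum is used here, and `RewardParts`, `hybrid_reward`, `group_relative_advantages`, the weights, and the sample values are all hypothetical. Process-level (step-wise) rewards are not modeled.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardParts:
    """Reward components named in the abstract; the values used below are made up."""
    textual_relevance: float
    visual_text_alignment: float
    structural_fidelity: float


def hybrid_reward(parts, weights=(1.0, 1.0, 1.0)):
    # Assumption: a simple weighted sum. The paper's actual combination
    # rule is not stated in the abstract.
    w_text, w_align, w_struct = weights
    return (w_text * parts.textual_relevance
            + w_align * parts.visual_text_alignment
            + w_struct * parts.structural_fidelity)


def group_relative_advantages(rewards, eps=1e-8):
    # Group-relative normalization: subtract the group mean and divide by
    # the group standard deviation (population std is used here for simplicity).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical interleaved trajectories sampled for the same prompt.
group = [
    RewardParts(0.9, 0.8, 1.0),
    RewardParts(0.4, 0.5, 1.0),
    RewardParts(0.7, 0.2, 0.0),
]
rewards = [hybrid_reward(p) for p in group]
advantages = group_relative_advantages(rewards)
```

In a full training loop, each trajectory's advantage would weight the policy-gradient term for all of its text and image tokens, so trajectories that score above the group average are reinforced and those below it are suppressed.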
