Fugu-MT Paper Translation (Summary): MMaDA: Multimodal Large Diffusion Language Models

Paper summary: MMaDA: Multimodal Large Diffusion Language Models

arxiv url: http://arxiv.org/abs/2505.15809v1
Date: Wed, 21 May 2025 17:59:05 GMT
Status: Translation complete
Last updated in system: 2025-05-22 15:42:59.825936
Title: MMaDA: Multimodal Large Diffusion Language Models
Title (reference translation): MMaDA: Multimodal Large Diffusion Language Models
Authors: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
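Abstract summary: We introduce MMaDA, a novel class of multimodal diffusion foundation models. It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.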
Reference score (independently computed attention metric): 47.043301822171195
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
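Abstract (reference translation): MMaDA is a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations. (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between the textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage and improves the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. By utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experiments show that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models such as LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These results highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, and provide a comprehensive framework for future research and development. We have open-sourced our code and trained models at https://github.com/Gen-Verse/MMaDA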

Related papers