Fugu-MT Paper Translation (Abstract): UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Paper summary: UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

arxiv url: http://arxiv.org/abs/2606.16255v1
Date: Mon, 15 Jun 2026 05:57:40 GMT
Status: Translation complete
System update date: 2026-06-16 16:21:34.103618
Title: UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer
Title (reference translation): UniDDT: Unifying Multimodal Understanding and Generation with a Decoupled Diffusion Transformer
Authors: Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang
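Abstract summary: Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence. Existing UMMs face three challenges: (1) inherent learning conflicts between visual understanding and generation tasks, which lead to suboptimal modeling in both; (2) different visual spaces for understanding and generation, which impede scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation.
Reference score (independently computed attention score): 29.975180930024067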
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.
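Abstract (reference translation): Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face major challenges: (1) inherent learning conflicts between visual understanding and generation tasks lead to suboptimal modeling in both tasks; (2) differing visual spaces for understanding and generation impede scalability; (3) over-reliance on task-specific data neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding tasks, and employs a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT uses the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. As a result, scalability in generation tasks and semantic expressiveness in understanding tasks can be balanced. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. On visual generation tasks, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. On multimodal understanding tasks, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.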
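
The abstract states that dual data structures are built from the same image-text pairs, but it does not give the data format. The following is a minimal sketch only, assuming each pair yields one image-to-text (understanding) sample and one text-to-image (generation) sample. The names `Sample` and `make_dual_samples` and all field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; the field names are illustrative assumptions."""
    task: str       # "understanding" or "generation"
    condition: str  # what the model is given (image path or caption)
    target: str     # what the model is asked to produce


def make_dual_samples(image_path: str, caption: str) -> List[Sample]:
    """Turn one image-text pair into two samples with swapped roles."""
    return [
        # Understanding direction: image in, text out.
        Sample(task="understanding", condition=image_path, target=caption),
        # Generation direction: text in, image out.
        Sample(task="generation", condition=caption, target=image_path),
    ]


def build_dual_dataset(pairs: List[Tuple[str, str]]) -> List[Sample]:
    """Flatten the dual samples of every pair into one list."""
    return [s for image_path, caption in pairs for s in make_dual_samples(image_path, caption)]


# Placeholder pairs, used only to exercise the functions.
pairs = [("cat.png", "a cat on a sofa"), ("dog.png", "a dog in a park")]
dataset = build_dual_dataset(pairs)
```

Under this assumption every pair contributes equally to both tasks, so the understanding and generation data always cover the same images and captions.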

Related papers