Fugu-MT Paper Translation (Summary): MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

Paper summary: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

arxiv url: http://arxiv.org/abs/2503.01298v1
Date: Mon, 03 Mar 2025 08:36:16 GMT
Status: Translation complete
System update date: 2025-03-05 18:50:37.894812
Title: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
Title (reference translation): MINT: Multimodal Chain of Thought in Unified Generative Models for Image Generation
Authors: Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, Haoyuan Li, Weilong Dai, Mingli Song, Jie Song, Hao Jiang
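Abstract summary: We introduce MINT, an innovative unified generative model that uses multimodal chain of thought (MCoT) for enhanced image generation. We propose an MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection designed specifically for image generation. MINT has been validated to show superior performance on multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
Reference score (independently calculated attention measure): 38.517814177255765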
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, which is hard to solely rely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation for the first time. Firstly, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities, while avoiding potential modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
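Abstract (reference translation): Unified generative models have shown extraordinary performance in both text generation and image generation. However, their performance tends to drop when generating complex images, because such images are hard to produce by relying on text-to-image generation alone. To address this challenge, MINT is an innovative unified generative model that applies native multimodal chain of thought (MCoT) for enhanced image generation. First, we design MTXpert, an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities. We then propose an MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection designed specifically for image generation. This paradigm gives MINT nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. It also promotes advanced multimodal reasoning and self-reflection, enabling the construction of images firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to show superior performance on multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.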


Related papers