MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
- URL: http://arxiv.org/abs/2503.01298v1
- Date: Mon, 03 Mar 2025 08:36:16 GMT
- Title: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
- Authors: Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, Haoyuan Li, Weilong Dai, Mingli Song, Jie Song, Hao Jiang
- Abstract summary: We introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation. We propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
- Score: 38.517814177255765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, which are difficult to produce by relying solely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation for the first time. First, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities, while avoiding potential modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
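The Mixture of Transformer Experts (MTXpert) design described in the abstract can be pictured as modality-specific feed-forward experts that share a single attention layer over the interleaved text and image token sequence. The sketch below is a minimal PyTorch illustration of that expert-parallel routing under assumed class names, layer sizes, and masking; it is not the authors' implementation.

```python
# Illustrative sketch (assumed names/shapes): tokens share one attention layer,
# then each token is routed to a text or image feed-forward expert by a
# modality mask, so the two modalities do not compete for the same MLP capacity.
import torch
import torch.nn as nn


class ModalityExpertBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Separate feed-forward "experts" per modality.
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.image_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (batch, seq) boolean modality mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # shared attention over the joint sequence
        x = x + attn_out
        h = self.norm2(x)
        ffn_out = torch.where(
            is_image.unsqueeze(-1), self.image_ffn(h), self.text_ffn(h)
        )  # per-token routing to the modality expert
        return x + ffn_out


if __name__ == "__main__":
    block = ModalityExpertBlock()
    tokens = torch.randn(2, 16, 512)
    mask = torch.zeros(2, 16, dtype=torch.bool)
    mask[:, 8:] = True  # treat the last 8 tokens as image tokens
    print(block(tokens, mask).shape)  # torch.Size([2, 16, 512])
```

Sharing attention while splitting the feed-forward path is one plausible reading of how an expert-parallel structure can keep text and image tokens jointly aligned while avoiding the modality conflicts the abstract mentions.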
Related papers
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Ominiponent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts [48.214475133206385]
EMMA is a novel image generation model that accepts multi-modal prompts, built upon the state-of-the-art text-to-image (T2I) diffusion model ELLA.
By freezing all parameters in the original T2I diffusion model and adjusting only some additional layers, we reveal the interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts (a minimal sketch of this freeze-and-adapt pattern follows this entry).
arXiv Detail & Related papers (2024-06-13T14:26:43Z)
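The freeze-and-adapt recipe summarized in the EMMA entry above, freezing the pretrained backbone and training only a few added layers, can be sketched minimally as follows; the backbone, adapter layer, and shapes here are stand-ins assumed for illustration, not EMMA's actual modules.

```python
# Minimal freeze-and-adapt sketch (assumed modules, not EMMA's code):
# all backbone parameters are frozen and only a small added layer is trainable.
import torch
import torch.nn as nn


class FrozenBackboneWithAdapter(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # freeze every pretrained weight
        # Small trainable layer that injects the extra (multi-modal) condition.
        self.adapter = nn.Linear(cond_dim, hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.backbone(x + self.adapter(cond))


if __name__ == "__main__":
    backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
    model = FrozenBackboneWithAdapter(backbone, hidden_dim=64, cond_dim=32)
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(trainable)  # only the adapter's weight and bias remain trainable
    out = model(torch.randn(4, 64), torch.randn(4, 32))
    print(out.shape)  # torch.Size([4, 64])
```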
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation process and propose an autonomous decision module to choose the best match between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)