EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
- URL: http://arxiv.org/abs/2406.09162v1
- Date: Thu, 13 Jun 2024 14:26:43 GMT
- Title: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
- Authors: Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
- Abstract summary: EMMA is a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA.
By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts.
- Score: 48.214475133206385
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
Related papers
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Omnipotent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing [23.51498634405422]
We present an innovative image editing framework that employs the robust Chain-of-Thought reasoning and localizing capabilities of multimodal large language models.
Our model exhibits an enhanced ability to understand complex prompts and generate corresponding images, while maintaining high fidelity and consistency in images before and after generation.
arXiv Detail & Related papers (2024-05-27T03:50:37Z)
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation [22.69019130782004]
We present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities.
We train MoMA to serve a dual role as both a feature extractor and a generator.
We introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model.
arXiv Detail & Related papers (2024-04-08T16:55:49Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments on DialogGen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use and improve the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.