X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
- URL: http://arxiv.org/abs/2503.06134v2
- Date: Sat, 22 Mar 2025 07:28:50 GMT
- Title: X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
- Authors: Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
- Abstract summary: We propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities. X2I shows less than 1% performance degradation relative to the teacher model while gaining various multimodal understanding abilities.
- Score: 7.61087111021017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, there is currently no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models so that they can understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely a 100K English corpus and 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows less than 1% performance degradation while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and creative fusion to enhance imagery. Furthermore, it is applicable to LoRA training in the context of image-text to image generation, filling an industry gap in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: https://github.com/OPPO-Mente-Lab/X2I.
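The abstract does not give implementation details, so the following is a minimal sketch of what an AlignNet-style bridge plus an attention-distillation loss could look like: MLLM hidden states are projected into the DiT's conditioning space, and the student's cross-attention maps are matched to those of the frozen teacher. All module names, dimensions, and the MSE loss form are illustrative assumptions, not the authors' actual design.

```python
# Illustrative sketch only; module names, dimensions, and loss form are
# assumptions based on the abstract, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignNet(nn.Module):
    """Hypothetical lightweight bridge mapping frozen-MLLM hidden states
    into the feature space expected by the DiT conditioning branch."""

    def __init__(self, mllm_dim: int = 4096, dit_dim: int = 4096, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dit_dim),
        )
        self.norm = nn.LayerNorm(dit_dim)

    def forward(self, mllm_states: torch.Tensor) -> torch.Tensor:
        # mllm_states: (batch, seq_len, mllm_dim) from a frozen MLLM
        return self.norm(self.proj(mllm_states))


def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """Match the student's cross-attention maps (conditioned on AlignNet
    outputs) to the frozen teacher's maps (conditioned on the original
    text encoder). Plain MSE is used here purely for illustration."""
    return F.mse_loss(student_attn, teacher_attn)


if __name__ == "__main__":
    bridge = AlignNet()
    fake_mllm_out = torch.randn(2, 77, 4096)      # stand-in MLLM features
    cond = bridge(fake_mllm_out)                  # (2, 77, 4096) DiT condition
    s = torch.rand(2, 24, 77, 77)                 # stand-in student attention maps
    t = torch.rand(2, 24, 77, 77)                 # stand-in teacher attention maps
    print(cond.shape, attention_distillation_loss(s, t).item())
```

Under this reading, only the bridge is trained while both the MLLM and the DiT stay frozen, which would be consistent with the reported 100K-corpus, 160-GPU-hour budget; the actual training objective may differ.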
Related papers
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities within a single model: understanding, generation, and editing.
ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics.
We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following [26.457571615782985]
InstructAny2Pix is a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text.
We demonstrate that our system can perform a series of novel instruction-guided editing tasks.
arXiv Detail & Related papers (2023-12-11T17:53:45Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in the effective communication with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit the existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I)
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- AltDiffusion: A Multilingual Text-to-Image Diffusion Model [4.534546889526814]
We present AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages.
Specifically, we first train a multilingual text encoder based on knowledge distillation.
Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability.
arXiv Detail & Related papers (2023-08-19T11:52:12Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes either component challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z)