MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal
Image Generation
- URL: http://arxiv.org/abs/2305.15296v3
- Date: Wed, 20 Dec 2023 18:52:00 GMT
- Title: MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
- Authors: Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich,
Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock,
Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick
Schramowski, Kristian Kersting, Samuel Weinbach
- Abstract summary: MultiFusion allows one to express complex concepts with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system.
- Score: 21.455774034659978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent popularity of text-to-image diffusion models (DM) can largely be
attributed to the intuitive interface they provide to users. The intended
generation can be expressed in natural language, with the model producing
faithful interpretations of text prompts. However, expressing complex or
nuanced ideas in text alone can be difficult. To ease image generation, we
propose MultiFusion, which allows one to express complex and nuanced concepts
with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a
cohesive system, thereby avoiding the need for extensive training from scratch.
Our experimental results demonstrate the efficient transfer of capabilities
from individual modules to the downstream model. Specifically, the fusion of
all independent components allows the image generation module to utilize
multilingual, interleaved multimodal inputs despite being trained solely on
monomodal data in a single language.
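To make the fusion idea concrete, the following is a minimal, purely illustrative sketch of how arbitrarily interleaved text and image inputs could be mapped into a single conditioning sequence for a frozen diffusion denoiser. All class names, shapes, and the toy stand-in "encoders" are assumptions for illustration and do not reproduce MultiFusion's actual modules or training procedure.

```python
# Illustrative sketch only: encoding an interleaved text/image prompt into one
# conditioning sequence that a frozen diffusion U-Net could attend to via
# cross-attention. All names and shapes are hypothetical, not MultiFusion's code.
from typing import List, Union
import torch
import torch.nn as nn


class InterleavedPromptEncoder(nn.Module):
    """Maps an interleaved sequence of token ids and image tensors to one embedding sequence."""

    def __init__(self, d_model: int = 768, vocab_size: int = 32000, n_patches: int = 16):
        super().__init__()
        # Stand-ins for pre-trained, frozen components (multilingual text encoder, image encoder).
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(3 * 32 * 32, n_patches * d_model)  # toy "image encoder"
        self.n_patches, self.d_model = n_patches, d_model
        # Small trainable adapter aligning both modalities into the diffusion model's
        # conditioning space, so the frozen modules themselves need no retraining.
        self.adapter = nn.Linear(d_model, d_model)

    def forward(self, prompt: List[Union[torch.LongTensor, torch.Tensor]]) -> torch.Tensor:
        parts = []
        for item in prompt:
            if item.dtype == torch.long:                      # token ids of a text span
                parts.append(self.token_emb(item))            # (seq_len, d_model)
            else:                                             # an image tensor, e.g. (3, 32, 32)
                flat = item.reshape(1, -1)
                parts.append(self.image_proj(flat).reshape(self.n_patches, self.d_model))
        cond = torch.cat(parts, dim=0)                        # one interleaved sequence
        return self.adapter(cond).unsqueeze(0)                # (1, total_len, d_model)


# Usage: the resulting sequence would replace the text-encoder output that the
# frozen denoiser attends to; only the lightweight adapter would be trained.
encoder = InterleavedPromptEncoder()
prompt = [torch.randint(0, 32000, (5,)),        # e.g. "a chair in the style of"
          torch.rand(3, 32, 32),                # a reference image
          torch.randint(0, 32000, (3,))]        # e.g. "on a beach"
conditioning = encoder(prompt)
print(conditioning.shape)                       # torch.Size([1, 24, 768])
```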
Related papers
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn, interleaved, multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images whose complex semantics draw on both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)