Any-to-Any Generation via Composable Diffusion
- URL: http://arxiv.org/abs/2305.11846v1
- Date: Fri, 19 May 2023 17:38:32 GMT
- Title: Any-to-Any Generation via Composable Diffusion
- Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
- Abstract summary: Composable Diffusion (CoDi) is a novel generative model capable of generating any combination of output modalities.
CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image.
Highly customizable and flexible, CoDi achieves strong joint-modality generation quality.
- Score: 111.94094932032205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image,
video, or audio, from any combination of input modalities. Unlike existing
generative AI systems, CoDi can generate multiple modalities in parallel and
its input is not limited to a subset of modalities like text or image. Despite
the absence of training datasets for many combinations of modalities, we
propose to align modalities in both the input and output space. This allows
CoDi to freely condition on any input combination and generate any group of
modalities, even if they are not present in the training data. CoDi employs a
novel composable generation strategy which involves building a shared
multimodal space by bridging alignment in the diffusion process, enabling the
synchronized generation of intertwined modalities, such as temporally aligned
video and audio. Highly customizable and flexible, CoDi achieves strong
joint-modality generation quality, and outperforms or is on par with the
unimodal state-of-the-art for single-modality synthesis. The project page with
demonstrations and code is at https://codi-gen.github.io
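The composable-generation strategy described in the abstract (a shared multimodal space plus "bridging alignment" across diffusion processes) can be pictured roughly as each modality's denoiser cross-attending to the other's latents at every denoising step. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; ToyDenoiser, joint_denoise_step, the projection layers, and the step size are invented for illustration and are not CoDi's actual architecture or API.

```python
# Illustrative sketch only: joint denoising of two modality latents whose
# denoisers cross-attend to each other's projected latents, mimicking the
# "bridging alignment" idea described in the abstract. All module names,
# shapes, and the toy denoisers are hypothetical, not CoDi's code.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a modality-specific latent diffusion denoiser."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_shared = nn.Linear(dim, cond_dim)    # project into shared space
        self.from_shared = nn.Linear(cond_dim, dim)  # read back from shared space
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z, shared_context):
        # Condition on the other modality via cross-attention in the shared space.
        ctx = self.from_shared(shared_context)
        attn_out, _ = self.cross_attn(z, ctx, ctx)
        return self.mlp(z + attn_out)                # toy noise prediction

def joint_denoise_step(z_video, z_audio, video_net, audio_net):
    """One synchronized denoising step: each modality sees the other's latents."""
    shared_v = video_net.to_shared(z_video)
    shared_a = audio_net.to_shared(z_audio)
    eps_v = video_net(z_video, shared_a)   # video conditioned on audio latents
    eps_a = audio_net(z_audio, shared_v)   # audio conditioned on video latents
    # A real sampler would combine the predictions with a noise schedule;
    # here we just take a small fixed step for illustration.
    return z_video - 0.1 * eps_v, z_audio - 0.1 * eps_a

if __name__ == "__main__":
    dim, cond_dim, seq = 64, 32, 8
    video_net, audio_net = ToyDenoiser(dim, cond_dim), ToyDenoiser(dim, cond_dim)
    z_v, z_a = torch.randn(1, seq, dim), torch.randn(1, seq, dim)
    for _ in range(4):                     # a handful of joint steps
        z_v, z_a = joint_denoise_step(z_v, z_a, video_net, audio_net)
    print(z_v.shape, z_a.shape)            # torch.Size([1, 8, 64]) twice
```

In this toy setup the two latents are updated in lockstep, which is the point of the synchronized generation claim: temporally aligned outputs such as video and audio are denoised together rather than produced independently.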
Related papers
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities [6.9522425458326635]
We propose a multi-tower decoder architecture that flexibly composes multimodal generative models from independently pre-trained unimodal decoders.
We show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data.
In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
arXiv Detail & Related papers (2024-05-29T00:23:55Z) - AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z) - CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation [88.33780780220091]
CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM).
It can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc.
arXiv Detail & Related papers (2023-11-30T18:21:25Z) - i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z) - MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality
Hybrid [40.745848169903105]
Multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs.
MMEA algorithms rely on KG-level modality fusion strategies for multi-modal entity representation.
This paper introduces MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid.
arXiv Detail & Related papers (2022-12-29T20:49:58Z) - i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)