CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- URL: http://arxiv.org/abs/2311.18775v1
- Date: Thu, 30 Nov 2023 18:21:25 GMT
- Title: CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Authors: Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu,
Mohit Bansal
- Abstract summary: CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM).
It can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc.
- Score: 88.33780780220091
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present CoDi-2, a versatile and interactive Multimodal Large Language
Model (MLLM) that can follow complex multimodal interleaved instructions,
conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any
input-output modality paradigm. By aligning modalities with language for both
encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not
only understand complex modality-interleaved instructions and in-context
examples, but also autoregressively generate grounded and coherent multimodal
outputs in the continuous feature space. To train CoDi-2, we build a
large-scale generation dataset encompassing in-context multimodal instructions
across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot
capabilities for multimodal generation, such as in-context learning, reasoning,
and compositionality of any-to-any modality generation through multi-round
interactive conversation. CoDi-2 surpasses previous domain-specific models on
tasks such as subject-driven image generation, vision transformation, and audio
editing. CoDi-2 signifies a substantial breakthrough in developing a
comprehensive multimodal foundation model adept at interpreting in-context
language-vision-audio interleaved instructions and producing multimodal
outputs.
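To make the abstract's architecture description more concrete, below is a minimal, hypothetical sketch of an any-to-any pipeline in the spirit described: modality encoders project non-text inputs into the LLM's embedding space, the backbone processes the interleaved sequence autoregressively, and output heads emit either text logits or continuous features that a downstream decoder (e.g., a diffusion model) could render into an image or audio clip. All module names, dimensions, and the tiny Transformer stand-in are assumptions for illustration, not the released CoDi-2 implementation.

```python
# Illustrative sketch only: module names and dimensions are assumptions,
# not the CoDi-2 code release.
import torch
import torch.nn as nn

class AnyToAnySketch(nn.Module):
    """Toy any-to-any pipeline: encoders project image/audio features into the
    LLM embedding space; the backbone processes the interleaved sequence; heads
    map the last hidden state to text logits or to continuous output features."""

    def __init__(self, d_model=512, vocab_size=32000, feat_dim=256):
        super().__init__()
        # Stand-ins for pretrained modality encoders; plain linear projections here.
        self.image_proj = nn.Linear(1024, d_model)   # 1024-dim image features (assumed)
        self.audio_proj = nn.Linear(768, d_model)    # 768-dim audio features (assumed)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the LLM backbone: a small Transformer encoder stack.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Discrete text head vs. continuous-feature head for non-text outputs.
        self.text_head = nn.Linear(d_model, vocab_size)
        self.feature_head = nn.Linear(d_model, feat_dim)

    def forward(self, text_ids, image_feats, audio_feats):
        # Interleave projected modalities with text embeddings into one sequence.
        seq = torch.cat(
            [self.text_embed(text_ids),
             self.image_proj(image_feats),
             self.audio_proj(audio_feats)],
            dim=1,
        )
        hidden = self.backbone(seq)
        # Last position yields either a next-token distribution or a continuous
        # feature vector that a generative decoder would consume.
        return self.text_head(hidden[:, -1]), self.feature_head(hidden[:, -1])

if __name__ == "__main__":
    model = AnyToAnySketch()
    text_ids = torch.randint(0, 32000, (1, 16))   # 16 prompt tokens
    image_feats = torch.randn(1, 4, 1024)         # 4 image tokens (assumed)
    audio_feats = torch.randn(1, 4, 768)          # 4 audio frames (assumed)
    logits, feats = model(text_ids, image_feats, audio_feats)
    print(logits.shape, feats.shape)              # (1, 32000) and (1, 256)
```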
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
- Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z)
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [24.517389691825667]
ChatBridge is a novel multimodal language model that leverages the expressive capabilities of language to bridge the gap between various modalities.
All codes, data, and models of ChatBridge will be open-sourced.
arXiv Detail & Related papers (2023-05-25T14:34:08Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of vision, language, and speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.