mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video
- URL: http://arxiv.org/abs/2302.00402v1
- Date: Wed, 1 Feb 2023 12:40:03 GMT
- Title: mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video
- Authors: Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu,
Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang,
Fei Huang, Jingren Zhou
- Abstract summary: mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
- Score: 89.19867891570945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed a big convergence of language, vision, and
multi-modal pretraining. In this work, we present mPLUG-2, a new unified
paradigm with modularized design for multi-modal pretraining, which can benefit
from modality collaboration while addressing the problem of modality
entanglement. In contrast to predominant paradigms of solely relying on
sequence-to-sequence generation or encoder-based instance discrimination,
mPLUG-2 introduces a multi-module composition network by sharing common
universal modules for modality collaboration and disentangling different
modality modules to deal with modality entanglement. Different modules can be
flexibly selected for different understanding and generation tasks across all
modalities, including text, image, and video. Empirical study shows that mPLUG-2
achieves state-of-the-art or competitive results on a broad range of over 30
downstream tasks, spanning multi-modal tasks of image-text and video-text
understanding and generation, and uni-modal tasks of text-only, image-only, and
video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results
of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and
video caption tasks with a far smaller model size and data scale. It also
demonstrates strong zero-shot transferability on vision-language and
video-language tasks. Code and models will be released at
https://github.com/alibaba/AliceMind.
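To make the modular composition idea concrete, below is a minimal, hypothetical PyTorch sketch of a model that shares a universal module across modalities while keeping modality-specific modules separate, and that composes only the modules a given task needs. All class names, dimensions, and the concatenation-based fusion are illustrative assumptions for this sketch, not the AliceMind implementation.

```python
# Minimal sketch (assumed design, not the official mPLUG-2 code): modality-specific
# encoders plus one shared "universal" module, composed per downstream task.
import torch
import torch.nn as nn

class TextModule(nn.Module):          # modality-specific text encoder (assumed)
    def __init__(self, vocab=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    def forward(self, ids):
        return self.encoder(self.embed(ids))

class VisionModule(nn.Module):        # modality-specific image/video encoder (assumed)
    def __init__(self, patch_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    def forward(self, patches):
        return self.encoder(self.proj(patches))

class UniversalModule(nn.Module):     # shared module for modality collaboration (assumed)
    def __init__(self, dim=256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    def forward(self, x):
        return self.layer(x)

class ComposedModel(nn.Module):
    """Compose a subset of modules depending on the downstream task."""
    def __init__(self, modules_by_name, selected):
        super().__init__()
        self.selected = selected
        self.parts = nn.ModuleDict({k: modules_by_name[k] for k in selected})
    def forward(self, inputs):
        # Encode each selected modality with its own module.
        feats = [self.parts[name](inputs[name]) for name in self.selected
                 if name != "universal"]
        fused = torch.cat(feats, dim=1)          # concatenate token sequences
        return self.parts["universal"](fused)    # shared module fuses modalities

pool = {"text": TextModule(), "vision": VisionModule(), "universal": UniversalModule()}
# A video-text task might compose vision + text + the shared universal module.
model = ComposedModel(pool, selected=["text", "vision", "universal"])
out = model({"text": torch.randint(0, 30522, (2, 8)),
             "vision": torch.randn(2, 16, 768)})
print(out.shape)  # torch.Size([2, 24, 256])
```

In this sketch a text-only task would select only the text and universal modules, while a multi-modal task adds the vision module, mirroring the flexible per-task module selection described in the abstract.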
Related papers
- Everything is a Video: Unifying Modalities through Next-Frame Prediction [5.720266474212221]
We introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning.
We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components.
Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text.
arXiv Detail & Related papers (2024-11-15T12:59:37Z)
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.