mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration
- URL: http://arxiv.org/abs/2311.04257v2
- Date: Thu, 9 Nov 2023 01:56:51 GMT
- Title: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration
- Authors: Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi
Qian, Ji Zhang, Fei Huang, Jingren Zhou
- Abstract summary: mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
- Score: 74.31268379055201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous
methods primarily focus on enhancing multi-modal capabilities. In this work, we
introduce a versatile multi-modal large language model, mPLUG-Owl2, which
effectively leverages modality collaboration to improve performance in both
text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design,
with the language decoder acting as a universal interface for managing
different modalities. Specifically, mPLUG-Owl2 incorporates shared functional
modules to facilitate modality collaboration and introduces a modality-adaptive
module that preserves modality-specific features. Extensive experiments reveal
that mPLUG-Owl2 is capable of generalizing across both text and multi-modal
tasks and achieving state-of-the-art performance with a single generic model.
Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality
collaboration phenomenon in both pure-text and multi-modal scenarios, setting a
pioneering path in the development of future multi-modal foundation models.
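The modularized design described in the abstract, with a shared language decoder plus a modality-adaptive module, can be made concrete with a small sketch. Below is a minimal, hypothetical PyTorch rendering of the idea, assuming that text and image tokens share the query projection and attention computation while passing through modality-specific layer norms and key/value projections; the class and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a modality-adaptive attention block (illustrative only).
# Assumption: visual and text tokens share the query/output projections and the
# attention itself, while layer norms and key/value projections are kept
# modality-specific so that modality-specific features are preserved.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdaptiveAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared parts: the "universal interface" both modalities pass through.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific parts: preserve modality-specific statistics.
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("text", "image")})
        self.kv_proj = nn.ModuleDict({m: nn.Linear(dim, 2 * dim) for m in ("text", "image")})

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (batch, seq), True where the token is visual.
        b, s, d = x.shape
        m = image_mask.unsqueeze(-1)
        # Route each token through its modality's layer norm and key/value projection.
        normed = torch.where(m, self.norm["image"](x), self.norm["text"](x))
        kv = torch.where(m, self.kv_proj["image"](normed), self.kv_proj["text"](normed))
        k, v = kv.chunk(2, dim=-1)
        q = self.q_proj(normed)  # the query projection is shared across modalities
        # Standard multi-head attention over the joint image-text sequence.
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, s, d)
        return x + self.out_proj(out)  # residual connection


# Example usage (hypothetical shapes):
# layer = ModalityAdaptiveAttention(dim=512)
# x = torch.randn(2, 10, 512)
# mask = torch.zeros(2, 10, dtype=torch.bool); mask[:, :4] = True  # first 4 tokens are image
# y = layer(x, mask)
```

The routing makes the two aspects explicit: both modalities attend over the same joint sequence through shared projections (modality collaboration), while their inputs are normalized and key/value-projected separately (preservation of modality-specific features).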
Related papers
- Ocean-omni: To Understand the World with Omni-modality [28.306965534325904]
We introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM).
arXiv Detail & Related papers (2024-10-11T06:44:31Z) - mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models [71.40705814904898]
We introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding.
Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space.
arXiv Detail & Related papers (2024-08-09T03:25:42Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including an emergent ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and
Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations (a minimal sketch of this wiring is given after the related-papers list).
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
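For the Macaw-LLM entry above, the three named components can be wired together as follows. This is a short, hypothetical sketch under stated assumptions: per-modality encoders play the role of the modality module, a cross-attention alignment module maps encoder features into the LLM embedding space, and a pretrained LLM acts as the cognitive module. All names, shapes, and the HF-style inputs_embeds call are illustrative assumptions, not the paper's actual code.

```python
# Illustrative Macaw-LLM-style pipeline sketch (assumed names and interfaces).
import torch
import torch.nn as nn


class AlignmentModule(nn.Module):
    """Maps modality encoder features onto a fixed number of query tokens in
    the LLM embedding space via cross-attention (an assumed design)."""

    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches_or_frames, feat_dim) from a modality encoder.
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        aligned, _ = self.attn(q, kv, kv)
        return aligned  # (batch, num_queries, llm_dim)


def multimodal_forward(image_encoder, audio_encoder, align_image, align_audio,
                       llm, text_embeds, image, audio):
    # Modality module: encode each input with its own (typically frozen) encoder.
    img_feats = image_encoder(image)
    aud_feats = audio_encoder(audio)
    # Alignment module: harmonize the representations into the LLM token space.
    img_tokens = align_image(img_feats)
    aud_tokens = align_audio(aud_feats)
    # Cognitive module: a pretrained LLM consumes aligned tokens plus text embeddings.
    inputs = torch.cat([img_tokens, aud_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs)  # assumes an HF-style causal LM interface
```

The alignment step carries the harmonizing role: each modality is compressed to a fixed number of tokens that live in the same embedding space the language model already operates on.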
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.