mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections
- URL: http://arxiv.org/abs/2205.12005v2
- Date: Wed, 25 May 2022 05:49:30 GMT
- Title: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections
- Authors: Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi,
Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei
Huang, Jingren Zhou, Luo Si
- Abstract summary: mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
- Score: 104.14624185375897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pretrained foundation models have been an emerging paradigm for
building artificial intelligence (AI) systems, which can be quickly adapted to
a wide range of downstream tasks. This paper presents mPLUG, a new
vision-language foundation model for both cross-modal understanding and
generation. Most existing pre-trained models suffer from the problems of low
computational efficiency and information asymmetry brought by the long visual
sequence in cross-modal alignment. To address these problems, mPLUG introduces
an effective and efficient vision-language architecture with novel cross-modal
skip-connections, which creates inter-layer shortcuts that skip a certain
number of layers for time-consuming full self-attention on the vision side.
mPLUG is pre-trained end-to-end on large-scale image-text pairs with both
discriminative and generative objectives. It achieves state-of-the-art results
on a wide range of vision-language downstream tasks, such as image captioning,
image-text retrieval, visual grounding and visual question answering. mPLUG
also demonstrates strong zero-shot transferability when directly transferred to
multiple video-language tasks.
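
To make the skip-connection idea concrete, below is a minimal PyTorch sketch of the kind of fusion block the abstract describes. It is an illustration inferred from the abstract only, not the authors' released code; the module names, the number `s` of co-attention layers per block, and all hyperparameters are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a cross-modal
# skip-connected fusion block in PyTorch.
import torch
import torch.nn as nn


class AsymmetricCoAttentionLayer(nn.Module):
    """Updates only the text stream: text self-attention, then cross-attention
    to the visual sequence. The long visual sequence skips self-attention here,
    which is where the efficiency gain described in the abstract comes from."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vision, vision)[0]
        return text + self.ffn(self.norm3(text))


class SkipConnectedFusionBlock(nn.Module):
    """Runs `s` cheap asymmetric co-attention layers, then one 'connected' layer
    of full self-attention over the concatenation [vision; text]. The block-input
    visual features are re-injected at that point (the cross-modal skip-connection),
    so the vision side pays for full self-attention only once per block."""

    def __init__(self, dim: int, s: int = 3, num_heads: int = 8):
        super().__init__()
        self.co_layers = nn.ModuleList(
            [AsymmetricCoAttentionLayer(dim, num_heads) for _ in range(s)]
        )
        self.connected_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        for layer in self.co_layers:
            text = layer(text, vision)
        # Cross-modal skip-connection: the block-input visual features are
        # concatenated with the updated text features for one joint attention pass.
        joint = torch.cat([vision, text], dim=1)
        j = self.norm(joint)
        joint = joint + self.connected_attn(j, j, j)[0]
        return joint[:, vision.size(1):], joint[:, : vision.size(1)]  # new text, new vision


# Toy usage: batch of 2, a 196-token visual sequence, a 16-token text sequence.
block = SkipConnectedFusionBlock(dim=256, s=3)
text_out, vision_out = block(torch.randn(2, 16, 256), torch.randn(2, 196, 256))
```

Under this reading, the quadratic cost of joint self-attention over the long visual sequence is incurred once every `s + 1` layers rather than at every layer, which is the efficiency argument the abstract makes.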
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE [66.48689706116808]
EVE (Efficient Vision-languagE) is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts.
EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
arXiv Detail & Related papers (2023-08-23T07:36:30Z)
- MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training (FILIP) achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.