EVE: Efficient Vision-Language Pre-training with Masked Prediction and
Modality-Aware MoE
- URL: http://arxiv.org/abs/2308.11971v2
- Date: Fri, 1 Mar 2024 11:22:54 GMT
- Title: EVE: Efficient Vision-Language Pre-training with Masked Prediction and
Modality-Aware MoE
- Authors: Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin,
Dongyu Zhang
- Abstract summary: EVE (Efficient Vision-languagE) is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts.
EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
- Score: 66.48689706116808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building scalable vision-language models to learn from diverse, multimodal
data remains an open challenge. In this paper, we introduce an Efficient
Vision-languagE foundation model, namely EVE, which is a unified multimodal
Transformer pre-trained solely with one unified pre-training task. Specifically,
EVE encodes both vision and language within a shared Transformer network
integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which
capture modality-specific information by selectively switching to different
experts. To unify pre-training tasks of vision and language, EVE performs
masked signal modeling on image-text pairs to reconstruct masked signals, i.e.,
image pixels and text tokens, given visible signals. This simple yet effective
pre-training objective accelerates training by 3.5x compared to the model
pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing
to the combination of the unified architecture and pre-training task, EVE is
easy to scale up, enabling better downstream performance with fewer resources
and faster training speed. Despite its simplicity, EVE achieves
state-of-the-art performance on various vision-language downstream tasks,
including visual question answering, visual reasoning, and image-text
retrieval.
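The abstract names two core ingredients: a modality-aware sparse MoE inside a shared Transformer, and a single masked-signal-modeling objective. Below is a minimal PyTorch sketch of one way such a modality-aware MoE feed-forward layer could look, with vision and text tokens sharing the network but being switched to modality-specific expert pools. The class names, expert count, hidden size, and hard top-1 routing here are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a modality-aware MoE feed-forward layer in the spirit of
# EVE's shared Transformer: each token is routed to an expert drawn from the
# pool that matches its modality. Expert counts, sizes, and top-1 routing are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard Transformer feed-forward block used as one expert."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


class ModalityAwareMoE(nn.Module):
    """Routes each token to one expert from its modality's expert pool."""

    def __init__(self, dim=768, hidden=3072, experts_per_modality=2):
        super().__init__()
        self.experts = nn.ModuleDict({
            "vision": nn.ModuleList([Expert(dim, hidden) for _ in range(experts_per_modality)]),
            "text": nn.ModuleList([Expert(dim, hidden) for _ in range(experts_per_modality)]),
        })
        self.routers = nn.ModuleDict({
            "vision": nn.Linear(dim, experts_per_modality),
            "text": nn.Linear(dim, experts_per_modality),
        })

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); `modality` selects the expert pool.
        logits = self.routers[modality](tokens)        # (B, L, num_experts)
        weights = F.softmax(logits, dim=-1)
        top1 = weights.argmax(dim=-1)                  # hard top-1 routing
        out = torch.zeros_like(tokens)
        for idx, expert in enumerate(self.experts[modality]):
            mask = (top1 == idx).unsqueeze(-1)         # (B, L, 1)
            # Scale by the router weight so the gate stays differentiable.
            out = out + mask * weights[..., idx:idx + 1] * expert(tokens)
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoE()
    img_tokens = torch.randn(2, 196, 768)   # e.g. patch embeddings
    txt_tokens = torch.randn(2, 32, 768)    # e.g. word embeddings
    print(layer(img_tokens, "vision").shape, layer(txt_tokens, "text").shape)
```

Under the unified pre-training task described above, a network built from such blocks would be trained to reconstruct masked image pixels (e.g., with a regression head) and masked text tokens (e.g., with a classification head) from the visible signals.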
Related papers
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, which are then passed to large language models (LLMs) for vision-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties; most notably, a new version with updated training samples can be deployed in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model (a minimal sketch of this idea follows this list).
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
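As referenced in the ALBEF entry above, momentum distillation trains a model against soft pseudo-targets produced by a slowly updated (EMA) copy of itself. The sketch below only illustrates that general idea; the mixing weight, momentum value, and toy classifier are assumptions and do not reproduce ALBEF's implementation.

```python
# Minimal sketch of momentum distillation: an EMA "momentum" copy of the model
# produces soft pseudo-targets that are mixed with the hard-label loss.
# `alpha`, the momentum value, and the toy model are illustrative assumptions.
import copy
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(student, teacher, momentum=0.995):
    """Update the momentum (teacher) parameters as an EMA of the student's."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)


def distilled_loss(student, teacher, inputs, labels, alpha=0.4):
    """Cross-entropy on hard labels mixed with the teacher's soft targets."""
    logits = student(inputs)
    with torch.no_grad():
        soft_targets = F.softmax(teacher(inputs), dim=-1)
    hard = F.cross_entropy(logits, labels)
    soft = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1.0 - alpha) * hard + alpha * soft


if __name__ == "__main__":
    student = torch.nn.Linear(16, 4)      # stand-in for a model head
    teacher = copy.deepcopy(student)      # momentum copy, never backpropagated
    for p in teacher.parameters():
        p.requires_grad_(False)

    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss = distilled_loss(student, teacher, x, y)
    loss.backward()
    ema_update(student, teacher)
    print(float(loss))
```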