VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts
- URL: http://arxiv.org/abs/2111.02358v1
- Date: Wed, 3 Nov 2021 17:20:36 GMT
- Title: VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts
- Authors: Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei
- Abstract summary: We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
- Score: 46.55920956687346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a unified Vision-Language pretrained Model (VLMo) that jointly
learns a dual encoder and a fusion encoder with a modular Transformer network.
Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer,
where each block contains a pool of modality-specific experts and a shared
self-attention layer. Because of the modeling flexibility of MoME, pretrained
VLMo can be fine-tuned as a fusion encoder for vision-language classification
tasks, or used as a dual encoder for efficient image-text retrieval. Moreover,
we propose a stagewise pre-training strategy, which effectively leverages
large-scale image-only and text-only data besides image-text pairs.
Experimental results show that VLMo achieves state-of-the-art results on
various vision-language tasks, including VQA and NLVR2. The code and pretrained
models are available at https://aka.ms/vlmo.
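
As a hedged illustration of the architecture described above, the sketch below shows one possible form of a MoME Transformer block: a self-attention layer shared across modalities followed by a pool of modality-specific feed-forward experts. The class and expert names (MoMEBlock, "vision", "text", "vl"), the dimensions, the routing-by-flag scheme, and the pre-norm layout are assumptions made for illustration, not the official VLMo implementation at https://aka.ms/vlmo.

```python
# Minimal sketch of a Mixture-of-Modality-Experts (MoME) block, assuming a
# simplified scheme in which the whole input sequence is routed to a single
# feed-forward expert selected by a modality flag ("vision", "text", or "vl").
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        # Self-attention layer shared by all modalities.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pool of modality-specific feed-forward experts.
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "text", "vl")
        })

    def forward(self, x, modality):
        # Shared attention over all tokens, regardless of modality.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route the sequence to the expert matching its modality.
        x = x + self.experts[modality](self.norm2(x))
        return x


if __name__ == "__main__":
    block = MoMEBlock()
    image_tokens = torch.randn(2, 197, 768)  # e.g. ViT patch tokens
    text_tokens = torch.randn(2, 32, 768)    # e.g. subword embeddings
    print(block(image_tokens, "vision").shape)  # torch.Size([2, 197, 768])
    print(block(text_tokens, "text").shape)     # torch.Size([2, 32, 768])
```

Under this sketch, the dual-encoder use encodes image and text sequences separately through the vision and text experts (for efficient retrieval), while the fusion-encoder use passes a concatenated image-text sequence through the vision-language expert (for classification tasks such as VQA and NLVR2).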
Related papers
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.
It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.
Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z)
- Optimizing Vision-Language Interactions Through Decoder-Only Models [4.219163079329444]
MUDAIF is a vision-language model that seamlessly integrates visual and textual inputs.
It achieves enhanced efficiency, flexibility, and cross-modal understanding.
It is trained on a large-scale dataset of 45M image-text pairs.
arXiv Detail & Related papers (2024-12-14T09:04:32Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components with a Transformer backbone: two single-modal encoders, a cross encoder, and a decoder.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.