VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts
- URL: http://arxiv.org/abs/2111.02358v1
- Date: Wed, 3 Nov 2021 17:20:36 GMT
- Title: VLMo: Unified Vision-Language Pre-Training with
Mixture-of-Modality-Experts
- Authors: Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei
- Abstract summary: We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
- Score: 46.55920956687346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a unified Vision-Language pretrained Model (VLMo) that jointly
learns a dual encoder and a fusion encoder with a modular Transformer network.
Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer,
where each block contains a pool of modality-specific experts and a shared
self-attention layer. Because of the modeling flexibility of MoME, pretrained
VLMo can be fine-tuned as a fusion encoder for vision-language classification
tasks, or used as a dual encoder for efficient image-text retrieval. Moreover,
we propose a stagewise pre-training strategy, which effectively leverages
large-scale image-only and text-only data besides image-text pairs.
Experimental results show that VLMo achieves state-of-the-art results on
various vision-language tasks, including VQA and NLVR2. The code and pretrained
models are available at https://aka.ms/vlmo.
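To make the MoME design concrete, below is a minimal PyTorch sketch of one such block, assuming a pre-LayerNorm Transformer layer with a shared self-attention sublayer and one feed-forward expert per input type. The class name, dimensions, and the whole-sequence modality flag are illustrative assumptions rather than the released VLMo code; in the paper, image and text tokens are routed to their respective experts, and the vision-language expert is introduced only in the upper layers.

```python
# Minimal sketch of a Mixture-of-Modality-Experts (MoME) Transformer block.
# Names and hyperparameters are illustrative, not the released implementation.
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    """Shared self-attention plus a pool of modality-specific FFN experts."""

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        # Self-attention is shared by all modalities.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pool of modality-specific feed-forward experts.
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x, modality="vision_language"):
        # Shared self-attention over the (possibly concatenated image-text) sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route through the expert that matches the input modality.
        return x + self.experts[modality](self.norm2(x))


# Image-only, text-only, and image-text inputs reuse the same attention weights
# but different feed-forward experts.
patches = torch.randn(2, 197, 768)  # e.g. ViT patch embeddings plus [CLS]
out = MoMEBlock()(patches, modality="vision")
```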
Related papers
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction
Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language,
and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- MAGVLT: Masked Generative Vision-and-Language Transformer [15.796199345773879]
We explore a unified generative vision-and-language model that can produce both images and text sequences.
We propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT).
For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks.
arXiv Detail & Related papers (2023-03-21T21:49:39Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with a modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images (a sketch of combining such objectives appears after this list).
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal
Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
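Several of the models above, VL-BEiT most explicitly, are pre-trained by mixing masked-prediction objectives over text-only, image-only, and paired image-text data, which is also the kind of data VLMo's stagewise strategy exploits. The following is a minimal, hypothetical sketch of how such a mixed objective could be formed in a single training step; the model methods, batch keys, and loss weights are assumptions made for illustration and do not come from any of the cited papers.

```python
# Hypothetical sketch: one pre-training step mixing masked-prediction losses
# over text-only, image-only, and paired image-text batches, in the spirit of
# VL-BEiT's combined objectives. Method names, batch keys, and weights are
# illustrative assumptions, not any paper's actual API.
def mixed_masked_pretraining_step(model, batch, optimizer, weights=(1.0, 1.0, 1.0)):
    w_mlm, w_mim, w_mvlm = weights
    loss = (
        w_mlm * model.masked_language_loss(batch["text"])           # masked language modeling
        + w_mim * model.masked_image_loss(batch["image"])           # masked image modeling
        + w_mvlm * model.masked_vl_loss(batch["image_text_pair"])   # masked vision-language modeling
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```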