MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language
Representation Learning
- URL: http://arxiv.org/abs/2210.04183v3
- Date: Wed, 14 Jun 2023 07:26:20 GMT
- Title: MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language
Representation Learning
- Authors: Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing
Liu
- Abstract summary: We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
- Score: 23.45678557013005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g. image pixels), thus producing semantically
rich multimodal representations that perform well on both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
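
The abstract describes the implicit target as predicting latent multimodal representations of the unmasked input. No code accompanies this listing, so the following is a minimal PyTorch sketch of that idea under stated assumptions: a toy single-stream encoder, a momentum (EMA) copy as the target network, and a smooth-L1 regression loss at masked positions. Module sizes, masking ratios, and the [MASK] token id are illustrative, not the authors' implementation.

```python
# Minimal sketch of the implicit target described in the abstract: an online
# encoder sees jointly masked image-text input and predicts the latent
# representations that a momentum (EMA) copy produces from the unmasked input.
# Sizes, the 40% masking ratio, and the smooth-L1 loss are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalEncoder(nn.Module):
    """Toy single-stream encoder over concatenated patch + token embeddings."""

    def __init__(self, vocab=30522, patch_dim=768, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)
        self.tok_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches, tokens):
        x = torch.cat([self.patch_proj(patches), self.tok_emb(tokens)], dim=1)
        return self.encoder(x)  # (B, n_patches + n_tokens, dim)


def masked_latent_loss(online, momentum, patches, tokens, mask_ratio=0.4):
    """Regress the momentum model's latents of the unmasked input at the
    positions that were masked for the online model."""
    B, P, T = patches.size(0), patches.size(1), tokens.size(1)
    mask = torch.rand(B, P + T, device=patches.device) < mask_ratio

    patches_m = patches.masked_fill(mask[:, :P].unsqueeze(-1), 0.0)
    tokens_m = tokens.masked_fill(mask[:, P:], 103)  # 103 ~ BERT [MASK] id

    pred = online(patches_m, tokens_m)
    with torch.no_grad():
        target = momentum(patches, tokens)  # unmasked input, no gradients
    return F.smooth_l1_loss(pred[mask], target[mask])


@torch.no_grad()
def ema_update(momentum, online, m=0.995):
    # Exponential-moving-average update of the momentum (teacher) weights.
    for p_m, p_o in zip(momentum.parameters(), online.parameters()):
        p_m.mul_(m).add_(p_o.detach(), alpha=1 - m)


if __name__ == "__main__":
    online = MultimodalEncoder()
    momentum = copy.deepcopy(online).requires_grad_(False)
    patches = torch.randn(2, 16, 768)          # dummy ViT patch features
    tokens = torch.randint(0, 30522, (2, 12))  # dummy token ids
    loss = masked_latent_loss(online, momentum, patches, tokens)
    loss.backward()
    ema_update(momentum, online)
    print(loss.item())
```

The explicit targets mentioned in the abstract (momentum visual features of image patches and concepts of word tokens) would add further prediction heads over the same masked positions; they are omitted here to keep the sketch focused on the implicit objective.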
Related papers
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising two components that further help the model learn fine-grained representations of image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels that enhance the robustness of the model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input.
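
The summary above mentions cross-distillation soft labels for MLM but gives no implementation detail; as a hedged illustration only, the sketch below uses a frozen copy of a toy language model as the teacher and matches its softened output distribution at masked positions. The actual teacher, temperature, and loss in ViLTA may differ.

```python
# Hedged sketch of "soft labels for MLM": a frozen teacher provides a softened
# distribution at masked positions and the student matches it via KL divergence.
import copy
import torch
import torch.nn.functional as F
from torch import nn


def soft_label_mlm_loss(student, teacher, tokens, mask, temperature=2.0):
    """tokens: (B, T) token ids; mask: (B, T) bool marking masked positions."""
    masked = tokens.masked_fill(mask, 103)          # 103 ~ [MASK] id (assumed)
    logits = student(masked)[mask]                  # (N_masked, vocab)
    with torch.no_grad():
        soft = F.softmax(teacher(tokens)[mask] / temperature, dim=-1)
    return F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                    soft, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    vocab, dim = 30522, 256
    student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
    teacher = copy.deepcopy(student).requires_grad_(False)
    tokens = torch.randint(0, vocab, (2, 12))
    mask = torch.zeros(2, 12, dtype=torch.bool)
    mask[:, :3] = True                              # mask the first 3 tokens
    loss = soft_label_mlm_loss(student, teacher, tokens, mask)
    loss.backward()
    print(loss.item())
```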
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder.
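
As a rough illustration of the word-conditional soft masking described above (not the paper's exact recipe), the sketch below turns attention weights from each word over image regions into a soft mask that down-weights the regions most relevant to that word, yielding perturbed region features for ITM. The single-head dot-product attention and the 1 - attention mask are illustrative assumptions.

```python
# Word-conditional soft masking of image regions (illustrative sketch).
import torch
import torch.nn.functional as F


def soft_mask_regions(word_feats, region_feats):
    """word_feats: (B, T, D); region_feats: (B, R, D) -> (B, T, R, D)."""
    d = word_feats.size(-1)
    attn = torch.einsum("btd,brd->btr", word_feats, region_feats) / d ** 0.5
    attn = F.softmax(attn, dim=-1)        # word-conditional visual attention
    soft_mask = 1.0 - attn                # suppress the most relevant regions
    return soft_mask.unsqueeze(-1) * region_feats.unsqueeze(1)


if __name__ == "__main__":
    words, regions = torch.randn(2, 5, 64), torch.randn(2, 9, 64)
    masked = soft_mask_regions(words, regions)
    print(masked.shape)  # torch.Size([2, 5, 9, 64])
```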
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Masked Vision and Language Modeling for Multi-modal Representation Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with help from the other modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performance by using a large amount of data.
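
To make the cross-modal reconstruction idea concrete, here is a hedged sketch in which masked image patch embeddings query the text sequence through cross-attention and are regressed back to the original patch signal; the text side would be handled symmetrically. The decoder shape and L2 reconstruction loss are assumptions, not this paper's implementation.

```python
# Illustrative cross-modal reconstructor: masked patch embeddings attend over
# the encoded caption to recover the missing image signal.
import torch
import torch.nn.functional as F
from torch import nn


class CrossModalReconstructor(nn.Module):
    def __init__(self, dim=256, heads=4, patch_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, masked_patch_emb, text_emb):
        # Masked patch embeddings query the text sequence for missing content.
        ctx, _ = self.attn(masked_patch_emb, text_emb, text_emb)
        return self.head(ctx)


if __name__ == "__main__":
    model = CrossModalReconstructor()
    patch_emb = torch.randn(2, 16, 256)   # patch embeddings, some masked
    text_emb = torch.randn(2, 12, 256)    # encoded caption tokens
    target = torch.randn(2, 16, 768)      # ground-truth patch signal
    loss = F.mse_loss(model(patch_emb, text_emb), target)
    loss.backward()
    print(loss.item())
```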
arXiv Detail & Related papers (2022-08-03T15:11:01Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)