Data Efficient Masked Language Modeling for Vision and Language
- URL: http://arxiv.org/abs/2109.02040v1
- Date: Sun, 5 Sep 2021 11:27:53 GMT
- Title: Data Efficient Masked Language Modeling for Vision and Language
- Authors: Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz
- Abstract summary: Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings.
- Score: 16.95631509102115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM) is one of the key sub-tasks in vision-language
pretraining. In the cross-modal setting, tokens in the sentence are masked at
random, and the model predicts the masked tokens given the image and the text.
In this paper, we observe several key disadvantages of MLM in this setting.
First, as captions tend to be short, in a third of the sentences no token is
sampled. Second, the majority of masked tokens are stop-words and punctuation,
leading to under-utilization of the image. We investigate a range of
alternative masking strategies specific to the cross-modal setting that address
these shortcomings, aiming for better fusion of text and image in the learned
representation. When pre-training the LXMERT model, our alternative masking
strategies consistently improve over the original masking strategy on three
downstream tasks, especially in low resource settings. Further, our
pre-training approach substantially outperforms the baseline model on a
prompt-based probing task designed to elicit image objects. These results and
our analysis indicate that our method allows for better utilization of the
training data.
Related papers
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In fashion domain, datasets often exhibit a disparity between the information conveyed in image and text.
We propose Synchronized attentional Masking (SyncMask), which generate masks that pinpoint the image patches and word tokens where the information co-occur in both image and text.
arXiv Detail & Related papers (2024-04-01T15:01:38Z) - BiLMa: Bidirectional Local-Matching for Text-based Person
Re-identification [2.3931689873603603]
Text-based person re-identification (TBPReID) aims to retrieve person images represented by a given textual query.
How to effectively align images and texts globally and locally is a crucial challenge.
We introduce Bidirectional Local-Matching (LMa) framework that jointly optimize Masked Image Modeling (MIM) in TBPReID model training.
arXiv Detail & Related papers (2023-09-09T04:01:24Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - StrucTexTv2: Masked Visual-Textual Prediction for Document Image
Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z) - Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z) - Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training)
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z) - Masked Vision and Language Modeling for Multi-modal Representation
Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performances by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - LayoutLMv3: Pre-training for Document AI with Unified Text and Image
Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z) - Open-Vocabulary Instance Segmentation via Robust Cross-Modal
Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.