MLIM: Vision-and-Language Model Pre-training with Masked Language and
Image Modeling
- URL: http://arxiv.org/abs/2109.12178v1
- Date: Fri, 24 Sep 2021 20:25:40 GMT
- Title: MLIM: Vision-and-Language Model Pre-training with Masked Language and
Image Modeling
- Authors: Tarik Arici, Mehmet Saygin Seyfioglu, Tal Neiman, Yi Xu, Son Tran,
Trishul Chilimbi, Belinda Zeng, and Ismail Tutar
- Abstract summary: Masked Language and Image Modeling (MLIM) uses two loss functions: Masked Language Modeling (MLM) loss and image reconstruction (RECON) loss.
We propose Modality Aware Masking (MAM) to boost cross-modality interaction.
- Score: 14.563358764946498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Pre-training (VLP) improves model performance for
downstream tasks that require image and text inputs. Current VLP approaches
differ on (i) model architecture (especially image embedders), (ii) loss
functions, and (iii) masking policies. Image embedders are either deep models
like ResNet or linear projections that directly feed image-pixels into the
transformer. Typically, in addition to the Masked Language Modeling (MLM) loss,
alignment-based objectives are used for cross-modality interaction, and RoI
feature regression and classification tasks for Masked Image-Region Modeling
(MIRM). Both alignment and MIRM objectives mostly do not have ground truth.
Alignment-based objectives require pairings of image and text and heuristic
objective functions. MIRM relies on object detectors. Masking policies either
do not take advantage of multi-modality or are strictly coupled with alignments
generated by other models. In this paper, we present Masked Language and Image
Modeling (MLIM) for VLP. MLIM uses two loss functions: Masked Language Modeling
(MLM) loss and image reconstruction (RECON) loss. We propose Modality Aware
Masking (MAM) to boost cross-modality interaction and take advantage of MLM and
RECON losses that separately capture text and image reconstruction quality.
Using MLM + RECON tasks coupled with MAM, we present a simplified VLP
methodology and show that it has better downstream task performance on a
proprietary e-commerce multi-modal dataset.
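The abstract specifies only the two losses (MLM and RECON) and the idea of Modality Aware Masking, not implementation details. Below is a minimal sketch of how such a combined objective could look, assuming a linear patch embedder, a shared transformer encoder, and a per-sample coin flip that heavily masks one modality; all module names, masking ratios, and the coin-flip rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a joint MLM + RECON objective with a modality-aware
# masking switch, in the spirit of MLIM. Module names, masking ratios, and the
# per-sample modality choice are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, PATCH_DIM, MASK_ID = 30522, 256, 768, 103   # assumed vocab size, widths, [MASK] id

class ToyMLIM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.img_proj = nn.Linear(PATCH_DIM, D)        # linear image embedder (one option named in the abstract)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(D, VOCAB)            # predicts masked text tokens
        self.recon_head = nn.Linear(D, PATCH_DIM)      # reconstructs image patches

    def forward(self, token_ids, patches):
        x = torch.cat([self.tok_emb(token_ids), self.img_proj(patches)], dim=1)
        h = self.encoder(x)
        n_text = token_ids.size(1)
        return self.mlm_head(h[:, :n_text]), self.recon_head(h[:, n_text:])

def modality_aware_mask(token_ids, patches, ratio=0.5):
    """Per sample, mask one modality heavily so it must be reconstructed with
    help from the other modality (one plausible reading of MAM)."""
    ids, px = token_ids.clone(), patches.clone()
    txt_mask = torch.zeros_like(token_ids, dtype=torch.bool)
    img_mask = torch.zeros(patches.shape[:2], dtype=torch.bool)
    for b in range(token_ids.size(0)):
        if torch.rand(()) < 0.5:                       # mask the text side for this sample
            sel = torch.rand(token_ids.size(1)) < ratio
            ids[b, sel] = MASK_ID
            txt_mask[b] = sel
        else:                                          # mask the image side for this sample
            sel = torch.rand(patches.size(1)) < ratio
            px[b, sel] = 0.0
            img_mask[b] = sel
    return ids, px, txt_mask, img_mask

# One illustrative training step on a fake batch.
model = ToyMLIM()
token_ids = torch.randint(0, VOCAB, (4, 16))           # 4 captions of 16 tokens
patches = torch.randn(4, 49, PATCH_DIM)                # 4 images as 7x7 patch grids
ids_m, px_m, txt_mask, img_mask = modality_aware_mask(token_ids, patches)
logits, recon = model(ids_m, px_m)
mlm_loss = F.cross_entropy(logits[txt_mask], token_ids[txt_mask]) if txt_mask.any() else logits.sum() * 0
recon_loss = F.mse_loss(recon[img_mask], patches[img_mask]) if img_mask.any() else recon.sum() * 0
(mlm_loss + recon_loss).backward()                     # MLM + RECON, as described in the abstract
```

In this reading of MAM, heavily masking one modality per sample forces the encoder to recover it from the other modality, so the MLM and RECON losses separately measure how well text and image content are reconstructed.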
Related papers
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer.
To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM).
Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different-modality medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, to align the tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR, within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- Improving Cross-modal Alignment for Text-Guided Image Inpainting [36.1319565907582]
Text-guided image inpainting (TGII) aims to restore missing regions based on a given text in a damaged image.
We propose a novel model for TGII by improving cross-modal alignment.
Our model achieves state-of-the-art performance compared with other strong competitors.
arXiv Detail & Related papers (2023-01-26T19:18:27Z)
- Masked Vision and Language Modeling for Multi-modal Representation Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performances by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)