MLIM: Vision-and-Language Model Pre-training with Masked Language and
Image Modeling
- URL: http://arxiv.org/abs/2109.12178v1
- Date: Fri, 24 Sep 2021 20:25:40 GMT
- Title: MLIM: Vision-and-Language Model Pre-training with Masked Language and
Image Modeling
- Authors: Tarik Arici, Mehmet Saygin Seyfioglu, Tal Neiman, Yi Xu, Son Tran,
Trishul Chilimbi, Belinda Zeng, and Ismail Tutar
- Abstract summary: Masked Language and Image Modeling (MLIM) uses two loss functions: Masked Language Modeling (MLM) loss and image reconstruction (RECON) loss.
We propose Modality Aware Masking (MAM) to boost cross-modality interaction.
- Score: 14.563358764946498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Pre-training (VLP) improves model performance for
downstream tasks that require image and text inputs. Current VLP approaches
differ on (i) model architecture (especially image embedders), (ii) loss
functions, and (iii) masking policies. Image embedders are either deep models
like ResNet or linear projections that directly feed image-pixels into the
transformer. Typically, in addition to the Masked Language Modeling (MLM) loss,
alignment-based objectives are used for cross-modality interaction, and RoI
feature regression and classification tasks for Masked Image-Region Modeling
(MIRM). Both alignment and MIRM objectives mostly do not have ground truth.
Alignment-based objectives require pairings of image and text and heuristic
objective functions. MIRM relies on object detectors. Masking policies either
do not take advantage of multi-modality or are strictly coupled with alignments
generated by other models. In this paper, we present Masked Language and Image
Modeling (MLIM) for VLP. MLIM uses two loss functions: Masked Language Modeling
(MLM) loss and image reconstruction (RECON) loss. We propose Modality Aware
Masking (MAM) to boost cross-modality interaction and take advantage of MLM and
RECON losses that separately capture text and image reconstruction quality.
Using MLM + RECON tasks coupled with MAM, we present a simplified VLP
methodology and show that it has better downstream task performance on a
proprietary e-commerce multi-modal dataset.
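The core idea behind Modality Aware Masking can be illustrated with a short sketch: per example, corrupt tokens in only one modality so the model must lean on the other (unmasked) modality to reconstruct, encouraging cross-modality interaction. The masking rates and the per-example modality choice below are illustrative assumptions, not the paper's exact policy.

```python
import random


def modality_aware_masking(num_text_tokens, num_image_patches,
                           p_text=0.15, p_image=0.5, rng=None):
    """Sketch of Modality Aware Masking (MAM), assuming a simple policy:
    pick one modality per example and mask positions only in it, so the
    MLM loss (text) or RECON loss (image) is driven by cross-modal context.

    Returns the chosen modality plus boolean masks over text tokens and
    image patches (True = masked)."""
    rng = rng or random.Random()
    # Choose which single modality to corrupt for this example.
    masked_modality = rng.choice(["text", "image"])
    text_mask = [
        masked_modality == "text" and rng.random() < p_text
        for _ in range(num_text_tokens)
    ]
    image_mask = [
        masked_modality == "image" and rng.random() < p_image
        for _ in range(num_image_patches)
    ]
    return masked_modality, text_mask, image_mask
```

In this sketch the MLM and RECON losses would then be computed only over the masked positions of the chosen modality; the exact masking rates and scheduling in MLIM may differ.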
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs [23.011836329934255]
Vision Dynamic Embedding-Guided Pretraining (VDEP) is a hybrid autoregressive training paradigm for MLLMs.
The proposed method seamlessly integrates into standard models without architectural changes.
Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.
arXiv Detail & Related papers (2025-02-13T09:04:28Z)
- OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer.
To model the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM).
Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, to align the tokens, we use four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR, within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z)
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.