Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training
- URL: http://arxiv.org/abs/2403.00249v1
- Date: Fri, 1 Mar 2024 03:25:58 GMT
- Title: Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training
- Authors: Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang
Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
- Abstract summary: Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
- Score: 87.69394953339238
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In vision-language pre-training (VLP), masked image modeling (MIM) has
recently been introduced for fine-grained cross-modal alignment. However, in
most existing methods, the reconstruction targets for MIM lack high-level
semantics, and text is not sufficiently involved in masked modeling. These two
drawbacks limit the effect of MIM in facilitating cross-modal semantic
alignment. In this work, we propose a semantics-enhanced cross-modal MIM
framework (SemMIM) for vision-language representation learning. Specifically,
to provide more semantically meaningful supervision for MIM, we propose a local
semantics enhancing approach, which harvests high-level semantics from global
image features via self-supervised agreement learning and transfers them to
local patch encodings by sharing the encoding space. Moreover, to achieve deep
involvement of text during the entire MIM process, we propose a text-guided
masking strategy and devise an efficient way of injecting textual information
in both masked modeling and reconstruction target acquisition. Experimental
results validate that our method improves the effectiveness of the MIM task in
facilitating cross-modal semantic alignment. Compared to previous VLP models
with similar model size and data scale, our SemMIM model achieves
state-of-the-art or competitive performance on multiple downstream
vision-language tasks.
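The abstract describes the text-guided masking strategy only at a high level. As a rough illustration (not the authors' implementation), the sketch below assumes that text-to-patch relevance scores computed from patch and token embeddings are used to select which patches to mask; the function name, tensor shapes, and masking ratio are all hypothetical.

```python
# Minimal sketch of a text-guided patch masking strategy (assumption: not the
# authors' code; names, shapes, and the masking ratio are hypothetical).
import torch


def text_guided_mask(patch_emb: torch.Tensor, text_emb: torch.Tensor,
                     mask_ratio: float = 0.4) -> torch.Tensor:
    """patch_emb: (B, N, D) image patch encodings; text_emb: (B, T, D) text token encodings."""
    # Cross-modal relevance: how strongly each patch is attended, on average,
    # across the text tokens of the paired caption.
    sim = torch.einsum("bnd,btd->bnt", patch_emb, text_emb)   # (B, N, T)
    relevance = sim.softmax(dim=1).mean(dim=-1)               # (B, N)

    num_mask = int(patch_emb.size(1) * mask_ratio)
    # Prefer masking the patches most relevant to the text, so that recovering
    # them requires the model to exploit textual context.
    mask_idx = relevance.topk(num_mask, dim=1).indices        # (B, num_mask)
    mask = torch.zeros(patch_emb.shape[:2], dtype=torch.bool, device=patch_emb.device)
    mask.scatter_(1, mask_idx, True)
    return mask  # True marks a masked patch
```

In a full pipeline of this kind, the masked patches would then be replaced by a learnable mask embedding before the fusion encoder, and only those positions would incur the MIM reconstruction loss; this sketch covers only the patch-selection step.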
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Global and Local Semantic Completion Learning for Vision-Language
Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z) - Improving Cross-modal Alignment for Text-Guided Image Inpainting [36.1319565907582]
Text-guided image inpainting (TGII) aims to restore missing regions of a damaged image based on a given text.
We propose a novel model for TGII by improving cross-modal alignment.
Our model achieves state-of-the-art performance compared with other strong competitors.
arXiv Detail & Related papers (2023-01-26T19:18:27Z) - Masked Visual Reconstruction in Language Semantic Space [38.43966132249977]
The Masked visual Reconstruction In Language semantic Space (RILS) pre-training framework is presented.
RILS transforms vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets.
Our method exhibits advanced transferability on downstream classification, detection, and segmentation.
arXiv Detail & Related papers (2023-01-17T15:32:59Z) - Seeing What You Miss: Vision-Language Pre-training with Semantic
Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z) - MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language
Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z) - Masked Vision and Language Modeling for Multi-modal Representation
Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performance by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)