Seeing What You Miss: Vision-Language Pre-training with Semantic
Completion Learning
- URL: http://arxiv.org/abs/2211.13437v2
- Date: Sun, 26 Mar 2023 13:59:36 GMT
- Title: Seeing What You Miss: Vision-Language Pre-training with Semantic
Completion Learning
- Authors: Yatai Ji, Rongcheng Tu, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe
Zhao, Hongfa Wang, Yujiu Yang, Wei Liu
- Abstract summary: Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
- Score: 22.464424641734652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal alignment is essential for vision-language pre-training (VLP)
models to learn the correct corresponding information across different
modalities. For this purpose, inspired by the success of masked language
modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling
tasks have been proposed for VLP to further promote cross-modal interactions.
The core idea of previous masked modeling tasks is to focus on reconstructing
the masked tokens based on visible context for learning local-to-local
alignment. However, most of them pay little attention to the global semantic
features generated for the masked data, resulting in a limited cross-modal
alignment ability of global representations. Therefore, in this paper, we
propose a novel Semantic Completion Learning (SCL) task, complementary to
existing masked modeling tasks, to facilitate global-to-local alignment.
Specifically, the SCL task complements the missing semantics of masked data by
capturing the corresponding information from the other modality, which promotes
learning more representative global features that strongly affect the
performance of downstream tasks. Moreover, we present a flexible vision
encoder, which enables our model to perform image-text and video-text
multimodal tasks simultaneously. Experimental results show that our proposed
method obtains state-of-the-art performance on various vision-language
benchmarks, such as visual question answering, image-text retrieval, and
video-text retrieval.
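Illustrative sketch (not from the paper's code): the abstract describes SCL as recovering the global semantics of masked data from the other modality, but it does not state the training objective. The snippet below assumes an InfoNCE-style contrastive loss that pulls the global feature of a masked sample toward the global feature of the same sample seen unmasked, using the other samples in the batch as negatives. The function names, the temperature value of 0.07, and the use of NumPy are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row vector to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def semantic_completion_loss(masked_global, complete_global, temperature=0.07):
    """Hypothetical InfoNCE-style completion loss.

    masked_global:   (B, D) global features computed from masked inputs
                     after fusion with the other, unmasked modality.
    complete_global: (B, D) global features computed from the same
                     inputs without masking.
    Row i of masked_global should match row i of complete_global; every
    other row in the batch is treated as a negative.
    """
    m = l2_normalize(masked_global)
    c = l2_normalize(complete_global)
    logits = (m @ c.T) / temperature                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # matched pairs only


# Toy check: features that already match give a much lower loss than
# features paired with the wrong sample.
complete = np.eye(4)
loss_aligned = semantic_completion_loss(complete, complete)
loss_shuffled = semantic_completion_loss(np.roll(complete, 1, axis=0), complete)
```

In this toy setup loss_aligned is close to zero while loss_shuffled is large, which is the behavior a completion objective of this kind would reward: the masked view's global feature is only "correct" when it lands near the feature of its own complete view.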
Related papers
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
- Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
- MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
- Masked Vision and Language Modeling for Multi-modal Representation Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performances by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)