Leveraging per Image-Token Consistency for Vision-Language Pre-training
- URL: http://arxiv.org/abs/2211.15398v2
- Date: Sat, 2 Sep 2023 08:42:04 GMT
- Title: Leveraging per Image-Token Consistency for Vision-Language Pre-training
- Authors: Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang
- Abstract summary: Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
- Score: 52.825150269820696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing vision-language pre-training (VLP) approaches adopt cross-modal
masked language modeling (CMLM) to learn vision-language associations. However,
we find that CMLM is insufficient for this purpose according to our
observations: (1) Modality bias: a considerable amount of masked tokens in CMLM
can be recovered with only the language information, ignoring the visual
inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on
the masked token but it cannot simultaneously leverage other tokens to learn
vision-language associations. To handle those limitations, we propose EPIC
(lEveraging Per Image-Token Consistency for vision-language pre-training). In
EPIC, for each image-sentence pair, we mask tokens that are salient to the
image (i.e., Saliency-based Masking Strategy) and replace them with
alternatives sampled from a language model (i.e., Inconsistent Token Generation
Procedure), and then the model is required to determine for each token in the
sentence whether it is consistent with the image (i.e., Image-Token Consistency
Task). The proposed EPIC method is easily combined with pre-training methods.
Extensive experiments show that the combination of the EPIC method and
state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and
X-VLM, leads to significant improvements on downstream tasks. The code is
released at https://github.com/gyhdog99/epic.
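To make the three components above concrete, here is a minimal PyTorch-style sketch of the per-token training signal the abstract describes. It is a sketch under stated assumptions, not the authors' implementation (that lives in the linked repository): how the saliency scores are computed is left open, and `corruptor_lm`, `vl_model`, and `consistency_head` are illustrative placeholders.

```python
# Minimal sketch of the EPIC training signal described in the abstract.
# NOT the authors' code (see https://github.com/gyhdog99/epic); all
# module and function names here are illustrative assumptions.
import torch
import torch.nn.functional as F


def saliency_based_masking(token_ids, saliency_scores, mask_ratio=0.15):
    """Select the tokens most salient to the paired image for replacement.

    `saliency_scores` has shape (batch, seq_len); higher = more image-relevant.
    How these scores are obtained is left open in this sketch.
    """
    num_to_mask = max(1, int(token_ids.size(1) * mask_ratio))
    _, top_idx = saliency_scores.topk(num_to_mask, dim=1)
    selected = torch.zeros_like(token_ids, dtype=torch.bool)
    selected.scatter_(1, top_idx, True)
    return selected  # True = token will be replaced


def inconsistent_token_generation(token_ids, selected, corruptor_lm):
    """Replace selected tokens with alternatives sampled from a language model,
    so the sentence stays fluent but is no longer consistent with the image."""
    with torch.no_grad():
        logits = corruptor_lm(token_ids)                   # (batch, seq_len, vocab)
        sampled = torch.distributions.Categorical(logits=logits).sample()
    corrupted = torch.where(selected, sampled, token_ids)
    # A sampled token may coincide with the original; such positions stay consistent.
    labels = (corrupted != token_ids).float()              # 1 = inconsistent with image
    return corrupted, labels


def image_token_consistency_loss(vl_model, consistency_head, images, corrupted, labels):
    """Binary prediction for every token: is it consistent with the image?"""
    token_feats = vl_model(images, corrupted)              # (batch, seq_len, dim)
    logits = consistency_head(token_feats).squeeze(-1)     # (batch, seq_len)
    return F.binary_cross_entropy_with_logits(logits, labels)
```

The key point the sketch illustrates is that every position in the corrupted sentence receives a consistency label, so unreplaced tokens also contribute to learning vision-language associations, unlike CMLM, which supervises only the masked positions.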
Related papers
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok).
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [31.281236193979165]
We study the task of extending the large language model (LLM) into a vision-language instruction-following model.
Existing methods typically train a visual adapter to align the representations of a pre-trained vision transformer (ViT) and the LLM with a generative image captioning loss.
We propose CG-VLM, which applies Contrastive and Generative alignment objectives to effectively align the representations of the ViT and the LLM.
arXiv Detail & Related papers (2023-11-29T03:29:46Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a visual tokenizer that translates images into sequences of discrete tokens, treating the visual input like a foreign language.
The resulting visual tokens carry word-level semantics and support a dynamic sequence length that varies with the image.
This unification empowers the resulting model, LaVIT, to serve as a generalist interface that understands and generates multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
Building on contrastive learning-based vision-language pre-training approaches such as CLIP, we propose representing both images and texts with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text (a minimal sketch of this random-masking baseline appears after this list).
We investigate a range of alternative masking strategies specific to the cross-modal setting that address the shortcomings of random masking.
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
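For contrast with the image-token consistency objective above, the following is a minimal sketch of the random cross-modal masking described in the "Data Efficient Masked Language Modeling for Vision and Language" entry. It assumes a BERT-style [MASK] token and omits the usual 80/10/10 replacement split; `vl_model` and the other identifiers are illustrative placeholders, not any particular library's API.

```python
# Minimal sketch of random masking for cross-modal MLM (the CMLM baseline).
# Identifiers such as `vl_model` and `mask_token_id` are illustrative assumptions.
import torch
import torch.nn.functional as F


def random_mask(token_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Mask each token independently with probability `mask_prob`
    (simplified: no 80/10/10 split, no special-token protection)."""
    masked_positions = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels = token_ids.masked_fill(~masked_positions, ignore_index)  # loss only on masked tokens
    masked_ids = token_ids.masked_fill(masked_positions, mask_token_id)
    return masked_ids, labels


def cmlm_loss(vl_model, images, masked_ids, labels, ignore_index=-100):
    """The model predicts the original tokens given the image and the masked sentence."""
    logits = vl_model(images, masked_ids)                  # (batch, seq_len, vocab)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1),
                           ignore_index=ignore_index)
```

Because the loss is computed only at masked positions, and many of those tokens can be recovered from the language context alone, this objective exhibits the modality bias and token under-utilization that EPIC's per-token consistency task is designed to address.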