Single-Stream Multi-Level Alignment for Vision-Language Pretraining
- URL: http://arxiv.org/abs/2203.14395v2
- Date: Wed, 30 Mar 2022 03:54:27 GMT
- Title: Single-Stream Multi-Level Alignment for Vision-Language Pretraining
- Authors: Zaid Khan, Vijay Kumar BG, Xiang Yu, Samuel Schulter, Manmohan
Chandraker, Yun Fu
- Abstract summary: We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
- Score: 103.09776737512078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large-scale vision-language pre-training has shown the
importance of aligning the visual and text modalities for downstream
vision-language tasks. Many methods use a dual-stream architecture that fuses
visual tokens and language tokens after representation learning, which aligns
only at a global level and cannot extract finer-scale semantics. In contrast,
we propose a single stream model that aligns the modalities at multiple levels:
i) instance level, ii) fine-grained patch level, iii) conceptual semantic
level. We achieve this using two novel tasks: symmetric cross-modality
reconstruction and pseudo-labeled key word prediction. In the former task, we
mask the input tokens of one modality and use cross-modal information to
reconstruct the masked tokens, thus improving fine-grained alignment between
the two modalities. In the latter task, we parse the caption to select a few
key words and feed them, together with the momentum encoder's pseudo signal,
to self-supervise the visual encoder, forcing it to learn rich semantic
concepts that are essential for grounding a textual token to an image
region. We demonstrate top performance on a set of Vision-Language downstream
tasks such as zero-shot/fine-tuned image/text retrieval, referring expression,
and VQA. We also demonstrate how the proposed models can align the modalities
at multiple levels.
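As a concrete illustration, the following is a minimal PyTorch sketch of the two pretraining objectives described above: symmetric cross-modality reconstruction (mask one modality and reconstruct it from the joint single-stream encoding, so the prediction must rely on the other modality) and pseudo-labeled key word prediction (a multi-label head over a key word vocabulary supervising the visual representation). All module names, dimensions, the mask ratio, and the equal loss weighting are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleStreamAlignmentSketch(nn.Module):
    def __init__(self, dim=768, vocab_size=30522, num_keywords=1000, mask_ratio=0.15):
        super().__init__()
        # One transformer encoder consumes the concatenated visual + text tokens
        # (the "single stream" in the title).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.text_head = nn.Linear(dim, vocab_size)       # predict masked word ids
        self.patch_head = nn.Linear(dim, dim)             # regress masked patch features
        self.keyword_head = nn.Linear(dim, num_keywords)  # multi-label key word prediction
        self.mask_ratio = mask_ratio

    def mask_tokens(self, tokens):
        """Randomly replace a fraction of tokens with a learned [MASK] embedding."""
        b, n, _ = tokens.shape
        mask = torch.rand(b, n, device=tokens.device) < self.mask_ratio
        masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, -1), tokens)
        return masked, mask

    def forward(self, patch_emb, text_emb, text_ids, keyword_targets):
        n_patches = patch_emb.size(1)

        # Symmetric cross-modality reconstruction: mask one modality at a time
        # and reconstruct it from the joint stream.
        masked_text, text_mask = self.mask_tokens(text_emb)
        joint = self.encoder(torch.cat([patch_emb, masked_text], dim=1))
        loss_text = F.cross_entropy(
            self.text_head(joint[:, n_patches:])[text_mask], text_ids[text_mask])

        masked_patches, patch_mask = self.mask_tokens(patch_emb)
        joint = self.encoder(torch.cat([masked_patches, text_emb], dim=1))
        loss_patch = F.mse_loss(
            self.patch_head(joint[:, :n_patches])[patch_mask], patch_emb[patch_mask])

        # Pseudo-labeled key word prediction: keyword_targets is a multi-hot
        # vector over a key word vocabulary, built by parsing the caption and
        # (per the abstract) combined with soft pseudo-labels from a momentum
        # encoder; here it is taken as given.
        pooled_visual = self.encoder(patch_emb).mean(dim=1)
        loss_keyword = F.binary_cross_entropy_with_logits(
            self.keyword_head(pooled_visual), keyword_targets)

        return loss_text + loss_patch + loss_keyword


# Illustrative usage with random stand-ins for real features and labels.
model = SingleStreamAlignmentSketch()
loss = model(
    patch_emb=torch.randn(2, 196, 768),                      # ViT patch embeddings
    text_emb=torch.randn(2, 32, 768),                        # caption token embeddings
    text_ids=torch.randint(0, 30522, (2, 32)),               # caption word-piece ids
    keyword_targets=torch.zeros(2, 1000).bernoulli_(0.01),   # multi-hot key words
)
loss.backward()
```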
Related papers
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics comparable to a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising two components that help the model learn fine-grained representations from image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels that enhance the robustness of the model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input.
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model; a minimal sketch of this mechanism appears after this list.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels [35.57369098866317]
Vision-language pre-training on large-scale image-text pairs has witnessed rapid progress for learning cross-modal representations.
We propose a new pre-training method which jointly aligns both the low-level and high-level semantics between image and text representations.
arXiv Detail & Related papers (2021-03-14T02:39:14Z)
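Momentum distillation, used by ALBEF above and echoed in the "momentum encoder pseudo signal" of the main paper, amounts to an exponential-moving-average (EMA) copy of the model producing soft pseudo-targets. The sketch below is a generic, hedged illustration of that mechanism; the momentum value 0.995, the mixing weight alpha, and the function names are assumptions, not either paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.995):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)


def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """The momentum model starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


def distillation_loss(student, teacher, inputs, labels, alpha: float = 0.4):
    """Mix the hard-label loss with a soft loss against the momentum teacher's
    predictions (the pseudo-targets)."""
    logits = student(inputs)
    with torch.no_grad():
        pseudo_targets = F.softmax(teacher(inputs), dim=-1)  # soft pseudo-targets
    hard_loss = F.cross_entropy(logits, labels)
    soft_loss = -(pseudo_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1 - alpha) * hard_loss + alpha * soft_loss
```

After each optimizer step on the student, calling ema_update(student, teacher) keeps the teacher a slowly moving average of the student, which is what makes its predictions stable enough to serve as pseudo-targets.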
This list is automatically generated from the titles and abstracts of the papers in this site.