Diversifying Joint Vision-Language Tokenization Learning
- URL: http://arxiv.org/abs/2306.03421v2
- Date: Thu, 15 Jun 2023 20:43:45 GMT
- Title: Diversifying Joint Vision-Language Tokenization Learning
- Authors: Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova
- Abstract summary: Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.
We propose joint vision-language representation learning by diversifying the tokenization learning process.
- Score: 51.82353485389527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building joint representations across images and text is an essential step
for tasks such as Visual Question Answering and Video Question Answering. In
this work, we find that the representations must not only jointly capture
features from both modalities but should also be diverse for better
generalization performance. To this end, we propose joint vision-language
representation learning by diversifying the tokenization learning process,
enabling tokens that are sufficiently disentangled from each other to be
learned from both modalities. We observe that our approach outperforms the
baseline models in a majority of settings and is competitive with
state-of-the-art methods.
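The core idea of the abstract, learning tokens that are "sufficiently disentangled from each other", can be illustrated with a minimal sketch. This is not the paper's actual implementation; `token_diversity_penalty` is a hypothetical regularizer that penalizes pairwise cosine similarity among jointly learned tokens, which is one common way to encourage diversity.

```python
import numpy as np

def token_diversity_penalty(tokens):
    """Hypothetical diversity regularizer: penalize pairwise cosine
    similarity among learned joint tokens so they stay disentangled.
    tokens: (num_tokens, dim) array of token embeddings."""
    # L2-normalize each token embedding
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    sim = unit @ unit.T                       # cosine similarity matrix
    off_diag = sim - np.eye(len(tokens))      # drop self-similarity
    # mean squared off-diagonal similarity: 0 when tokens are orthogonal
    n = len(tokens)
    return float((off_diag ** 2).sum() / (n * (n - 1)))

print(token_diversity_penalty(np.eye(3)))       # orthogonal tokens -> 0.0
print(token_diversity_penalty(np.ones((3, 4)))) # collapsed tokens  -> 1.0
```

In a training loop, a term like this would be added to the main VQA objective so that gradient descent trades task accuracy against token redundancy.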
Related papers
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Unifying Vision-Language Representation Space with Single-tower Transformer [29.604520441315135]
We train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner.
We discover intriguing properties that distinguish OneR from the previous works that learn modality-specific representation spaces.
arXiv Detail & Related papers (2022-11-21T02:34:21Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Deep Partial Multi-View Learning [94.39367390062831]
We propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets).
We first provide a formal definition of completeness and versatility for multi-view representation.
We then theoretically prove the versatility of the learned latent representations.
arXiv Detail & Related papers (2020-11-12T02:29:29Z)
- A Novel Attention-based Aggregation Function to Combine Vision and Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z)
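The entry above describes scoring each element of a modality and reducing the set to a single vector. As a rough illustration only, the sketch below uses a generic scaled dot-product scoring followed by a softmax-weighted average; the paper's actual cross-attention variant is not reproduced here, and `attentive_reduce` is a hypothetical name.

```python
import math

def attentive_reduce(features, query):
    """Hypothetical attentive reduction: score each element of a modality
    against a query vector and return their softmax-weighted average.
    features: list of equal-length vectors; query: vector of that length."""
    dim = len(query)
    # scaled dot-product scores, as in standard attention
    scores = [sum(f[i] * query[i] for i in range(dim)) / math.sqrt(dim)
              for f in features]
    # numerically stable softmax over the element scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum collapses the variable-length set to one vector
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]

# Uniform scores give a plain average; a peaked query selects one element.
print(attentive_reduce([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))   # [0.5, 0.5]
print(attentive_reduce([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0]))  # ~[1.0, 0.0]
```

A reduction like this replaces fixed mean- or max-pooling, letting the model decide which visual regions or words dominate the fused representation.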
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.