CAVL: Learning Contrastive and Adaptive Representations of Vision and
Language
- URL: http://arxiv.org/abs/2304.04399v1
- Date: Mon, 10 Apr 2023 05:54:03 GMT
- Title: CAVL: Learning Contrastive and Adaptive Representations of Vision and
Language
- Authors: Shentong Mo, Jingfei Xia, Ihor Markevych
- Abstract summary: Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
- Score: 10.57079240576682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual and linguistic pre-training aims to learn vision and language
representations together, which can be transferred to visual-linguistic
downstream tasks. However, there exists semantic confusion between language and
vision during the pre-training stage. Moreover, current pre-trained models tend
to require substantial computational resources for fine-tuning when transferred to
downstream tasks. In this work, we present a simple but effective approach for
learning Contrastive and Adaptive representations of Vision and Language,
namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn
alignments between the whole sentence and each image in the same batch during
the pre-training process. At the fine-tuning stage, we introduce two
lightweight adaptation networks to reduce model parameters and increase
training speed for saving computation resources. We evaluate our CAVL on six
main downstream tasks, including Visual Question Answering (VQA), Visual
Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR),
Region-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot
Text-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior
performance and reduce the fine-tuning time by a large margin (in particular,
76.17%). Extensive experiments and ablation studies demonstrate the efficiency
of contrastive pre-training and adaptive fine-tuning proposed in our CAVL.
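The abstract names the two key ingredients without spelling them out, so a minimal PyTorch-style sketch may help make them concrete: (a) a pair-wise image-sentence contrastive loss computed over a batch, and (b) a lightweight bottleneck adapter that leaves the pre-trained backbone frozen at fine-tuning time. The function and module names, dimensions, and temperature below are illustrative assumptions, not the released CAVL implementation.

```python
# Illustrative sketch only: a batch-wise image-sentence contrastive loss and a
# lightweight bottleneck adapter; names, sizes, and the temperature are
# assumptions for exposition, not the actual CAVL code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pairwise_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Align each sentence with its paired image; all other pairs in the
    batch serve as negatives (and vice versa)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> matching sentence
    loss_t2i = F.cross_entropy(logits.t(), targets)       # sentence -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

class BottleneckAdapter(nn.Module):
    """A small residual bottleneck inserted into a frozen backbone so that only
    a few parameters are updated during fine-tuning."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))

if __name__ == "__main__":
    B, D = 8, 768
    img, txt = torch.randn(B, D), torch.randn(B, D)
    print(pairwise_contrastive_loss(img, txt).item())
    print(BottleneckAdapter()(torch.randn(B, 16, D)).shape)  # (8, 16, 768)
```

Freezing the backbone and updating only such adapters (plus task-specific heads) is one standard way to obtain the kind of fine-tuning savings reported above.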
Related papers
- NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data.
Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer.
We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
- Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z)
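The entry above adapts independently pretrained backbones through prompt tuning. As a point of reference only, here is a generic (non-temporal) prompt-tuning sketch: a few learnable prompt vectors are prepended to the input of a frozen encoder, and only the prompts and a small head receive gradients. The stand-in encoder, dimensions, and classification head are assumptions, not the paper's architecture.

```python
# Generic prompt-tuning sketch, not the temporal-aware method summarized above.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim=768, num_prompts=8, num_classes=10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():         # keep the pretrained backbone frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_emb):                   # token_emb: (B, T, D)
        batch = token_emb.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([prompts, token_emb], dim=1)  # (B, P + T, D)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))             # pool and predict

# Usage with a stand-in frozen encoder:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2)
model = PromptTunedEncoder(encoder)
print(model(torch.randn(4, 16, 768)).shape)         # torch.Size([4, 10])
```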
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [31.281236193979165]
We study the task of extending the large language model (LLM) into a vision-language instruction-following model.
Existing methods typically train a visual adapter to align the representation between a pre-trained vision transformer (ViT) and the LLM by a generative image captioning loss.
We propose CG-VLM that applies Contrastive and Generative alignment objectives to effectively align the representation of ViT and LLM.
arXiv Detail & Related papers (2023-11-29T03:29:46Z)
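The CG-VLM entry above combines a contrastive alignment objective with a generative one. A hedged sketch of such a joint loss is shown below; it reuses the batch-wise contrastive term from the earlier CAVL sketch and adds a next-token captioning term. The weighting and toy shapes are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a combined contrastive + generative (captioning) objective.
import torch
import torch.nn.functional as F

def contrastive_generative_loss(vis_emb, txt_emb, caption_logits, caption_ids,
                                temperature=0.07, gen_weight=1.0):
    """vis_emb, txt_emb: (B, D) pooled vision / text features.
    caption_logits: (B, T, V) decoder predictions; caption_ids: (B, T) targets."""
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    l_contrast = 0.5 * (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets))
    l_caption = F.cross_entropy(caption_logits.flatten(0, 1), caption_ids.flatten())
    return l_contrast + gen_weight * l_caption

# Toy shapes: batch 4, embedding dim 256, caption length 12, vocab size 1000.
loss = contrastive_generative_loss(torch.randn(4, 256), torch.randn(4, 256),
                                   torch.randn(4, 12, 1000),
                                   torch.randint(0, 1000, (4, 12)))
print(loss.item())
```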
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that the proposed VPD enables faster adaptation to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Contrastive Visual-Linguistic Pretraining [48.88553854384866]
Contrastive Visual-Linguistic Pretraining constructs a visual self-supervised loss built upon contrastive learning.
We evaluate it on several downstream tasks, including VQA, GQA and NLVR2.
arXiv Detail & Related papers (2020-07-26T14:26:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.