Contrastive Visual-Linguistic Pretraining
- URL: http://arxiv.org/abs/2007.13135v1
- Date: Sun, 26 Jul 2020 14:26:18 GMT
- Title: Contrastive Visual-Linguistic Pretraining
- Authors: Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao,
Zuohui Fu, Gerard de Melo, Sen Su
- Abstract summary: Contrastive Visual-Linguistic Pretraining constructs a visual self-supervised loss built upon contrastive learning.
We evaluate it on several downstream tasks, including VQA, GQA and NLVR2.
- Score: 48.88553854384866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several multi-modality representation learning approaches such as LXMERT and
ViLBERT have been proposed recently. Such approaches can achieve superior
performance due to the high-level semantic information captured during
large-scale multimodal pretraining. However, as ViLBERT and LXMERT adopt visual
region regression and classification loss, they often suffer from domain gap
and noisy label problems, based on the visual features having been pretrained
on the Visual Genome dataset. To overcome these issues, we propose unbiased
Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual
self-supervised loss built upon contrastive learning. We evaluate CVLP on
several down-stream tasks, including VQA, GQA and NLVR2 to validate the
superiority of contrastive learning on multi-modality representation learning.
Our code is available at: https://github.com/ArcherYunDong/CVLP-.
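The visual self-supervised loss described in the abstract is built on contrastive learning. The paper's exact formulation (its encoders, region features, and training setup) is not reproduced here; as a rough illustration, the standard InfoNCE objective underlying most contrastive losses can be sketched in NumPy. All names in this sketch are hypothetical, not taken from the CVLP code:

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE contrastive loss: the positive key for query i is keys[i];
    all other keys in the batch act as negatives."""
    # L2-normalize so dot products are cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched pairs) as targets
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
# identical views give a near-minimal loss; unrelated pairs score higher
aligned = info_nce_loss(feats, feats)
mismatched = info_nce_loss(feats, rng.normal(size=(8, 16)))
print(aligned, mismatched)
```

Minimizing this objective pulls matched view pairs together and pushes apart the other samples in the batch, which is the mechanism the abstract contrasts with region regression and classification losses.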
Related papers
- Conditional Prototype Rectification Prompt Learning [32.533844163120875]
We propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of base examples and augment limited data effectively.
CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks.
arXiv Detail & Related papers (2024-04-15T15:43:52Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian-based framework of prompt learning, which could alleviate the overfitting issues on few-shot learning application.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- CAVL: Learning Contrastive and Adaptive Representations of Vision and Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to take lots of computation resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well, improving few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment [40.677139679304936]
We propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning through: (a) a new hierarchical cross-modal alignment loss, (b) a new self-supervised scheme based on masked image modeling, and (c) leveraging image-level annotations.
Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding.
arXiv Detail & Related papers (2022-08-29T14:24:08Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.