How Much Can CLIP Benefit Vision-and-Language Tasks?
- URL: http://arxiv.org/abs/2107.06383v1
- Date: Tue, 13 Jul 2021 20:48:12 GMT
- Title: How Much Can CLIP Benefit Vision-and-Language Tasks?
- Authors: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach,
Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
- Abstract summary: CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. We propose to use CLIP as the visual encoder in various V&L models.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
- Score: 121.46042421728016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing Vision-and-Language (V&L) models rely on pre-trained visual
encoders, using a relatively small set of manually-annotated data (as compared
to web-crawled data), to perceive the visual world. However, it has been
observed that large-scale pretraining usually can result in better
generalization performance, e.g., CLIP (Contrastive Language-Image
Pre-training), trained on a massive amount of image-caption pairs, has shown a
strong zero-shot capability on various vision tasks. To further study the
advantage brought by CLIP, we propose to use CLIP as the visual encoder in
various V&L models in two typical scenarios: 1) plugging CLIP into
task-specific fine-tuning; 2) combining CLIP with V&L pre-training and
transferring to downstream tasks. We show that CLIP significantly outperforms
widely-used visual encoders trained with in-domain annotated data, such as
BottomUp-TopDown. We achieve competitive or better results on diverse V&L
tasks, while establishing new state-of-the-art results on Visual Question
Answering, Visual Entailment, and V&L Navigation tasks. We release our code at
https://github.com/clip-vil/CLIP-ViL.
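To make scenario (1) concrete, the sketch below uses CLIP as an off-the-shelf visual (and text) encoder feeding a small task head, e.g. for VQA. It relies on OpenAI's `clip` package; the `VQAHead` fusion module, the answer-vocabulary size, and keeping CLIP frozen are illustrative assumptions for this sketch, not the CLIP-ViL architecture released in the repository above.

```python
# Minimal sketch of scenario (1): plugging CLIP in as the visual encoder for a
# downstream V&L task. Assumes OpenAI's `clip` package and PyTorch; the fusion
# head and hyperparameters below are illustrative, not the CLIP-ViL design.
import torch
import torch.nn as nn
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("RN50", device=device)  # CLIP ResNet-50 backbone

class VQAHead(nn.Module):
    """Hypothetical head: fuse CLIP image/question features and score candidate answers."""
    def __init__(self, feat_dim=1024, num_answers=3129):  # 3129: commonly used VQA v2 answer vocabulary size
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, image_feat, text_feat):
        return self.mlp(torch.cat([image_feat, text_feat], dim=-1))

head = VQAHead().to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
question = clip.tokenize(["What color is the cat?"]).to(device)

with torch.no_grad():  # CLIP kept frozen here for simplicity; it can also be fine-tuned end to end
    image_feat = clip_model.encode_image(image).float()
    text_feat = clip_model.encode_text(question).float()

logits = head(image_feat, text_feat)  # shape [1, num_answers]; train the head with a standard VQA loss
```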
Related papers
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings; for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113]
We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
arXiv Detail & Related papers (2022-03-14T15:29:27Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point cloud and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP in low-resource, low-data regimes (a minimal sketch of the idea follows this list).
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
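As a rough illustration of the alignment idea in PointCLIP, the sketch below renders a point cloud into a single-view depth map, encodes it with CLIP's image encoder, and matches it against text prompts for the 3D category names by cosine similarity. The orthographic single-view projection, the prompt template, and the class list are simplifying assumptions; the actual method aggregates multiple projected views.

```python
# Rough sketch of CLIP-based zero-shot point cloud classification in the spirit of
# PointCLIP: project points to a depth map, then match CLIP image/text embeddings.
# The single-view projection and prompt template are simplifying assumptions.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def depth_map_from_points(points, size=224):
    """Orthographic projection of an (N, 3) point cloud onto an XY-plane depth image."""
    xy, z = points[:, :2], points[:, 2]
    xy = (xy - xy.min(0)) / (np.ptp(xy, axis=0) + 1e-8)   # normalize coordinates to [0, 1]
    px = np.clip((xy * (size - 1)).astype(int), 0, size - 1)
    z = (z - z.min()) / (np.ptp(z) + 1e-8)
    depth = np.zeros((size, size), dtype=np.float32)
    for (x, y), d in zip(px, z):
        depth[y, x] = max(depth[y, x], d)                 # keep the topmost point per pixel
    return Image.fromarray((depth * 255).astype(np.uint8)).convert("RGB")

classes = ["airplane", "chair", "lamp", "table"]          # hypothetical category list
prompts = clip.tokenize([f"point cloud depth map of a {c}" for c in classes]).to(device)

points = np.random.rand(2048, 3).astype(np.float32)       # stand-in for a real point cloud
image = preprocess(depth_map_from_points(points)).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))              # per-class zero-shot scores
```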
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.