CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual
Entailment
- URL: http://arxiv.org/abs/2203.07190v1
- Date: Mon, 14 Mar 2022 15:29:27 GMT
- Title: CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual
Entailment
- Authors: Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, Furu Wei
- Abstract summary: We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
- Score: 102.17010696898113
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: CLIP has shown a remarkable zero-shot capability on a wide range of vision
tasks. Previously, CLIP was regarded only as a powerful visual encoder. However,
after being pre-trained with language supervision on a large number of
image-caption pairs, CLIP itself should also have acquired some few-shot
abilities for vision-language tasks. In this work, we empirically show that
CLIP can be a strong vision-language few-shot learner by leveraging the power
of language. We first evaluate CLIP's zero-shot performance on a typical visual
question answering task and demonstrate a zero-shot cross-modality transfer
capability of CLIP on the visual entailment task. Then we propose a
parameter-efficient fine-tuning strategy to boost the few-shot performance on
the VQA task. We achieve competitive zero/few-shot results on the visual
question answering and visual entailment tasks without introducing any
additional pre-training procedure.
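To make the zero-shot setup described above concrete, below is a minimal sketch of treating VQA as image-text matching with CLIP: each candidate answer is inserted into a caption-like prompt and the candidates are ranked by image-text similarity. This is an illustrative approximation only; the prompt template, the answer-candidate set, and the helper function are assumptions, and the paper's exact question-to-statement conversion and its parameter-efficient fine-tuning strategy are not reproduced here.

```python
# Minimal sketch: zero-shot VQA as CLIP image-text matching.
# Assumptions: a fixed list of candidate answers and a simple
# "question: ... answer: ..." prompt template (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_vqa(image: Image.Image, question: str, candidate_answers: list) -> str:
    # Turn each (question, answer) pair into a caption-like statement.
    prompts = [f"question: {question} answer: {ans}" for ans in candidate_answers]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (1, num_candidates): one similarity per prompt.
        logits_per_image = model(**inputs).logits_per_image
    return candidate_answers[logits_per_image.argmax(dim=-1).item()]

# Hypothetical usage:
# image = Image.open("example.jpg")
# print(zero_shot_vqa(image, "What color is the cat?", ["black", "white", "orange"]))
```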
Related papers
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings; for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation [17.443411731092567]
Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle visually diverse environments.
We ask if Vision-Language models like CLIP are also capable of zero-shot language grounding.
arXiv Detail & Related papers (2022-11-30T00:38:54Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP model is an Efficient Continual Learner [26.835116431183625]
We show that a frozen CLIP model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation).
We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks.
arXiv Detail & Related papers (2022-10-06T17:59:15Z)
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
CLIP (Contrastive Language-Image Pre-training), trained on a massive number of image-caption pairs, has shown a strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)