CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
- URL: http://arxiv.org/abs/2201.05729v1
- Date: Sat, 15 Jan 2022 01:54:01 GMT
- Title: CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
- Authors: Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang,
Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan
- Abstract summary: Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
- Score: 85.37552507367175
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive language-image pretraining (CLIP) links vision and language
modalities into a unified embedding space, yielding tremendous potential
for vision-language (VL) tasks. While early concurrent works have begun to
study this potential on a subset of tasks, important questions remain: 1) What
is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in
low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches
without impacting inference or pretraining complexity? In this work, we seek to
answer these questions through two key contributions. First, we introduce an
evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual
Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of
data availability constraints and conditions of domain shift. Second, we
propose an approach, named CLIP Targeted Distillation (CLIP-TD), to
intelligently distill knowledge from CLIP into existing architectures using a
dynamically weighted objective applied to adaptively selected tokens per
instance. Experiments demonstrate that our proposed CLIP-TD leads to
exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to
71.3%) conditions of VCR, while simultaneously improving performance under
standard fully-supervised conditions (up to 2%), achieving state-of-the-art
performance on VCR compared to other single models that are pretrained with
image-text data only. On SNLI-VE, CLIP-TD produces significant gains in
low-shot conditions (up to 6.6%) as well as fully-supervised conditions (up to 3%).
On VQA, CLIP-TD provides improvements in low-shot (up to 9%) and fully-supervised
(up to 1.3%) settings. Finally, CLIP-TD outperforms concurrent works
utilizing CLIP for finetuning, as well as baseline naive distillation
approaches. Code will be made available.
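The abstract describes CLIP-TD as distillation with a dynamically weighted objective applied to adaptively selected tokens per instance. Since no code accompanies this listing, the following is only a minimal sketch of that general idea, not the authors' implementation: the function name targeted_distillation_loss, the inputs student_tokens, teacher_tokens, and token_scores, and the parameters top_k and alpha are all hypothetical, and the top-k token selection plus per-instance reweighting are illustrative stand-ins for the paper's actual selection and weighting scheme.

```python
# Minimal sketch of CLIP-targeted distillation with per-instance token
# selection and dynamic loss weighting. All names below are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F


def targeted_distillation_loss(student_tokens, teacher_tokens, token_scores,
                               top_k=8, alpha=1.0):
    """Distill CLIP token features into a student on a per-instance token subset.

    student_tokens: (B, N, D) token features from the task model
    teacher_tokens: (B, N, D) aligned token features from a frozen CLIP teacher
    token_scores:   (B, N) per-token importance (e.g., teacher attention),
                    used here to adaptively pick which tokens to distill
    """
    B, N, _ = student_tokens.shape
    # In practice a learned projection would map student features to the
    # teacher dimension; here we assume matching dimensions for simplicity.
    assert student_tokens.shape[-1] == teacher_tokens.shape[-1]

    # Adaptively select the top-k most important tokens per instance.
    k = min(top_k, N)
    idx = token_scores.topk(k, dim=1).indices                     # (B, k)
    gather = idx.unsqueeze(-1).expand(-1, -1, teacher_tokens.size(-1))
    s_sel = torch.gather(student_tokens, 1, gather)                # (B, k, D)
    t_sel = torch.gather(teacher_tokens, 1, gather)                # (B, k, D)

    # Cosine-distance distillation on the selected tokens only.
    per_token = 1.0 - F.cosine_similarity(s_sel, t_sel, dim=-1)    # (B, k)

    # Dynamic per-instance weight: instances whose selected tokens already
    # match the teacher are down-weighted (one possible weighting choice).
    with torch.no_grad():
        weight = per_token.mean(dim=1)                             # (B,)
        weight = weight / (weight.mean() + 1e-8)

    return alpha * (weight * per_token.mean(dim=1)).mean()


if __name__ == "__main__":
    # Toy usage with random features standing in for real model outputs.
    B, N, D = 4, 16, 512
    student = torch.randn(B, N, D, requires_grad=True)
    teacher = torch.randn(B, N, D)
    scores = torch.rand(B, N)
    loss = targeted_distillation_loss(student, teacher, scores)
    loss.backward()
    print(float(loss))
```

In this sketch the selection indices and instance weights are computed without gradients, so they only gate which tokens and instances contribute to the distillation term; the paper's actual selection criteria and weighting may differ.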
Related papers
- FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs [24.991684983495542]
This paper proposes FairerCLIP, a general approach for making zero-shot predictions of CLIP more fair and robust to spurious correlations.
We formulate the problem of jointly debiasing CLIP's image and text representations in reproducing kernel Hilbert spaces (RKHSs).
arXiv Detail & Related papers (2024-03-22T19:41:26Z)
- CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that the fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113]
We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
arXiv Detail & Related papers (2022-03-14T15:29:27Z)
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [26.13829720290035]
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision.
We propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants.
arXiv Detail & Related papers (2022-03-11T08:41:00Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and limited data regimes.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate CLIP's reliance on massive amounts of image-text data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)