Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark
of Data, Model, and Supervision
- URL: http://arxiv.org/abs/2203.05796v1
- Date: Fri, 11 Mar 2022 08:41:00 GMT
- Title: Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark
of Data, Model, and Supervision
- Authors: Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao
- Abstract summary: Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision.
We propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants.
- Score: 26.13829720290035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm
to learn visual models from language supervision. While researchers continue to
push the frontier of CLIP, reproducing these works remains challenging. This is
because researchers do not choose consistent training recipes and even use
different data, hampering the fair comparison between different methods. In
this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and
benchmark CLIP and its variants. We conduct a comprehensive analysis of three
key factors: data, supervision, and model architecture. We find several
intuitive and counter-intuitive insights: (1) Data quality has a significant
impact on performance. (2) Certain kinds of supervision have different effects on
Convolutional Networks (ConvNets) and Vision Transformers (ViTs); applying the
proper supervision can effectively improve the performance of CLIP. (3)
Curtailing the text encoder reduces the training cost without much affecting the
final performance. Moreover, we further combine DeCLIP with FILIP, yielding the
strongest variant, DeFILIP. The CLIP-benchmark will be released at:
https://github.com/Sense-GVT/DeCLIP for future CLIP research.
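As a concrete reference for the paradigm analyzed above, the following is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that CLIP-style training optimizes. The toy dimensions, random stand-in embeddings, and fixed temperature are illustrative assumptions, not the benchmark's exact recipe.

```python
# Minimal sketch of the symmetric image-text contrastive (InfoNCE) loss
# used by CLIP-style training. The toy encoders and dimensions below are
# illustrative assumptions, not the exact recipe benchmarked in the paper.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize embeddings so similarity is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits for every image-text pair in the batch.
    logits = image_features @ text_features.t() / temperature

    # Matched pairs sit on the diagonal; contrast them against all others
    # in both the image-to-text and text-to-image directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
img_emb = torch.randn(batch, dim)   # e.g. from a ViT or ConvNet image encoder
txt_emb = torch.randn(batch, dim)   # e.g. from a (possibly shortened) text Transformer
print(clip_contrastive_loss(img_emb, txt_emb).item())
```

Finding (3) above corresponds to shrinking whatever model produces txt_emb (for example, a text Transformer with fewer layers); the objective itself is unchanged.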
Related papers
- Toward a Holistic Evaluation of Robustness in CLIP Models [11.148206692373144]
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential in zero-shot classification.
This work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives.
In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.
arXiv Detail & Related papers (2024-10-02T13:26:17Z)
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies [27.809995478990544]
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets.
We show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality.
We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resources.
arXiv Detail & Related papers (2024-04-12T02:04:34Z)
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34% (a toy combination sketch follows this list).
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z)
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z)
- SLIP: Self-supervision meets Language-Image Pre-training [79.53764315471543]
We study whether self-supervised learning can aid in the use of language supervision for visual representation learning.
We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training.
We find that SLIP enjoys the best of both worlds: better performance than either self-supervision or language supervision alone.
arXiv Detail & Related papers (2021-12-23T18:07:13Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)
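For the multi-backbone combination idea in "Unveiling Backbone Effects in CLIP" above, the following is a minimal sketch of one simple ensemble rule: uniformly averaging zero-shot class probabilities from several backbones. The averaging scheme and the toy inputs are illustrative assumptions, not necessarily the exact method of that paper.

```python
# Minimal sketch of averaging zero-shot class probabilities from several
# CLIP backbones (e.g. a ResNet-50 and a ViT-B/32 image encoder). The
# uniform-averaging rule is an illustrative assumption, not necessarily
# the exact combination scheme of the cited paper.
import torch
import torch.nn.functional as F

def ensemble_zero_shot(per_backbone_logits):
    # Each entry is a (batch, num_classes) tensor of image-to-text-prompt
    # similarity logits produced by one backbone.
    probs = [F.softmax(logits, dim=-1) for logits in per_backbone_logits]
    return torch.stack(probs).mean(dim=0)  # averaged class probabilities

# Toy usage with random logits standing in for two backbones' outputs.
logits_resnet = torch.randn(4, 10)
logits_vit = torch.randn(4, 10)
preds = ensemble_zero_shot([logits_resnet, logits_vit]).argmax(dim=-1)
print(preds)
```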
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.