CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy
within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
- URL: http://arxiv.org/abs/2306.15658v1
- Date: Tue, 27 Jun 2023 17:51:06 GMT
- Title: CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy
within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
- Authors: Xianhang Li, Zeyu Wang, Cihang Xie
- Abstract summary: We present CLIPA-v2, which builds on the inverse scaling law for CLIP training.
We extend the experiments up to the H/14 model with 13B image-text pairs.
Our model achieves an impressive zero-shot ImageNet accuracy of 81.1%.
- Score: 20.495259430916814
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recent work CLIPA presents an inverse scaling law for CLIP training --
whereby the larger the image/text encoders used, the shorter the sequence
length of image/text tokens that can be applied in training. This finding
enables us to train high-performance CLIP models with significantly reduced
computations. Building upon this work, we hereby present CLIPA-v2 with two key
contributions. Technically, we find this inverse scaling law is also applicable
in the finetuning stage, enabling further reduction in computational needs.
Empirically, we explore CLIPA at scale, extending the experiments up to the
H/14 model with ~13B image-text pairs seen during training.
Our results are exciting -- by only allocating a budget of $10,000, our CLIP
model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing
the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing
the computational cost by ~39X. Moreover, with an additional investment of
$4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our
code and models are available at https://github.com/UCSC-VLAA/CLIPA.
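The inverse scaling law at the heart of CLIPA trades encoder size against the image/text token sequence length seen during training. The sketch below (all numbers illustrative, not the paper's actual training recipe) shows why a shorter image token sequence cuts ViT encoder compute roughly in proportion: a ViT patchifies the input into (resolution / patch)^2 tokens, and per-layer cost grows linearly in the token count for the projections/MLP and quadratically for self-attention.

```python
# Minimal sketch of the compute savings from shortening the image token
# sequence. All constants here are illustrative assumptions, not the
# paper's exact configuration.

def num_tokens(resolution: int, patch: int) -> int:
    """Number of image patches fed to the ViT encoder."""
    return (resolution // patch) ** 2

def approx_layer_flops(n: int, d: int) -> float:
    """Rough per-layer FLOP count for a transformer block:
    ~12*n*d^2 for the QKV/output projections and MLP, plus
    ~2*n^2*d for the self-attention score/value products."""
    return 12 * n * d**2 + 2 * n**2 * d

d = 1280                      # hidden width in the ViT-H range (illustrative)
full = num_tokens(224, 14)    # 256 tokens at a full 224px input
short = num_tokens(84, 14)    # 36 tokens at a reduced 84px input

ratio = approx_layer_flops(full, d) / approx_layer_flops(short, d)
print(f"{full} -> {short} tokens: ~{ratio:.1f}x fewer encoder FLOPs per layer")
```

Under this rough model, shrinking the input so the encoder sees 36 instead of 256 tokens yields about a 7x per-layer saving, which is the kind of reduction that lets a large encoder be trained on a small budget and then briefly fine-tuned at full token length.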
Related papers
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight
Inheritance [97.01406871579525]
We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance.
Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
arXiv Detail & Related papers (2023-09-21T17:59:53Z) - An Inverse Scaling Law for CLIP Training [24.961315762769893]
We present a finding that there exists an inverse scaling law for CLIP training.
We are able to successfully train CLIP even with limited computational resources.
arXiv Detail & Related papers (2023-05-11T17:56:09Z) - EVA-CLIP: Improved Training Techniques for CLIP at Scale [20.145062325090286]
We propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance.
arXiv Detail & Related papers (2023-03-27T17:02:21Z) - Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via
Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z) - Attentive Mask CLIP [48.206857783966996]
We propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description.
Our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification, as well as 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy.
arXiv Detail & Related papers (2022-12-16T18:59:12Z) - CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1
Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z) - Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z) - Supervision Exists Everywhere: A Data Efficient Contrastive
Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate CLIP's heavy demand for training data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.