An Inverse Scaling Law for CLIP Training
- URL: http://arxiv.org/abs/2305.07017v2
- Date: Sat, 28 Oct 2023 23:25:10 GMT
- Title: An Inverse Scaling Law for CLIP Training
- Authors: Xianhang Li, Zeyu Wang, Cihang Xie
- Abstract summary: We present a finding that there exists an inverse scaling law for CLIP training.
We are able to successfully train CLIP even with limited computational resources.
- Score: 24.961315762769893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP, one of the pioneering foundation models that connect images and text,
has enabled many recent breakthroughs in computer vision. However, its
associated training cost is prohibitively high, imposing a significant barrier
to its widespread exploration. In this paper, we present a surprising finding
that there exists an inverse scaling law for CLIP training, whereby the larger
the image/text encoders used, the shorter the sequence length of image/text
tokens that can be applied in training. Moreover, we showcase that the strategy
for reducing image/text token length plays a crucial role in determining the
quality of this scaling law.
As a result of this finding, we are able to successfully train CLIP even with
limited computational resources. For example, using 8 A100 GPUs, our CLIP
models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days,
67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling
up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot
accuracy, and meanwhile accelerate the training by ~33x compared to its
OpenCLIP counterpart. By reducing the computation barrier associated with CLIP,
we hope to inspire more research in this field, particularly from academics.
Our code is available at https://github.com/UCSC-VLAA/CLIPA.
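The abstract attributes the training speedup to shrinking the image/text token sequences as the encoders grow, but does not spell out the reduction strategies here. The snippet below is a minimal, hedged sketch (plain PyTorch, not the authors' CLIPA code) of three generic ways to shorten the sequences a CLIP-style model sees during training: resizing images so a ViT produces fewer patches, randomly dropping patch tokens, and truncating the text tokens. All function names and the specific numbers are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' implementation) of common ways to shorten
# the token sequences fed to a CLIP-style model during training.
# Assumes a ViT-style image encoder that turns an HxW image into (H/p)*(W/p) patch tokens.
import torch

def resize_for_fewer_patches(images: torch.Tensor, target_res: int) -> torch.Tensor:
    """Downsample images so the ViT produces (target_res / patch_size)^2 tokens."""
    return torch.nn.functional.interpolate(
        images, size=(target_res, target_res), mode="bilinear", align_corners=False
    )

def randomly_mask_patch_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of patch tokens: (B, N, D) -> (B, floor(keep_ratio*N), D)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independent random permutation per sample; keep the first n_keep indices.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

def truncate_text_tokens(text_ids: torch.Tensor, max_len: int) -> torch.Tensor:
    """Truncate tokenized captions (B, L) to at most max_len tokens."""
    return text_ids[:, :max_len]

if __name__ == "__main__":
    # Example: 224px -> 112px images (4x fewer patches for a /16 ViT),
    # keep 50% of patch tokens, cap captions at 16 tokens.
    images = torch.randn(8, 3, 224, 224)
    small = resize_for_fewer_patches(images, target_res=112)
    patch_tokens = torch.randn(8, 196, 768)   # e.g. ViT-B/16 at 224px
    kept = randomly_mask_patch_tokens(patch_tokens, keep_ratio=0.5)
    text = torch.randint(0, 49408, (8, 77))   # CLIP BPE vocab size, 77-token context
    short_text = truncate_text_tokens(text, max_len=16)
    print(small.shape, kept.shape, short_text.shape)
```

For the exact reduction strategies studied in the paper and their relative quality, see the paper and the CLIPA repository linked above.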
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance [97.01406871579525]
We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance.
Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
arXiv Detail & Related papers (2023-09-21T17:59:53Z) - CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy [20.495259430916814]
We present CLIPA-v2, which builds on the inverse scaling law for CLIP training.
We extend the experiments up to the H/14 model with 13B image-text pairs.
Our model achieves an impressive zero-shot ImageNet accuracy of 81.1%.
arXiv Detail & Related papers (2023-06-27T17:51:06Z) - Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z) - RECLIP: Resource-efficient CLIP by Training with Small Images [44.7490122024181]
We present RECLIP, a simple method that minimizes the computational resource footprint of CLIP (Contrastive Language Image Pretraining).
Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently.
arXiv Detail & Related papers (2023-04-12T17:59:58Z) - Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z) - Attentive Mask CLIP [48.206857783966996]
We propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description (a hedged sketch of this retention idea appears after this list).
Our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification, as well as 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy.
arXiv Detail & Related papers (2022-12-16T18:59:12Z) - ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z) - Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate CLIP's data inefficiency.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
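The Attentive Mask CLIP entry above describes retaining image tokens with a high semantic correlation to the text. Below is a minimal, hedged sketch of one way such retention could look, assuming (for illustration only, not the paper's actual criterion) that the correlation is measured as cosine similarity between projected patch tokens and the pooled text embedding.

```python
# Hedged sketch of similarity-guided token retention, loosely inspired by the
# "attentive token removal" idea summarized above. This is NOT the Attentive
# Mask CLIP implementation; the scoring rule (cosine similarity between each
# patch token and the pooled text embedding) is an assumption for illustration.
import torch
import torch.nn.functional as F

def keep_text_correlated_tokens(
    patch_tokens: torch.Tensor,   # (B, N, D) projected image patch tokens
    text_embeds: torch.Tensor,    # (B, D) pooled text embeddings
    n_keep: int,
) -> torch.Tensor:
    """Retain the n_keep patch tokens most similar to each sample's text embedding."""
    scores = F.cosine_similarity(patch_tokens, text_embeds.unsqueeze(1), dim=-1)  # (B, N)
    top_idx = scores.topk(n_keep, dim=1).indices                                  # (B, n_keep)
    d = patch_tokens.size(-1)
    return torch.gather(patch_tokens, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))

if __name__ == "__main__":
    # Example: keep 49 of 196 tokens (25%) per image.
    tokens = torch.randn(4, 196, 512)
    text = torch.randn(4, 512)
    kept = keep_text_correlated_tokens(tokens, text, n_keep=49)
    print(kept.shape)  # torch.Size([4, 49, 512])
```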
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.