EVA-CLIP: Improved Training Techniques for CLIP at Scale
- URL: http://arxiv.org/abs/2303.15389v1
- Date: Mon, 27 Mar 2023 17:02:21 GMT
- Title: EVA-CLIP: Improved Training Techniques for CLIP at Scale
- Authors: Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao
- Abstract summary: We propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance.
- Score: 20.145062325090286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language-image pre-training, CLIP for short, has gained
increasing attention for its potential in various scenarios. In this paper, we
propose EVA-CLIP, a series of models that significantly improve the efficiency
and effectiveness of CLIP training. Our approach incorporates new techniques
for representation learning, optimization, and augmentation, enabling EVA-CLIP
to achieve superior performance compared to previous CLIP models with the same
number of parameters but significantly smaller training costs. Notably, our
largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples
achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller
EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples
achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open
access and open research, we release the complete suite of EVA-CLIP to the
community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
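The abstract's opening sentence refers to the standard contrastive language-image objective. As a point of reference only, below is a minimal PyTorch sketch of the symmetric InfoNCE loss that CLIP-style models such as EVA-CLIP optimize; it is not taken from the EVA-CLIP repository, and the tensor names and the learnable temperature `logit_scale` follow common open-source convention rather than this paper's exact code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the learnable temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matching image/text pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return 0.5 * (loss_images + loss_texts)
```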
Related papers
- CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling [21.65268178160724]
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs.
We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture.
arXiv Detail & Related papers (2025-02-03T00:04:50Z)
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters [25.729577042823514]
We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date with 18-billion parameters.
With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks.
arXiv Detail & Related papers (2024-02-06T18:59:48Z)
- TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance [97.01406871579525]
We propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
We show that TinyCLIP can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance.
Our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet.
arXiv Detail & Related papers (2023-09-21T17:59:53Z)
- EVA-02: A Visual Representation for Neon Genesis [49.90565085768437]
EVA-02 is a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features.
We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance.
arXiv Detail & Related papers (2023-03-20T17:59:59Z)
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, and 81.1% zero-shot accuracy on the UCF, HMDB, and Kinetics-600 datasets, respectively (the sketch after this list illustrates how such zero-shot top-1 numbers are typically computed).
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset, respectively.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate CLIP's reliance on very large amounts of image-text data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
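Most of the numbers quoted in this list are zero-shot top-1 accuracies. As a rough illustration, and not code from any of the papers above, the sketch below shows the usual CLIP-style zero-shot evaluation protocol: encode one text prompt per class, encode each image, and predict the class with the highest cosine similarity. The `encode_image`/`encode_text` methods, the tokenizer call, and the prompt template are assumptions about a generic CLIP-style checkpoint, not a specific API; adapt them to the model you actually load (e.g. from the EVA-CLIP repository).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top1_accuracy(model, tokenizer, dataloader, class_names,
                            template="a photo of a {}.", device="cuda"):
    """Zero-shot top-1 accuracy for a hypothetical CLIP-style model."""
    # Build one text embedding per class from a simple prompt template.
    prompts = tokenizer([template.format(name) for name in class_names]).to(device)
    text_features = F.normalize(model.encode_text(prompts), dim=-1)

    correct, total = 0, 0
    for images, labels in dataloader:
        image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
        # Cosine similarity against every class prompt; predict the argmax.
        preds = (image_features @ text_features.t()).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

Averages over many benchmarks, such as EVA-CLIP-18B's 27-dataset figure, simply repeat this procedure per dataset with that dataset's class names and prompts.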
This list is automatically generated from the titles and abstracts of the papers on this site.