EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
- URL: http://arxiv.org/abs/2402.04252v1
- Date: Tue, 6 Feb 2024 18:59:48 GMT
- Title: EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
- Authors: Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong
Zhang, Xinlong Wang
- Abstract summary: We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date with 18-billion parameters.
With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks.
- Score: 25.729577042823514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up contrastive language-image pretraining (CLIP) is critical for
empowering both vision and multimodal models. We present EVA-CLIP-18B, the
largest and most powerful open-source CLIP model to date, with 18-billion
parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an
exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized
image classification benchmarks, outperforming its forerunner EVA-CLIP
(5-billion parameters) and other open-source CLIP models by a large margin.
Remarkably, we observe a consistent performance improvement with the model size
scaling of EVA-CLIP, despite maintaining a constant training dataset of
2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly
available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B)
employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the
potential of EVA-style weak-to-strong visual model scaling. With our model
weights made publicly available, we hope to facilitate future research in
vision and multimodal foundation models.
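For concreteness, the headline zero-shot top-1 metric is computed by embedding each test image and a text prompt for every class label with the CLIP image and text encoders, then predicting the class whose prompt is most similar. Below is a minimal sketch using the Hugging Face transformers CLIP API with the public OpenAI checkpoint as a stand-in; the image path and label set are illustrative, and loading EVA-CLIP-18B itself requires the authors' released weights and code.

```python
# Minimal sketch of zero-shot top-1 classification with a CLIP-style model.
# "openai/clip-vit-base-patch32" is a public stand-in checkpoint, not EVA-CLIP-18B.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                   # illustrative benchmark label set
prompts = [f"a photo of a {c}" for c in class_names]  # simple prompt template
image = Image.open("example.jpg")                     # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature
probs = outputs.logits_per_image.softmax(dim=-1)
pred = class_names[probs.argmax(dim=-1).item()]       # zero-shot top-1 prediction
```

Averaging this per-image top-1 accuracy over the 27 benchmark label sets yields the reported 80.7% figure.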
Related papers
- Accessing Vision Foundation Models via ImageNet-1K [51.521125501182816]
Proteus is trained at ImageNet-level cost yet shows surprisingly strong capability, making the training of foundation models accessible to the broader research community.
Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 across 19 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M) with a significantly smaller training set of 1.2M images.
arXiv Detail & Related papers (2024-07-15T00:13:53Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations [67.25974711647481]
Text-to-image (T2I) diffusion models, notably the unCLIP models, achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks.
We introduce ECLIPSE, a novel contrastive learning method that is both parameter and data-efficient.
We demonstrate that the ECLIPSE-trained prior, with only 3.3% of the parameters and trained on a mere 2.8% of the data, surpasses the baseline T2I priors with an average preference score of 71.6%.
arXiv Detail & Related papers (2023-12-07T19:32:39Z)
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training [17.158498267947877]
We introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance.
MobileCLIP uses knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models.
Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
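The "reinforced dataset" idea is to precompute the teachers' outputs offline and store them alongside the training data, so the student never runs the teacher models at train time. The sketch below only illustrates that general pattern; the function name, tensor shapes, and loss form are hypothetical and not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distill_from_reinforced_batch(student_emb: torch.Tensor,
                                  stored_teacher_emb: torch.Tensor,
                                  temperature: float = 0.01) -> torch.Tensor:
    """Hypothetical sketch: match the student's batch similarity structure to a
    teacher's, using teacher embeddings precomputed and stored with the data."""
    s = F.normalize(student_emb, dim=-1)          # [B, D] student embeddings
    t = F.normalize(stored_teacher_emb, dim=-1)   # [B, D] loaded from the dataset
    student_logits = s @ s.t() / temperature
    teacher_logits = t @ t.t() / temperature
    # KL divergence between the two batch-similarity distributions
    return F.kl_div(student_logits.log_softmax(dim=-1),
                    teacher_logits.softmax(dim=-1),
                    reduction="batchmean")
```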
arXiv Detail & Related papers (2023-11-28T18:55:42Z)
- EVA-CLIP: Improved Training Techniques for CLIP at Scale [20.145062325090286]
We propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance.
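For reference, the core objective shared by CLIP and the EVA-CLIP series is a symmetric contrastive (InfoNCE) loss over image-text pairs within a batch. A minimal PyTorch sketch, assuming the embeddings are already L2-normalized and logit_scale is the learned inverse temperature:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired, L2-normalized embeddings."""
    logits = logit_scale * image_emb @ text_emb.t()       # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # and each text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```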
arXiv Detail & Related papers (2023-03-27T17:02:21Z)
- EVA-02: A Visual Representation for Neon Genesis [49.90565085768437]
EVA-02 is a next-generation Transformer-based visual representation pre-trained via masked image modeling to reconstruct strong and robust language-aligned vision features.
We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance.
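A rough sketch of the EVA-style pretraining signal as commonly described: a student ViT regresses a frozen CLIP teacher's patch features at masked positions. The tensor shapes and the cosine-style loss below are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_feature_reconstruction_loss(student_patch_feats: torch.Tensor,
                                       clip_teacher_feats: torch.Tensor,
                                       mask: torch.Tensor) -> torch.Tensor:
    """Regress a frozen CLIP teacher's patch features at masked positions only.
    student_patch_feats, clip_teacher_feats: [B, N, D]; mask: [B, N] boolean."""
    s = F.normalize(student_patch_feats[mask], dim=-1)  # [K, D] masked student feats
    t = F.normalize(clip_teacher_feats[mask], dim=-1)   # [K, D] masked teacher feats
    # negative cosine similarity, averaged over masked patches
    return -(s * t).sum(dim=-1).mean()
```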
arXiv Detail & Related papers (2023-03-20T17:59:59Z)
- Face Recognition in the age of CLIP & Billion image datasets [0.0]
We evaluate the performance of various CLIP models as zero-shot face recognizers.
We also investigate the robustness of CLIP models against data poisoning attacks.
arXiv Detail & Related papers (2023-01-18T05:34:57Z)
- Rethinking Mobile Block for Efficient Attention-based Models [60.0312591342016]
This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance.
The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized in attention-based studies.
We extend the CNN-based IRB to attention-based models and abstract a one-residual Meta Mobile Block (MMB) for lightweight model design.
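As a rough illustration of that design pattern (not the paper's exact block), a one-residual mobile block can be written as expand, token mixing, project, with the mixer swapped between a depthwise convolution (CNN flavor) and attention (transformer flavor). Everything below is a hypothetical sketch.

```python
import torch
from torch import nn

class MetaMobileBlock(nn.Module):
    """Hypothetical one-residual block: expand -> token mixing -> project."""
    def __init__(self, dim: int, expansion: int = 4, use_attention: bool = False):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        if use_attention:
            self.mixer = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        else:
            # depthwise conv as the lightweight CNN-style token mixer
            self.mixer = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.use_attention = use_attention
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, C, H, W]
        h = self.expand(x)
        if self.use_attention:
            b, c, H, W = h.shape
            tokens = h.flatten(2).transpose(1, 2)          # [B, H*W, C]
            tokens, _ = self.mixer(tokens, tokens, tokens)
            h = tokens.transpose(1, 2).reshape(b, c, H, W)
        else:
            h = self.mixer(h)
        return x + self.project(h)                         # single residual connection
```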
arXiv Detail & Related papers (2023-01-03T15:11:41Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose ViTAE, a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) of convolutions.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
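A hypothetical sketch of the idea behind a spatial pyramid reduction module: parallel dilated convolutions capture multi-scale context while downsampling, and the fused feature map is flattened into transformer tokens. The class name and hyperparameters are assumptions, not the paper's implementation.

```python
import torch
from torch import nn

class PyramidReductionStem(nn.Module):
    """Hypothetical sketch: multi-scale dilated convs, then flatten to tokens."""
    def __init__(self, in_ch: int = 3, dim: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        # padding == dilation keeps all branch outputs the same spatial size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=4, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(dim * len(dilations), dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [B, 3, H, W]
        feats = [branch(x) for branch in self.branches]
        fused = self.fuse(torch.cat(feats, dim=1))          # multi-scale fusion
        return fused.flatten(2).transpose(1, 2)             # tokens: [B, N, dim]
```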
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.