EVA-02: A Visual Representation for Neon Genesis
- URL: http://arxiv.org/abs/2303.11331v2
- Date: Wed, 22 Mar 2023 14:10:37 GMT
- Title: EVA-02: A Visual Representation for Neon Genesis
- Authors: Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
- Abstract summary: EVA-02 is a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features.
We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance.
- Score: 49.90565085768437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We launch EVA-02, a next-generation Transformer-based visual representation
pre-trained to reconstruct strong and robust language-aligned vision features
via masked image modeling. With an updated plain Transformer architecture as
well as extensive pre-training from an open & accessible giant CLIP vision
encoder, EVA-02 demonstrates superior performance compared to prior
state-of-the-art approaches across various representative vision tasks, while
utilizing significantly fewer parameters and compute budgets. Notably, using
exclusively publicly accessible training data, EVA-02 with only 304M parameters
achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set.
Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on
ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with
only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02
variants in various model sizes, ranging from 6M to 304M parameters, all with
impressive performance. To facilitate open access and open research, we release
the complete suite of EVA-02 to the community at
https://github.com/baaivision/EVA/tree/master/EVA-02.
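The pre-training objective described in the abstract (masked image modeling that reconstructs language-aligned features from a CLIP vision encoder) can be summarized with a short sketch. The snippet below is a minimal illustration in PyTorch, not the released EVA-02 implementation: the `student`/`teacher` interfaces, the masking ratio, and the negative-cosine-similarity loss are assumptions made for clarity, and the official code at the GitHub link above should be treated as the reference.

```python
# Minimal sketch of masked image modeling against a frozen CLIP vision teacher.
# NOT the official EVA-02 code; module interfaces, shapes, and the mask ratio
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def mim_step(student, teacher, images, mask_ratio=0.4):
    """One pre-training step: predict the teacher's patch features
    at masked positions from the visible patches."""
    with torch.no_grad():
        target = teacher(images)            # (B, N, D) CLIP patch features

    B, N, _ = target.shape
    # Randomly choose which patch tokens are masked for each image.
    mask = torch.rand(B, N, device=images.device) < mask_ratio  # (B, N) bool

    # The student ViT sees the image plus the mask and predicts a feature
    # for every patch position (masked positions use a learned mask token).
    pred = student(images, mask)            # (B, N, D)

    # Negative cosine similarity, computed on masked positions only.
    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(target[mask], dim=-1)
    loss = -(pred * target).sum(dim=-1).mean()
    return loss
```

In the actual recipe, the teacher is the open CLIP vision encoder mentioned in the abstract and the student is the updated plain Transformer that becomes EVA-02 after pre-training; this sketch only captures the feature-reconstruction objective.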
Related papers
- EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation [15.590340765703893]
We present EgoPoseFormer, a transformer-based model for stereo egocentric human pose estimation.
Our approach addresses the main challenge of joint invisibility caused by self-occlusion or a limited field of view (FOV) of head-mounted cameras.
We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches.
arXiv Detail & Related papers (2024-03-26T20:02:48Z) - EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters [25.729577042823514]
We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date with 18-billion parameters.
With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks.
arXiv Detail & Related papers (2024-02-06T18:59:48Z) - Sparse then Prune: Toward Efficient Vision Transformers [2.191505742658975]
Vision Transformer is a deep learning model inspired by the success of the Transformer model in Natural Language Processing.
Applying Sparse Regularization to Vision Transformers can increase accuracy by 0.12%.
Applying pruning to models with Sparse Regularization yields even better results.
arXiv Detail & Related papers (2023-07-22T05:43:33Z) - EVA-CLIP: Improved Training Techniques for CLIP at Scale [20.145062325090286]
We propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance.
arXiv Detail & Related papers (2023-03-27T17:02:21Z) - EVA: Exploring the Limits of Masked Visual Representation Learning at
Scale [46.952339726872374]
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the train-from-scratch counterpart with far fewer samples and less compute.
arXiv Detail & Related papers (2022-11-14T18:59:52Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs, and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - Better plain ViT baselines for ImageNet-1k [100.80574771242937]
It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data.
This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models.
arXiv Detail & Related papers (2022-05-03T15:54:44Z) - ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z) - CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences of its use.