EVA: Exploring the Limits of Masked Visual Representation Learning at
Scale
- URL: http://arxiv.org/abs/2211.07636v1
- Date: Mon, 14 Nov 2022 18:59:52 GMT
- Title: EVA: Exploring the Limits of Masked Visual Representation Learning at
Scale
- Authors: Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang,
Tiejun Huang, Xinlong Wang, Yue Cao
- Abstract summary: We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct masked-out, image-text-aligned vision features conditioned on visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the from-scratch counterpart with far fewer samples and less compute.
- Score: 46.952339726872374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We launch EVA, a vision-centric foundation model to explore the
limits of visual representation at scale using only publicly accessible data.
EVA is a vanilla ViT pre-trained to reconstruct masked-out, image-text-aligned
vision features conditioned on visible image patches. Via this pretext task,
we can efficiently scale EVA up to one billion parameters and set new records
on a broad range of representative vision downstream tasks, such as image
recognition, video action recognition, object detection, instance
segmentation, and semantic segmentation, without heavy supervised training.
Moreover, we observe that quantitative changes in scaling EVA result in
qualitative changes in transfer learning performance that are not present in
other models. For instance, EVA takes a great leap on the challenging
large-vocabulary instance segmentation task: our model achieves almost the
same state-of-the-art performance on the LVISv1.0 dataset, which has over a
thousand categories, as on the COCO dataset, which has only eighty categories.
Beyond a pure vision encoder, EVA can also serve as a vision-centric,
multi-modal pivot to connect images and text. We find that initializing the
vision tower of a giant CLIP from EVA can greatly stabilize training and
outperform the from-scratch counterpart with far fewer samples and less
compute, providing a new direction for scaling up and accelerating the costly
training of multi-modal foundation models. To facilitate future research, we
will release all the code and models at https://github.com/baaivision/EVA.
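
The pretext task described above can be made concrete with a short sketch. The following is a minimal, self-contained PyTorch illustration, not the released implementation: a toy ViT student receives patch embeddings in which a random subset is replaced by a learnable mask token, and it regresses the features of a frozen CLIP-style vision teacher at the masked positions under a negative cosine-similarity loss. The toy model sizes, the 40% mask ratio, the random (rather than block-wise) masking, the loss form, and the placeholder teacher are all illustrative assumptions; the repository linked above contains the authoritative training code.

```python
# Minimal sketch of EVA-style pre-training: a student ViT reconstructs
# CLIP-aligned vision features at masked patch positions, conditioned on
# the visible patches. The tiny encoder and the random "CLIP teacher"
# below are placeholders; the real models are far larger.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCHES, DIM, TEACHER_DIM = 196, 256, 512  # 14x14 patches (assumed toy sizes)

class TinyViT(nn.Module):
    """Stand-in for the vanilla ViT student used by EVA."""
    def __init__(self, dim=DIM, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)       # flattened 16x16 RGB patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, PATCHES, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, TEACHER_DIM)               # regress teacher features

    def forward(self, patches, mask):
        x = self.patch_embed(patches)
        # Replace masked positions with the learnable mask token.
        x = torch.where(mask[..., None], self.mask_token.expand_as(x), x)
        x = self.blocks(x + self.pos_embed)
        return self.head(x)

def eva_loss(student, clip_vision_teacher, patches, mask_ratio=0.4):
    """Negative cosine similarity between predicted and teacher features
    at the masked positions (the exact loss form is an assumption)."""
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    with torch.no_grad():                        # teacher is frozen
        target = clip_vision_teacher(patches)    # (B, N, TEACHER_DIM)
    pred = student(patches, mask)
    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(target[mask], dim=-1)
    return 1.0 - (pred * target).sum(-1).mean()

if __name__ == "__main__":
    student = TinyViT()
    teacher = nn.Linear(3 * 16 * 16, TEACHER_DIM)  # placeholder for a CLIP vision tower
    patches = torch.randn(2, PATCHES, 3 * 16 * 16)
    loss = eva_loss(student, teacher, patches)
    loss.backward()
    print(f"pre-training loss: {loss.item():.4f}")
```

The pre-trained student weights are also what the abstract refers to when initializing the vision tower of a giant CLIP: in practice this amounts to loading the EVA checkpoint into the CLIP image encoder before contrastive training begins, rather than training that encoder from scratch.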
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks with a significantly reduced parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE [66.48689706116808]
Efficient Vision-languagE (EVE) is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (a simplified sketch of this routing idea appears after this list).
Eve achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
arXiv Detail & Related papers (2023-08-23T07:36:30Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)
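
The modality-aware sparse Mixture-of-Experts mentioned in the EVE entry above can also be sketched in a few lines. The code below is a generic, simplified illustration of the idea rather than EVE's actual architecture: each token is dispatched to a single expert MLP by a router that sees the token together with a one-hot modality flag. The expert count, the top-1 routing, and the absence of capacity limits and load-balancing terms are assumptions made for brevity.

```python
# Simplified sketch of a modality-aware sparse MoE feed-forward layer:
# each token is dispatched to one expert MLP chosen by a router that
# sees both the token and a modality flag (0 = vision, 1 = text).
# Expert count, routing rule, and capacity handling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, hidden=1024):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        # Router input: token features concatenated with a one-hot modality flag.
        self.router = nn.Linear(dim + 2, num_experts)

    def forward(self, x, modality):
        # x: (batch, seq, dim); modality: (batch, seq) with 0 = vision, 1 = text
        flag = F.one_hot(modality, num_classes=2).to(x.dtype)
        logits = self.router(torch.cat([x, flag], dim=-1))
        weights = logits.softmax(dim=-1)          # (batch, seq, num_experts)
        top_w, top_idx = weights.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e                    # tokens routed to expert e
            if sel.any():
                out[sel] = expert(x[sel]) * top_w[sel].unsqueeze(-1)
        return out

if __name__ == "__main__":
    layer = ModalityAwareMoE()
    tokens = torch.randn(2, 10, 256)
    modality = torch.randint(0, 2, (2, 10))
    print(layer(tokens, modality).shape)  # torch.Size([2, 10, 256])
```

In a production sparse MoE, tokens are dispatched in batches per expert with capacity limits and an auxiliary load-balancing loss; the per-expert loop above trades efficiency for readability.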
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.