The effectiveness of MAE pre-pretraining for billion-scale pretraining
- URL: http://arxiv.org/abs/2303.13496v3
- Date: Thu, 25 Jan 2024 03:20:12 GMT
- Title: The effectiveness of MAE pre-pretraining for billion-scale pretraining
- Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav
Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph
Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
- Abstract summary: We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
- Score: 65.98338857597935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in
computer vision for visual recognition tasks. Typically, state-of-the-art
foundation models are pretrained using large scale (weakly) supervised datasets
with billions of images. We introduce an additional pre-pretraining stage that
is simple and uses the self-supervised MAE technique to initialize the model.
While MAE has only been shown to scale with the size of models, we find that it
scales with the size of the training dataset as well. Thus, our MAE-based
pre-pretraining scales with both model and data size making it applicable for
training foundation models. Pre-pretraining consistently improves both the
model convergence and the downstream transfer performance across a range of
model scales (millions to billions of parameters), and dataset sizes (millions
to billions of images). We measure the effectiveness of pre-pretraining on 10
different visual recognition tasks spanning image classification, video
recognition, object detection, low-shot classification and zero-shot
recognition. Our largest model achieves new state-of-the-art results on
iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and
zero-shot transfer on Food-101 (96.2%). Our study reveals that model
initialization plays a significant role, even for web-scale pretraining with
billions of images, and our models are available publicly.
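The abstract outlines a three-stage recipe: self-supervised MAE pre-pretraining, weakly supervised pretraining on billions of images initialized from the MAE weights, and transfer to downstream tasks. Below is a minimal, hedged sketch of the first stage's masked-reconstruction objective in PyTorch; the tiny per-patch MLP encoder, mean-pooled decoder input, and fixed 75% mask ratio are illustrative assumptions, not the paper's released architecture or code.

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """(B, 3, H, W) -> (B, N, patch*patch*3) non-overlapping patches."""
    b, c, _, _ = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # B, C, H/p, W/p, p, p
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)

def mae_step(images: torch.Tensor, encoder: nn.Module, decoder: nn.Module,
             mask_ratio: float = 0.75) -> torch.Tensor:
    """One MAE-style step: encode only the visible patches, reconstruct the masked ones."""
    patches = patchify(images)                                    # (B, N, D)
    b, n, d = patches.shape
    keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n, device=images.device).argsort(dim=1)  # random mask per image
    vis_idx, msk_idx = perm[:, :keep], perm[:, keep:]
    visible = torch.gather(patches, 1, vis_idx.unsqueeze(-1).expand(-1, -1, d))
    masked = torch.gather(patches, 1, msk_idx.unsqueeze(-1).expand(-1, -1, d))
    latent = encoder(visible).mean(dim=1)                         # pooled summary of visible patches
    pred = decoder(latent).view(b, n - keep, d)                   # predict pixels of masked patches
    return ((pred - masked) ** 2).mean()                          # loss on masked patches only, as in MAE

# Toy stand-ins for the encoder/decoder; a real MAE uses transformers over patch tokens.
encoder = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 256))
decoder = nn.Linear(256, 147 * 768)   # 147 masked patches for 224x224 input, 16x16 patches, 75% mask
loss = mae_step(torch.randn(2, 3, 224, 224), encoder, decoder)
loss.backward()
```

The point of pre-pretraining is only the hand-off: the encoder weights trained this way initialize the subsequent weakly supervised pretraining stage in place of a random initialization, before the usual finetuning or transfer to downstream tasks.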
Related papers
- Scalable Pre-training of Large Autoregressive Image Models [65.824197847617]
This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective.
We highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks.
arXiv Detail & Related papers (2024-01-16T18:03:37Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce general-purpose visual features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
We study data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations.
We find that the validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.