On Data Scaling in Masked Image Modeling
- URL: http://arxiv.org/abs/2206.04664v1
- Date: Thu, 9 Jun 2022 17:58:24 GMT
- Title: On Data Scaling in Masked Image Modeling
- Authors: Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu
- Abstract summary: Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
The study covers data scales from 10% of ImageNet-1K to full ImageNet-22K, model sizes from 49 million to 1 billion parameters, and training lengths from 125K to 500K iterations.
The validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
- Score: 36.00347416479826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important goal of self-supervised learning is to enable model pre-training
to benefit from almost unlimited data. However, one method that has recently
become popular, namely masked image modeling (MIM), is suspected to be unable
to benefit from larger data. In this work, we break this misconception through
extensive experiments, with data scales ranging from 10% of ImageNet-1K to
full ImageNet-22K, model sizes ranging from 49 million to 1 billion, and
training lengths ranging from 125K iterations to 500K iterations. Our study
reveals that: (i) Masked image modeling is also demanding on larger data. We
observed that very large models got over-fitted with relatively small data;
(ii) The length of training matters. Large models trained with masked image
modeling can benefit from more data with longer training; (iii) The validation
loss in pre-training is a good indicator to measure how well the model performs
for fine-tuning on multiple tasks. This observation allows us to pre-evaluate
pre-trained models in advance without having to make costly trial-and-error
assessments of downstream tasks. We hope that our findings will advance the
understanding of masked image modeling in terms of scaling ability.
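To make finding (iii) concrete, below is a minimal, hypothetical sketch of a SimMIM-style masked image modeling setup in PyTorch: patches are randomly masked, the model reconstructs raw pixels with an L1 loss on the masked patches, and the same loss is computed on a held-out split as the "pre-training validation loss" that the paper proposes as a proxy for fine-tuning quality. All model names, shapes, and hyperparameters here (TinyMIM, patch size 4, mask ratio 0.6, etc.) are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal MIM sketch (assumed/illustrative, not the paper's exact recipe):
# mask patches, reconstruct pixels with L1 on masked patches, and track the
# same loss on a held-out split as a proxy for downstream fine-tuning quality.
import torch
import torch.nn as nn


class TinyMIM(nn.Module):
    """Toy masked-image-modeling model: patch embedding, small Transformer, linear pixel decoder."""

    def __init__(self, img_size=32, patch=4, dim=128):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.Linear(dim, 3 * patch * patch)  # predict raw pixels per patch

    def forward(self, x, mask):
        # x: (B, 3, H, W); mask: (B, N) with 1 = masked patch
        tokens = self.embed(x).flatten(2).transpose(1, 2)            # (B, N, dim)
        tokens = torch.where(mask.unsqueeze(-1).bool(),
                             self.mask_token.expand_as(tokens), tokens)
        return self.decoder(self.encoder(tokens))                    # (B, N, 3*patch*patch)


def patchify(x, patch):
    # (B, 3, H, W) -> (B, N, 3*patch*patch), matching the decoder output layout
    b, c, h, w = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)


def mim_l1_loss(model, images, mask_ratio=0.6):
    # L1 reconstruction loss computed only on masked patches (SimMIM-style objective).
    b = images.size(0)
    mask = (torch.rand(b, model.num_patches) < mask_ratio).float()
    pred = model(images, mask)
    target = patchify(images, model.patch)
    per_patch = (pred - target).abs().mean(-1)                       # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    model = TinyMIM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    train = torch.rand(64, 3, 32, 32)   # stand-in for real pre-training images
    val = torch.rand(16, 3, 32, 32)     # held-out split for the validation loss

    for step in range(10):
        loss = mim_l1_loss(model, train[torch.randint(0, 64, (8,))])
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():
        val_loss = mim_l1_loss(model, val)
    # Per the paper's observation, a lower pre-training validation loss tends to
    # indicate better fine-tuning performance downstream; here it is just logged.
    print(f"pre-training validation loss: {val_loss.item():.4f}")
```

In a scaling study along the paper's axes, one would sweep the data fraction, model size, and iteration budget while logging this held-out loss, and use it to rank checkpoints before committing to expensive downstream fine-tuning.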
Related papers
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Masked Diffusion Models Are Fast Distribution Learners [32.485235866596064]
Diffusion models are commonly trained to learn all fine-grained visual information from scratch.
We show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution.
Then the pre-trained model can be fine-tuned for various generation tasks efficiently.
arXiv Detail & Related papers (2023-06-20T08:02:59Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
arXiv Detail & Related papers (2023-03-23T17:56:12Z)
- Could Giant Pretrained Image Models Extract Universal Representations? [94.97056702288317]
We present a study of frozen pretrained models when applied to diverse and representative computer vision tasks.
Our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes.
arXiv Detail & Related papers (2022-11-03T17:57:10Z)
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.