Exploring the Limits of Large Scale Pre-training
- URL: http://arxiv.org/abs/2110.02095v1
- Date: Tue, 5 Oct 2021 14:49:00 GMT
- Title: Exploring the Limits of Large Scale Pre-training
- Authors: Samira Abnar and Mostafa Dehghani and Behnam Neyshabur and Hanie
Sedghi
- Abstract summary: Recent developments in large-scale machine learning suggest that improvements in pre-training would transfer favorably to most downstream tasks.
We study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates.
We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship between upstream and downstream performance.
- Score: 34.18163065498687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in large-scale machine learning suggest that by scaling
up data, model size and training time properly, one might observe that
improvements in pre-training would transfer favorably to most downstream tasks.
In this work, we systematically study this phenomenon and establish that, as we
increase the upstream accuracy, the performance of downstream tasks saturates.
In particular, we investigate more than 4800 experiments on Vision
Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten
million to ten billion, trained on the largest scale of available image data
(JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition
tasks. We propose a model for downstream performance that reflects the
saturation phenomenon and captures the nonlinear relationship between the performance of
upstream and downstream tasks. Delving deeper to understand the reasons that
give rise to these phenomena, we show that the saturation behavior we observe
is closely related to the way that representations evolve through the layers of
the models. We showcase an even more extreme scenario where performance on
upstream and downstream tasks is at odds with each other. That is, to achieve better
downstream performance, we need to hurt upstream accuracy.
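As a rough illustration of the kind of relationship the paper models, the sketch below fits a saturating curve to hypothetical (upstream accuracy, downstream accuracy) pairs; the exponential-saturation form, the data points, and the initial guesses are illustrative assumptions, not the paper's actual model or measurements.

```python
# Illustrative only: fit a saturating curve to hypothetical (upstream, downstream)
# accuracy pairs, mimicking the nonlinear, saturating relationship described above.
import numpy as np
from scipy.optimize import curve_fit

def saturating(acc_us, c, a, k):
    """Downstream accuracy approaches the asymptote c as upstream accuracy grows."""
    return c - a * np.exp(-k * acc_us)

# Hypothetical accuracies from a sweep over model sizes (not real measurements).
acc_us = np.array([0.55, 0.62, 0.68, 0.73, 0.77, 0.80, 0.82])
acc_ds = np.array([0.60, 0.68, 0.73, 0.76, 0.78, 0.785, 0.787])

(c, a, k), _ = curve_fit(saturating, acc_us, acc_ds, p0=[0.8, 1.0, 5.0])
print(f"estimated downstream saturation level: {c:.3f}")
```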
Related papers
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we ask whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z)
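To make the "exponentially more data for linear gains" claim above concrete, here is a minimal sketch under an assumed log-linear relationship between concept frequency and zero-shot accuracy; the numbers and the functional form are illustrative, not the paper's fit.

```python
# Illustrative only: if zero-shot accuracy grows linearly in log(concept frequency),
# each fixed accuracy gain requires exponentially more pre-training data.
import numpy as np

concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])        # hypothetical counts
zero_shot_accuracy = np.array([0.12, 0.21, 0.33, 0.41, 0.52])  # hypothetical accuracies

# Linear fit in log-frequency space: acc ~ slope * log10(freq) + intercept.
slope, intercept = np.polyfit(np.log10(concept_frequency), zero_shot_accuracy, deg=1)
print(f"approximate accuracy gain per 10x more data: {slope:.3f}")
```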
- FLODCAST: Flow and Depth Forecasting via Multimodal Recurrent Architectures [31.879514593973195]
We propose a flow and depth forecasting model, trained to jointly forecast both modalities at once.
We train the proposed model to also perform predictions for several timesteps in the future.
We report benefits on the downstream task of segmentation forecasting, injecting our predictions into a flow-based mask-warping framework.
arXiv Detail & Related papers (2023-10-31T16:30:16Z)
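A minimal sketch of one ingredient mentioned above, warping a soft segmentation mask with a predicted flow field; the function name, tensor shapes, and the use of bilinear grid sampling are illustrative assumptions, not the FLODCAST implementation.

```python
# Illustrative only: sample a segmentation mask at locations displaced by a flow field.
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """mask: (B, C, H, W) soft segmentation; flow: (B, 2, H, W) displacement in pixels."""
    b, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()          # (H, W, 2), x before y
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # add per-pixel flow -> (B, H, W, 2)
    # grid_sample expects sampling locations normalized to [-1, 1].
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(mask, grid, mode="bilinear", align_corners=True)

warped = warp_mask(torch.rand(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```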
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Could Giant Pretrained Image Models Extract Universal Representations? [94.97056702288317]
We present a study of frozen pretrained models when applied to diverse and representative computer vision tasks.
Our work addresses which pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes.
arXiv Detail & Related papers (2022-11-03T17:57:10Z)
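A minimal sketch of the frozen setting studied above, assuming a ResNet-50 backbone and a linear task head; the backbone choice, head size, and optimizer are illustrative, not the paper's configuration.

```python
# Illustrative frozen-backbone setup: only the small task head is trained.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")  # downloads pretrained weights
backbone.fc = nn.Identity()              # keep 2048-d features, drop the classifier
for p in backbone.parameters():
    p.requires_grad = False              # freeze the pretrained model
backbone.eval()

head = nn.Linear(2048, 10)               # trainable head for a hypothetical 10-class task
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

features = backbone(torch.rand(4, 3, 224, 224))  # features carry no gradient
logits = head(features)                          # only the head is optimized
```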
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
The study covers data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations.
The validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z)
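For context, a minimal sketch of the random patch masking at the core of MIM pre-training; the patch size and mask ratio below are illustrative defaults, not the settings used in the study.

```python
# Illustrative random patch masking for masked image modeling.
import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.75):
    """Return a boolean mask of shape (B, num_patches); True marks hidden patches."""
    b, _, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    num_masked = int(mask_ratio * num_patches)
    scores = torch.rand(b, num_patches)                # random score per patch
    hidden = scores.argsort(dim=1)[:, :num_masked]     # lowest-scored patches are hidden
    mask = torch.zeros(b, num_patches, dtype=torch.bool)
    mask.scatter_(1, hidden, True)
    return mask

mask = random_patch_mask(torch.rand(2, 3, 224, 224))
print(mask.float().mean())  # roughly 0.75 of patches are masked
```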
- How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
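As a hedged illustration of how such sparse models are commonly obtained, the sketch below applies one-shot global magnitude pruning to a toy network; this is a generic recipe, not necessarily the pruning procedure used in the paper.

```python
# Illustrative global magnitude pruning: zero out the smallest weights across layers.
import torch
import torch.nn as nn

def magnitude_prune(model, sparsity=0.9):
    """Set the fraction `sparsity` of smallest-magnitude Linear/Conv weights to zero."""
    weights = [m.weight for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    all_magnitudes = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_magnitudes, sparsity)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
magnitude_prune(model, sparsity=0.9)
zeros = sum((m.weight == 0).float().mean().item() for m in [model[0], model[2]]) / 2
print(f"fraction of zeroed weights per layer (avg): {zeros:.2f}")
```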
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.