Could Giant Pretrained Image Models Extract Universal Representations?
- URL: http://arxiv.org/abs/2211.02043v1
- Date: Thu, 3 Nov 2022 17:57:10 GMT
- Title: Could Giant Pretrained Image Models Extract Universal Representations?
- Authors: Yutong Lin, Ze Liu, Zheng Zhang, Han Hu, Nanning Zheng, Stephen Lin,
Yue Cao
- Abstract summary: We present a study of frozen pretrained models when applied to diverse and representative computer vision tasks.
Our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes.
- Score: 94.97056702288317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Frozen pretrained models have become a viable alternative to the
pretraining-then-finetuning paradigm for transfer learning. However, with
frozen models there are relatively few parameters available for adapting to
downstream tasks, which is problematic in computer vision where tasks vary
significantly in input/output format and the type of information that is of
value. In this paper, we present a study of frozen pretrained models when
applied to diverse and representative computer vision tasks, including object
detection, semantic segmentation and video action recognition. From this
empirical analysis, our work answers the questions of what pretraining task
fits best with this frozen setting, how to make the frozen setting more
flexible to various downstream tasks, and the effect of larger model sizes. We
additionally examine the upper bound of performance using a giant frozen
pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches
competitive performance on a varied set of major benchmarks with only one
shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object
detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7
top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to
bring greater attention to this promising path of freezing pretrained image
models.
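As a rough illustration of the frozen setting described in the abstract (one shared frozen base network with lightweight task-specific modules), the sketch below freezes a pretrained backbone and trains only a small head. It uses a torchvision ResNet-50 purely as a stand-in for the Swin backbones studied in the paper; the head, label space, and optimizer settings are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the frozen-backbone setting, assuming PyTorch/torchvision.
# ResNet-50 stands in for the (much larger) SwinV2 backbones used in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()          # expose 2048-d features instead of logits
for p in backbone.parameters():
    p.requires_grad = False          # the shared base network stays frozen
backbone.eval()                      # keep normalization statistics fixed too

num_classes = 400                    # e.g. a Kinetics-400-sized label space (illustrative)
head = nn.Linear(2048, num_classes)  # only this task-specific head is trained

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():            # no gradients flow into the frozen backbone
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the paper attaches richer task-specific decoders (e.g. detection and segmentation heads) to the same frozen features; the linear head here is only the simplest instance of that idea.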
Related papers
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring these pretrained models to downstream tasks may encounter a task discrepancy, since pretraining is typically formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders [17.564722905991776]
We present a pipeline of Image to Vector (Img2Vec) for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep-feature MIM learning, achieving strong overall performance on representative vision tasks.
arXiv Detail & Related papers (2023-04-25T03:01:37Z) - The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
arXiv Detail & Related papers (2023-03-23T17:56:12Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We instead direct effort toward efficient adaptation of existing models and propose to augment Language Models with perception.
We show that by freezing more than 99% of the total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (see the sketch after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
We study data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations.
We find that the validation loss in pre-training is a good indicator of how well the model performs after fine-tuning on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z) - Exploring the Limits of Large Scale Pre-training [34.18163065498687]
Recent developments in large-scale machine learning suggest that improvements in pre-training would transfer favorably to most downstream tasks.
We study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates.
We propose a model for downstream performance that reflects the saturation phenomena and captures the nonlinear relationship in performance of upstream and downstream tasks.
arXiv Detail & Related papers (2021-10-05T14:49:00Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this capacity also leads to a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.