AdaptFormer: Adapting Vision Transformers for Scalable Visual
Recognition
- URL: http://arxiv.org/abs/2205.13535v1
- Date: Thu, 26 May 2022 17:56:15 GMT
- Title: AdaptFormer: Adapting Vision Transformers for Scalable Visual
Recognition
- Authors: Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue
Wang, Ping Luo
- Abstract summary: We propose an effective adaptation approach for Transformers, namely AdaptFormer.
It can efficiently adapt pre-trained ViTs to many different image and video tasks.
It increases the ViT's transferability without updating the original pre-trained parameters.
- Score: 39.443380221227166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pre-trained Vision Transformers (ViTs) have achieved great success in
computer vision, adapting a ViT to various image and video tasks is challenging because of
its heavy computation and storage burdens: each model needs to be independently and
comprehensively fine-tuned for each task, which limits its transferability across domains.
To address this challenge, we propose an effective adaptation approach for Transformers,
namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image
and video tasks. It offers several benefits over prior art. Firstly, AdaptFormer introduces
lightweight modules that add less than 2% extra parameters to a ViT, yet it increases the
ViT's transferability without updating the original pre-trained parameters, significantly
outperforming the existing fully fine-tuned models (100% of parameters updated) on action
recognition benchmarks. Secondly, it is plug-and-play in different Transformers and scales
to many visual tasks. Thirdly, extensive experiments on five image and video datasets show
that AdaptFormer largely improves ViTs in the target domains. For example, when updating
just 1.5% extra parameters, it achieves about 10% and 19% relative improvement over the
fully fine-tuned models on Something-Something v2 and HMDB51, respectively. Project
page: http://www.shoufachen.com/adaptformer-page.
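
To make the adaptation mechanism concrete, below is a minimal sketch of the parallel
bottleneck-adapter idea described in the abstract: a small trainable down-projection/
up-projection branch added alongside the frozen MLP of each ViT block, so that only a
small fraction of parameters is updated. This is an illustrative assumption-laden sketch,
not the authors' implementation (see the project page for the official code); the names
AdaptMLP, bottleneck_dim, and scale are introduced here for illustration, and the usage
example assumes the timm library's ViT-Base layout.

```python
# Illustrative sketch (PyTorch) of a parallel bottleneck adapter for a frozen ViT block.
# Only the small down/up projections are trained; the pre-trained MLP stays frozen.
import torch
import torch.nn as nn


class AdaptMLP(nn.Module):
    """Frozen pre-trained MLP block plus a trainable parallel bottleneck branch."""

    def __init__(self, pretrained_mlp: nn.Module, dim: int,
                 bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.mlp = pretrained_mlp            # pre-trained MLP, kept frozen
        self.mlp.requires_grad_(False)
        self.down = nn.Linear(dim, bottleneck_dim)   # trainable down-projection
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)     # trainable up-projection
        self.scale = scale                            # scaling of the adapter branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen branch output plus the scaled adapter branch (parallel design).
        return self.mlp(x) + self.scale * self.up(self.act(self.down(x)))


# Usage sketch: wrap every block's MLP in a timm ViT-Base and count trainable parameters.
# (Assumes timm's ViT attribute names; bottleneck_dim=64 gives roughly 1-2% extra parameters.)
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=False)
for p in vit.parameters():
    p.requires_grad_(False)                  # freeze the entire backbone
for blk in vit.blocks:
    blk.mlp = AdaptMLP(blk.mlp, dim=768)     # insert the lightweight adapter

trainable = sum(p.numel() for p in vit.parameters() if p.requires_grad)
total = sum(p.numel() for p in vit.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

With a 64-dimensional bottleneck, the adapters add about 0.1M parameters per block,
which is consistent with the "less than 2% extra parameters" figure quoted above; the
exact placement and scaling in the paper may differ from this sketch.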
Related papers
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than offset for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-specific priors.
We propose a Vision Transformer Adapter (ViT-Adapter) which can remedy the defects of ViT and achieve comparable performance to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z)
- MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework that achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- TerViT: An Efficient Ternary Vision Transformer [21.348788407233265]
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from high computational and memory costs when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a process challenged by the large loss-surface gap between real-valued and ternary parameters.
arXiv Detail & Related papers (2022-01-20T08:29:19Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all components of ViTs while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.