Convolutional Bypasses Are Better Vision Transformer Adapters
- URL: http://arxiv.org/abs/2207.07039v2
- Date: Mon, 18 Jul 2022 17:48:37 GMT
- Title: Convolutional Bypasses Are Better Vision Transformer Adapters
- Authors: Shibo Jie and Zhi-Hong Deng
- Abstract summary: As the size of Vision Transformers (ViTs) grows exponentially, full finetuning becomes prohibitive because of the storage overhead it incurs.
Recent studies insert lightweight adaptation modules into a pretrained ViT and finetune only these modules while the pretrained weights remain frozen.
In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small number of trainable parameters to adapt the large ViT.
- Score: 14.993203705812654
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The pretrain-then-finetune paradigm has been widely adopted in computer vision. But as the size of Vision Transformers (ViTs) grows exponentially, full finetuning becomes prohibitive owing to the storage overhead of keeping a finetuned copy of the model for every task. Motivated by parameter-efficient transfer learning (PETL) on language transformers, recent studies insert lightweight adaptation modules (e.g., adapter layers or prompt tokens) into a pretrained ViT and finetune only these modules while the pretrained weights are frozen. However, these modules were originally designed to finetune language models; although they port to ViT, their design carries no prior knowledge of visual tasks. In this paper, we propose to construct Convolutional Bypasses (Convpass) in ViT as adaptation modules, introducing only a small number of trainable parameters (less than 0.5% of model parameters) to adapt the large ViT. Unlike other PETL methods, Convpass benefits from the hard-coded inductive bias of convolutional layers and is therefore better suited to visual tasks, especially in the low-data regime. Experimental results on the VTAB-1k benchmark and on few-shot learning datasets show that Convpass outperforms current language-oriented adaptation modules, demonstrating the necessity of tailoring vision-oriented adaptation modules for vision models.
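To make the Convpass idea concrete, below is a minimal PyTorch sketch of a convolutional bypass adapter, assuming a standard ViT with a [CLS] token and a 14x14 grid of patch tokens. The module name, the bottleneck width of 8, and the residual scaling are illustrative assumptions, not the authors' released implementation; the essential point is that tokens are reshaped back to a 2D grid so a small convolution can inject spatial inductive bias while the frozen backbone is bypassed.

```python
import torch
import torch.nn as nn

class ConvBypass(nn.Module):
    """Minimal sketch of a convolutional bypass adapter (hypothetical dimensions)."""

    def __init__(self, dim: int, bottleneck: int = 8, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.down = nn.Linear(dim, bottleneck)   # project tokens to a narrow bottleneck
        self.conv = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, dim)     # project back to the ViT width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + grid*grid, dim), i.e. a [CLS] token followed by patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        h = self.act(self.down(patches))                              # (b, n, bottleneck)
        h = h.transpose(1, 2).reshape(b, -1, self.grid, self.grid)    # tokens -> 2D grid
        h = self.act(self.conv(h))                                    # spatial mixing
        h = h.flatten(2).transpose(1, 2)                              # grid -> tokens
        h = self.up(h)
        cls_out = self.up(self.act(self.down(cls_tok)))               # bypass for [CLS]
        return torch.cat([cls_out, h], dim=1)
```

In use, only the bypass (and typically the task head) receives gradients: a frozen ViT sub-block and the bypass consume the same input and their outputs are summed, e.g. `y = frozen_block(x) + s * bypass(x)` with a small scalar `s`, which keeps the trainable parameters well below 1% of the model.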
Related papers
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves performance comparable to or even better than that of existing PEFT methods.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Selective Feature Adapter for Dense Vision Transformers [30.409313135985528]
The selective feature adapter (SFA) achieves performance comparable to or better than fully fine-tuned models across various dense tasks.
SFA consists of external adapters and internal adapters that operate sequentially over a transformer model.
Experiments show that the dual adapter module, a.k.a. SFA, is essential to achieving the best trade-off on dense vision tasks.
arXiv Detail & Related papers (2023-10-03T07:17:58Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- PVP: Pre-trained Visual Parameter-Efficient Tuning [29.05396521860764]
Large-scale pre-trained transformers have demonstrated remarkable success in various computer vision tasks.
However, it remains highly challenging to fully fine-tune these models for downstream tasks due to their high computational and storage costs.
We propose a Pre-trained Visual Parameter-efficient (PVP) Tuning framework, which first pre-trains the parameter-efficient tuning modules and then leverages these pre-trained modules for parameter-efficient tuning on downstream tasks.
arXiv Detail & Related papers (2023-04-26T15:55:29Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
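The eP-ALM recipe above (freeze more than 99% of parameters, train a single linear projection, prepend one trainable token) can be sketched in a few lines. The snippet below is a simplified illustration rather than the paper's actual architecture; the class name PerceptualPrefix, the dimensions, and the single-point injection of visual features are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """Hypothetical sketch of eP-ALM-style adaptation: one trained projection, one soft token."""

    def __init__(self, vis_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                       # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))    # one trainable token

    def forward(self, vis_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, vis_dim) pooled features from a frozen visual encoder
        # text_embeds: (batch, seq, lm_dim) token embeddings from a frozen language model
        b = text_embeds.size(0)
        visual = self.proj(vis_feats).unsqueeze(1)                   # (b, 1, lm_dim)
        prefix = self.soft_token.expand(b, -1, -1)                   # (b, 1, lm_dim)
        return torch.cat([prefix, visual, text_embeds], dim=1)       # fed to the frozen LM

def freeze_backbones(*models: nn.Module) -> None:
    """Freeze the visual encoder and the LM so that over 99% of parameters stay fixed."""
    for m in models:
        for p in m.parameters():
            p.requires_grad = False
```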
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs on few-shot learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformer, namely AdaptFormer.
It can efficiently adapt pre-trained ViTs to many different image and video tasks.
It is able to increase the ViT's transferability without updating its original pre-trained parameters.
arXiv Detail & Related papers (2022-05-26T17:56:15Z)
- Visual Prompt Tuning [74.5309408185523]
This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision.
VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen.
arXiv Detail & Related papers (2022-03-23T01:17:16Z)
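As summarized above, VPT leaves the backbone frozen and learns only a small set of prompt tokens in the input space. A minimal sketch of the shallow variant might look as follows; the wrapper name, the prompt count, and the assumption that the encoder consumes an already-embedded token sequence are illustrative choices, not the official implementation (the deep variant additionally inserts prompts at every layer).

```python
import torch
import torch.nn as nn

class VisualPromptWrapper(nn.Module):
    """Sketch of shallow visual prompt tuning around a frozen ViT encoder (names assumed)."""

    def __init__(self, encoder: nn.Module, dim: int = 768, num_prompts: int = 10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():            # freeze the entire backbone
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, dim) embedded [CLS] + patch tokens
        b = tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        x = torch.cat([cls_tok, prompts, patches], dim=1)   # prompts inserted after [CLS]
        return self.encoder(x)                              # frozen transformer blocks
```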
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
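The ViTAE summary above describes a convolution branch running in parallel with multi-head self-attention, with the two branches fused before the feed-forward network. A rough sketch of such a layer is shown below; the depthwise-plus-pointwise convolution, additive fusion, and dimensions are assumptions for illustration rather than the exact ViTAE block, which also uses spatial pyramid reduction modules for multi-scale tokenization.

```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Sketch of a layer with a conv branch parallel to self-attention (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 4, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(                       # depthwise + pointwise convolution
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, grid*grid, dim) patch tokens (no [CLS] token, for simplicity)
        b, n, d = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                                   # attention branch
        conv_in = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)    # tokens -> grid
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)           # grid -> tokens
        fused = x + attn_out + conv_out                  # parallel branches fused
        return fused + self.ffn(self.norm2(fused))       # fused features fed to the FFN
```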
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from "Vision-friendly Transformer".
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.