Selective Feature Adapter for Dense Vision Transformers
- URL: http://arxiv.org/abs/2310.01843v1
- Date: Tue, 3 Oct 2023 07:17:58 GMT
- Title: Selective Feature Adapter for Dense Vision Transformers
- Authors: Xueqing Deng, Qi Fan, Xiaojie Jin, Linjie Yang and Peng Wang
- Abstract summary: The selective feature adapter (SFA) achieves comparable or better performance than fully fine-tuned models across various dense tasks.
SFA consists of external adapters and internal adapters that are applied sequentially over a transformer model.
Experiments show that the dual adapter module, a.k.a. SFA, is essential to achieving the best trade-off on dense vision tasks.
- Score: 30.409313135985528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained transformer models, e.g., Swin Transformer, is
successful in numerous downstream dense prediction vision tasks. However,
one major issue is the cost/storage of their huge number of parameters, which
becomes increasingly challenging to handle as the number of vision tasks grows.
In this paper, we propose an effective approach to alleviate this issue,
namely the selective feature adapter (SFA). It achieves state-of-the-art (SoTA)
performance under any given budget of trainable parameters, and demonstrates
comparable or better performance than fully fine-tuned models across various
dense tasks. Specifically, SFA consists of external adapters and internal
adapters that are applied sequentially over a transformer model. For external
adapters, we properly select the placement and number of additional multilayer
perceptron (MLP) modules. For internal adapters, we transform a few
task-important parameters inside the transformer, which are automatically
discovered through a simple yet effective lottery ticket algorithm. Our
experiments show that the dual adapter module, a.k.a. SFA, is essential to
achieving the best trade-off on dense vision tasks, such as segmentation,
detection and depth estimation, outperforming other adapters with a single
module.
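The two components described in the abstract can be pictured with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are invented here, and a simple magnitude criterion stands in for the paper's lottery ticket algorithm for discovering task-important parameters.

```python
# Minimal sketch of an SFA-style dual adapter (illustrative, not the authors' code).
# Assumptions: the external adapter is a residual bottleneck MLP appended after a
# chosen transformer block; the internal adapter is approximated by keeping only a
# small, magnitude-selected fraction of backbone weights trainable, standing in
# for the paper's lottery-ticket discovery of task-important parameters.
import torch
import torch.nn as nn


class ExternalAdapter(nn.Module):
    """Residual MLP inserted after a selected transformer block (hypothetical name)."""

    def __init__(self, dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapted block close to the pre-trained one.
        return x + self.mlp(x)


def build_internal_masks(backbone: nn.Module, keep_ratio: float = 0.01) -> dict:
    """Select the top-magnitude fraction of each weight tensor as the trainable subset."""
    masks = {}
    for name, param in backbone.named_parameters():
        k = max(1, int(keep_ratio * param.numel()))
        threshold = param.detach().abs().flatten().topk(k).values.min()
        masks[name] = (param.detach().abs() >= threshold).float()
    return masks


def mask_backbone_grads(backbone: nn.Module, masks: dict) -> None:
    """Zero gradients outside the selected subset so only it receives updates."""
    for name, param in backbone.named_parameters():
        if param.grad is not None:
            param.grad.mul_(masks[name])
```

Under this sketch, `mask_backbone_grads` would be called after `loss.backward()` and before `optimizer.step()`, so the trainable-parameter budget is roughly `keep_ratio` of the backbone plus the external MLP adapters.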
Related papers
- PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context [9.235131774252416]
We show how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers.
Our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.
arXiv Detail & Related papers (2024-10-23T08:24:47Z) - Mini but Mighty: Finetuning ViTs with Mini Adapters [7.175668563148084]
Adapters perform poorly when their dimension is small (a generic bottleneck adapter of this kind is sketched after this list).
We propose MiMi, a training framework that addresses this issue.
Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters.
arXiv Detail & Related papers (2023-11-07T10:41:27Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism that improves adapter capacity through two key techniques, without increasing parameters or computational cost.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-related priors.
We propose a Vision Transformer Adapter (ViT-Adapter) which can remedy the defects of ViT and achieve comparable performance to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z) - AdapterBias: Parameter-efficient Token-dependent Representation Shift
for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks and evaluate them quantitatively on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - Plug-In Inversion: Model-Agnostic Inversion for Vision with Data
Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
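Several of the adapter papers above (MiMi, AdaMix, ViT-Adapter, AdapterBias) build on the same basic idea of a bottleneck adapter, where the "dimension of adapters" mentioned in the MiMi entry is the width of the down-projection. The sketch below is a generic illustration of that baseline, with invented names and default sizes, not code from any of the listed papers.

```python
# Generic bottleneck adapter: the baseline refined by several papers listed above.
# Names and the default bottleneck width are illustrative assumptions.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, and add back residually."""

    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # the "adapter dimension"
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

Because only the small `down`/`up` projections are trained while the backbone stays frozen, the added parameter count scales with the bottleneck width, which is the quantity these papers trade off against accuracy.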
This list is automatically generated from the titles and abstracts of the papers on this site.