Vision Transformer Adapter for Dense Predictions
- URL: http://arxiv.org/abs/2205.08534v2
- Date: Wed, 18 May 2022 01:27:12 GMT
- Title: Vision Transformer Adapter for Dense Predictions
- Authors: Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
- Abstract summary: Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-specific prior information.
We propose a Vision Transformer Adapter (ViT-Adapter), which remedies the defects of ViT and achieves performance comparable to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
- Score: 57.590511173416445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates a simple yet powerful adapter for Vision Transformer
(ViT). Unlike recent vision transformers that introduce vision-specific
inductive biases into their architectures, the plain ViT achieves inferior
performance on dense prediction tasks because it lacks image-specific prior
information. To solve
this issue, we propose a Vision Transformer Adapter (ViT-Adapter), which can
remedy the defects of ViT and achieve comparable performance to vision-specific
models by introducing inductive biases via an additional architecture.
Specifically, the backbone in our framework is a vanilla transformer that can
be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a
modality-specific adapter is used to introduce prior information about the data
and tasks into the model, making it suitable for these tasks. We verify the
effectiveness of our ViT-Adapter on multiple downstream tasks, including object
detection, instance segmentation, and semantic segmentation. Notably, when
using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO
test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic
segmentation, our ViT-Adapter-L establishes a new state-of-the-art of 60.5 mIoU
on ADE20K val, 0.6 points higher than SwinV2-G. We hope that the proposed
ViT-Adapter could serve as an alternative to vision-specific transformers and
facilitate future research. The code and models will be released at
https://github.com/czczup/ViT-Adapter.
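The abstract describes the overall recipe at a level a short sketch can make concrete: keep a plain ViT backbone and let a separate adapter inject image priors into its tokens when fine-tuning for dense tasks. Below is a minimal, hypothetical PyTorch sketch of that general idea, using a small convolutional spatial-prior module whose features are simply added to the patch tokens. All module names, sizes, and the add-based fusion are illustrative assumptions, not the released ViT-Adapter implementation (see the repository above for the actual, more involved design).

```python
# Minimal, illustrative sketch of "plain ViT + adapter that injects image priors".
# Names, sizes, and add-based fusion are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Standard ViT-style patch embedding: 16x16 conv, then flatten to tokens."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)      # (B, N, C)


class SpatialPriorModule(nn.Module):
    """A small conv stem that supplies local image priors at overall stride 16."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, dim, 3, stride=2, padding=1),     # stride 16 overall
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)                                  # (B, C, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)               # (B, N, C), same grid as patches


class AdapterAugmentedViT(nn.Module):
    """Plain ViT blocks plus an adapter branch that injects conv priors into the tokens."""

    def __init__(self, vit_blocks: nn.Module, dim: int = 384):
        super().__init__()
        self.patch_embed = PatchEmbed(dim)
        self.vit_blocks = vit_blocks                         # any (B, N, C) -> (B, N, C) encoder
        self.spm = SpatialPriorModule(dim)
        self.gamma = nn.Parameter(torch.zeros(dim))          # learn how much prior to inject

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x)                         # (B, N, C)
        priors = self.spm(x)                                 # (B, N, C)
        tokens = tokens + self.gamma * priors                # inject image priors
        return self.vit_blocks(tokens)                       # dense heads consume these tokens


# Quick shape check with a stand-in encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=2
)
model = AdapterAugmentedViT(encoder, dim=384)
out = model(torch.randn(1, 3, 224, 224))                     # -> (1, 196, 384)
```

The zero-initialized gamma keeps the adapter branch silent at the start of fine-tuning, so the pre-trained ViT behavior is preserved until the injected priors prove useful.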
Related papers
- ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
Vision Transformer (ViT) has achieved significant success in computer vision, but does not perform well in dense prediction tasks.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer.
We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-12T07:59:41Z)
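The ViT-CoMer entry above hinges on a bidirectional CNN-Transformer fusion module. As a rough, hypothetical illustration of what bidirectional fusion between a convolutional branch and a token branch can look like, the sketch below cross-attends in both directions at a single scale; the names and shapes are assumptions, and ViT-CoMer's actual module additionally fuses across multiple hierarchical scales.

```python
# Generic sketch of bidirectional CNN <-> Transformer fusion via cross-attention.
# Not ViT-CoMer's actual module; names, shapes, and the single scale are assumptions.
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vit = nn.LayerNorm(dim)
        self.norm_cnn = nn.LayerNorm(dim)

    def forward(self, vit_tokens: torch.Tensor, cnn_tokens: torch.Tensor):
        # ViT tokens query the CNN features: inject local, fine-grained detail.
        vit_out, _ = self.cnn_to_vit(self.norm_vit(vit_tokens), cnn_tokens, cnn_tokens)
        # CNN features query the ViT tokens: inject global, long-range context.
        cnn_out, _ = self.vit_to_cnn(self.norm_cnn(cnn_tokens), vit_tokens, vit_tokens)
        return vit_tokens + vit_out, cnn_tokens + cnn_out


# Usage: flatten CNN feature maps to (B, N, C) token form before fusing.
fuse = BidirectionalFusion(dim=256)
vit_tokens = torch.randn(2, 196, 256)      # e.g. 14x14 patch tokens
cnn_tokens = torch.randn(2, 784, 256)      # e.g. 28x28 conv features, flattened
vit_tokens, cnn_tokens = fuse(vit_tokens, cnn_tokens)
```

Flattening the CNN feature maps into token form lets both directions reuse standard multi-head attention.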
- Mini but Mighty: Finetuning ViTs with Mini Adapters [7.175668563148084]
Adapters perform poorly when their dimension is small.
We propose MiMi, a training framework that addresses this issue.
Our method achieves a better trade-off between accuracy and the number of trained parameters than existing methods.
arXiv Detail & Related papers (2023-11-07T10:41:27Z)
- Selective Feature Adapter for Dense Vision Transformers [30.409313135985528]
The selective feature adapter (SFA) achieves comparable or better performance than fully fine-tuned models across various dense tasks.
SFA consists of external and internal adapters that operate sequentially over a transformer model.
Experiments show that this dual adapter module, i.e., SFA, is essential to achieve the best trade-off on dense vision tasks.
arXiv Detail & Related papers (2023-10-03T07:17:58Z)
- $E(2)$-Equivariant Vision Transformer [11.94180035256023]
Vision Transformer (ViT) has achieved remarkable performance in computer vision.
The positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in the data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
arXiv Detail & Related papers (2023-06-11T16:48:03Z)
- AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformers, namely AdaptFormer.
It can efficiently adapt pre-trained ViTs to many different image and video tasks.
It increases the ViT's transferability without updating its original pre-trained parameters (see the sketch after this entry).
arXiv Detail & Related papers (2022-05-26T17:56:15Z)
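AdaptFormer and several other adapter papers in this list share one recipe: freeze the pre-trained ViT and train only a small set of adapter (plus task-head) parameters. The sketch below is a generic, hypothetical version of that recipe with a bottleneck adapter; the module layout, sizes, and helper names are assumptions for illustration, not AdaptFormer's actual design.

```python
# Generic parameter-efficient fine-tuning sketch: freeze the backbone, train only
# small bottleneck adapters. Illustrative assumptions only; not AdaptFormer's code.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)            # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


def make_parameter_efficient(backbone: nn.Module, adapters: nn.ModuleList):
    """Freeze all backbone weights; only the adapters receive gradients."""
    for p in backbone.parameters():
        p.requires_grad = False
    trainable = list(adapters.parameters())
    frozen = sum(p.numel() for p in backbone.parameters())
    print(f"trainable adapter params: {sum(p.numel() for p in trainable):,} "
          f"vs frozen backbone params: {frozen:,}")
    return trainable                               # hand these to the optimizer


# Usage with a stand-in backbone: one adapter per transformer block.
# (The sketch omits where each adapter is wired into its block's forward pass.)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12
)
adapters = nn.ModuleList([BottleneckAdapter(768) for _ in range(12)])
params = make_parameter_efficient(backbone, adapters)
optimizer = torch.optim.AdamW(params, lr=1e-4)
```

Zero-initializing the up-projection makes each adapter start as an identity mapping, a common choice for keeping early fine-tuning stable.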
- ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [71.40656211497162]
Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks.
We introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5.
Our results demonstrate that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
arXiv Detail & Related papers (2021-12-13T17:35:26Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [103.03973037619532]
This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT), which is specifically designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can be trained on dense partitions of the image to achieve the high output resolution that is important for dense prediction.
arXiv Detail & Related papers (2021-02-24T08:33:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.