Vision Transformer Adapter for Dense Predictions
- URL: http://arxiv.org/abs/2205.08534v2
- Date: Wed, 18 May 2022 01:27:12 GMT
- Title: Vision Transformer Adapter for Dense Predictions
- Authors: Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
- Abstract summary: Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-specific prior information.
We propose a Vision Transformer Adapter (ViT-Adapter), which remedies the defects of ViT and achieves performance comparable to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
- Score: 57.590511173416445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates a simple yet powerful adapter for Vision Transformer
(ViT). Unlike recent vision transformers that introduce vision-specific
inductive biases into their architectures, the plain ViT achieves inferior
performance on dense prediction tasks because it lacks image-specific prior
information. To solve
this issue, we propose a Vision Transformer Adapter (ViT-Adapter), which can
remedy the defects of ViT and achieve comparable performance to vision-specific
models by introducing inductive biases via an additional architecture.
Specifically, the backbone in our framework is a vanilla transformer that can
be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a
modality-specific adapter is used to introduce prior information about the data
and tasks into the model, making it suitable for these tasks. We verify the
effectiveness of our ViT-Adapter on multiple downstream tasks, including object
detection, instance segmentation, and semantic segmentation. Notably, when
using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO
test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic
segmentation, our ViT-Adapter-L establishes a new state-of-the-art of 60.5 mIoU
on ADE20K val, 0.6 points higher than SwinV2-G. We hope that the proposed
ViT-Adapter could serve as an alternative to vision-specific transformers and
facilitate future research. The code and models will be released at
https://github.com/czczup/ViT-Adapter.
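The abstract describes the overall recipe at a level a short sketch can make concrete: keep a plain ViT backbone and let a separate adapter inject image priors into its tokens when fine-tuning for dense tasks. Below is a minimal, hypothetical PyTorch sketch of that general idea, using a small convolutional spatial-prior module whose features are simply added to the patch tokens. All module names, sizes, and the add-based fusion are illustrative assumptions, not the released ViT-Adapter implementation (see the repository above for the actual, more involved design).

```python
# Minimal, illustrative sketch of "plain ViT + adapter that injects image priors".
# Names, sizes, and add-based fusion are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Standard ViT-style patch embedding: 16x16 conv, then flatten to tokens."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)      # (B, N, C)


class SpatialPriorModule(nn.Module):
    """A small conv stem that supplies local image priors at overall stride 16."""

    def __init__(self, dim: int = 384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, dim, 3, stride=2, padding=1),     # stride 16 overall
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)                                  # (B, C, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)               # (B, N, C), same grid as patches


class AdapterAugmentedViT(nn.Module):
    """Plain ViT blocks plus an adapter branch that injects conv priors into the tokens."""

    def __init__(self, vit_blocks: nn.Module, dim: int = 384):
        super().__init__()
        self.patch_embed = PatchEmbed(dim)
        self.vit_blocks = vit_blocks                         # any (B, N, C) -> (B, N, C) encoder
        self.spm = SpatialPriorModule(dim)
        self.gamma = nn.Parameter(torch.zeros(dim))          # learn how much prior to inject

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x)                         # (B, N, C)
        priors = self.spm(x)                                 # (B, N, C)
        tokens = tokens + self.gamma * priors                # inject image priors
        return self.vit_blocks(tokens)                       # dense heads consume these tokens


# Quick shape check with a stand-in encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=2
)
model = AdapterAugmentedViT(encoder, dim=384)
out = model(torch.randn(1, 3, 224, 224))                     # -> (1, 196, 384)
```

The zero-initialized gamma keeps the adapter branch silent at the start of fine-tuning, so the pre-trained ViT behavior is preserved until the injected priors prove useful.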
Related papers
- ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
Vision Transformer (ViT) has achieved significant success in computer vision, but does not perform well in dense prediction tasks.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer.
We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-12T07:59:41Z)
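The ViT-CoMer entry above hinges on a bidirectional CNN-Transformer fusion module. As a rough, hypothetical illustration of what bidirectional fusion between a convolutional branch and a token branch can look like, the sketch below cross-attends in both directions at a single scale; the names and shapes are assumptions, and ViT-CoMer's actual module additionally fuses across multiple hierarchical scales.

```python
# Generic sketch of bidirectional CNN <-> Transformer fusion via cross-attention.
# Not ViT-CoMer's actual module; names, shapes, and the single scale are assumptions.
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vit = nn.LayerNorm(dim)
        self.norm_cnn = nn.LayerNorm(dim)

    def forward(self, vit_tokens: torch.Tensor, cnn_tokens: torch.Tensor):
        # ViT tokens query the CNN features: inject local, fine-grained detail.
        vit_out, _ = self.cnn_to_vit(self.norm_vit(vit_tokens), cnn_tokens, cnn_tokens)
        # CNN features query the ViT tokens: inject global, long-range context.
        cnn_out, _ = self.vit_to_cnn(self.norm_cnn(cnn_tokens), vit_tokens, vit_tokens)
        return vit_tokens + vit_out, cnn_tokens + cnn_out


# Usage: flatten CNN feature maps to (B, N, C) token form before fusing.
fuse = BidirectionalFusion(dim=256)
vit_tokens = torch.randn(2, 196, 256)      # e.g. 14x14 patch tokens
cnn_tokens = torch.randn(2, 784, 256)      # e.g. 28x28 conv features, flattened
vit_tokens, cnn_tokens = fuse(vit_tokens, cnn_tokens)
```

Flattening the CNN feature maps into token form lets both directions reuse standard multi-head attention.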
- Mini but Mighty: Finetuning ViTs with Mini Adapters [7.175668563148084]
Adapters perform poorly when their dimension is small.
We propose MiMi, a training framework that addresses this issue.
Our method achieves a better trade-off between accuracy and the number of trained parameters than existing methods.
arXiv Detail & Related papers (2023-11-07T10:41:27Z)
- Selective Feature Adapter for Dense Vision Transformers [30.409313135985528]
The selective feature adapter (SFA) achieves comparable or better performance than fully fine-tuned models across various dense tasks.
SFA consists of external and internal adapters that operate sequentially over a transformer model.
Experiments show that this dual adapter module, i.e., SFA, is essential to achieve the best trade-off on dense vision tasks.
arXiv Detail & Related papers (2023-10-03T07:17:58Z)
- $E(2)$-Equivariant Vision Transformer [11.94180035256023]
Vision Transformer (ViT) has achieved remarkable performance in computer vision.
The positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in the data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
arXiv Detail & Related papers (2023-06-11T16:48:03Z)
- AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformers, namely AdaptFormer.
It can efficiently adapt pre-trained ViTs to many different image and video tasks.
It increases the ViT's transferability without updating its original pre-trained parameters (see the sketch after this entry).
arXiv Detail & Related papers (2022-05-26T17:56:15Z)
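AdaptFormer and several other adapter papers in this list share one recipe: freeze the pre-trained ViT and train only a small set of adapter (plus task-head) parameters. The sketch below is a generic, hypothetical version of that recipe with a bottleneck adapter; the module layout, sizes, and helper names are assumptions for illustration, not AdaptFormer's actual design.

```python
# Generic parameter-efficient fine-tuning sketch: freeze the backbone, train only
# small bottleneck adapters. Illustrative assumptions only; not AdaptFormer's code.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)            # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


def make_parameter_efficient(backbone: nn.Module, adapters: nn.ModuleList):
    """Freeze all backbone weights; only the adapters receive gradients."""
    for p in backbone.parameters():
        p.requires_grad = False
    trainable = list(adapters.parameters())
    frozen = sum(p.numel() for p in backbone.parameters())
    print(f"trainable adapter params: {sum(p.numel() for p in trainable):,} "
          f"vs frozen backbone params: {frozen:,}")
    return trainable                               # hand these to the optimizer


# Usage with a stand-in backbone: one adapter per transformer block.
# (The sketch omits where each adapter is wired into its block's forward pass.)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12
)
adapters = nn.ModuleList([BottleneckAdapter(768) for _ in range(12)])
params = make_parameter_efficient(backbone, adapters)
optimizer = torch.optim.AdamW(params, lr=1e-4)
```

Zero-initializing the up-projection makes each adapter start as an identity mapping, a common choice for keeping early fine-tuning stable.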
- ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [71.40656211497162]
Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks.
We introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5.
Our results demonstrate that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
arXiv Detail & Related papers (2021-12-13T17:35:26Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [103.03973037619532]
This work investigates a simple backbone network useful for many dense prediction tasks without convolutions.
Unlike the recently proposed Transformer model (e.g., ViT), which is specifically designed for image classification, we propose the Pyramid Vision Transformer (PVT).
PVT can be trained on dense partitions of the image to achieve the high output resolution that is important for dense prediction.
arXiv Detail & Related papers (2021-02-24T08:33:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.