Improving Visual Prompt Tuning for Self-supervised Vision Transformers
- URL: http://arxiv.org/abs/2306.05067v1
- Date: Thu, 8 Jun 2023 09:31:28 GMT
- Title: Improving Visual Prompt Tuning for Self-supervised Vision Transformers
- Authors: Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, Sungroh Yoon
- Abstract summary: Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks.
We propose a method that learns a gate for each ViT block to adjust its intervention into the prompt tokens.
Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation.
- Score: 29.930641613984438
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Prompt Tuning (VPT) is an effective tuning method for adapting
pretrained Vision Transformers (ViTs) to downstream tasks. It leverages extra
learnable tokens, known as prompts, which steer the frozen pretrained ViTs.
Although VPT has demonstrated its applicability with supervised vision
transformers, it often underperforms with self-supervised ones. Through
empirical observations, we deduce that the effectiveness of VPT hinges largely
on the ViT blocks with which the prompt tokens interact. Specifically, VPT
shows improved performance on image classification tasks for MAE and MoCo v3
when the prompt tokens are inserted into later blocks rather than the first
block. These observations suggest that there exists an optimal location of
blocks for the insertion of prompt tokens. Unfortunately, identifying the
optimal blocks for prompts within each self-supervised ViT for diverse future
scenarios is a costly process. To mitigate this problem, we propose a simple
yet effective method that learns a gate for each ViT block to adjust its
intervention into the prompt tokens. With our method, prompt tokens are
selectively influenced by blocks that require steering for task adaptation. Our
method outperforms VPT variants in FGVC and VTAB image classification and
ADE20K semantic segmentation. The code is available at
https://github.com/ryongithub/GatedPromptTuning.
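To make the gating idea concrete, below is a minimal PyTorch sketch of per-block gated prompt tuning. It is not the authors' released implementation (see the repository above for that); it assumes a timm-style frozen ViT backbone with a class-token positional embedding, and names such as GatedPromptViT, num_prompts, and the per-block gates parameter are illustrative assumptions. The exact gating formulation in the paper may differ (e.g., gates applied inside attention rather than on block outputs).

```python
# Hedged sketch of gated visual prompt tuning on a frozen, pretrained ViT.
# Assumes a timm VisionTransformer layout (patch_embed, cls_token, pos_embed,
# blocks, norm); identifiers below are illustrative, not from the paper's repo.
import torch
import torch.nn as nn
import timm


class GatedPromptViT(nn.Module):
    def __init__(self, backbone="vit_base_patch16_224", num_prompts=10, num_classes=100):
        super().__init__()
        self.vit = timm.create_model(backbone, pretrained=True)
        for p in self.vit.parameters():          # freeze the pretrained backbone
            p.requires_grad_(False)

        depth, dim = len(self.vit.blocks), self.vit.embed_dim
        # Learnable prompt tokens shared across blocks (VPT-style).
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        # One scalar gate per block; sigmoid(gate) scales how much that block
        # rewrites the prompt tokens (near 0 = bypass, near 1 = full update).
        self.gates = nn.Parameter(torch.zeros(depth))
        self.head = nn.Linear(dim, num_classes)  # task-specific head

    def forward(self, x):
        B = x.shape[0]
        tokens = self.vit.patch_embed(x)
        cls = self.vit.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)
        tokens = self.vit.pos_drop(tokens + self.vit.pos_embed)

        prompts = self.prompts.expand(B, -1, -1)
        n_p = prompts.shape[1]
        for i, blk in enumerate(self.vit.blocks):
            out = blk(torch.cat([tokens, prompts], dim=1))
            tokens, new_prompts = out[:, :-n_p], out[:, -n_p:]
            g = torch.sigmoid(self.gates[i])
            # The gate decides how strongly block i intervenes in the prompts.
            prompts = g * new_prompts + (1.0 - g) * prompts

        tokens = self.vit.norm(tokens)
        return self.head(tokens[:, 0])            # classify from the CLS token
```

In this sketch only `prompts`, `gates`, and `head` receive gradients. With the gates initialized to zero, every block starts at a 0.5 intervention weight, and training can push each gate toward bypassing or fully steering the prompt tokens, which mirrors the observation above that only some blocks benefit from interacting with the prompts.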
Related papers
- Selective Visual Prompting in Vision Mamba [35.86547398432339]
Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks.
Existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models.
We introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim.
arXiv Detail & Related papers (2024-12-12T05:24:06Z)
- Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking [11.361394596302334]
ABTrack is an adaptive computation framework that bypasses transformer blocks for efficient visual tracking.
We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed.
We introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block.
arXiv Detail & Related papers (2024-06-12T09:39:18Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning [36.843950725332476]
Visual Prompt Tuning (VPT) techniques adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
We introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning.
Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.
arXiv Detail & Related papers (2024-02-27T10:55:07Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves over the course of training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Learning Expressive Prompting With Residuals for Vision Transformers [11.342913284654706]
We present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViTs).
We apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show our method achieves state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark.
arXiv Detail & Related papers (2023-03-27T20:47:01Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.