Improving Visual Prompt Tuning for Self-supervised Vision Transformers
- URL: http://arxiv.org/abs/2306.05067v1
- Date: Thu, 8 Jun 2023 09:31:28 GMT
- Title: Improving Visual Prompt Tuning for Self-supervised Vision Transformers
- Authors: Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, Sungroh Yoon
- Abstract summary: Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks.
We propose a method that learns a gate for each ViT block to adjust its intervention into the prompt tokens.
Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation.
- Score: 29.930641613984438
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Prompt Tuning (VPT) is an effective tuning method for adapting
pretrained Vision Transformers (ViTs) to downstream tasks. It leverages extra
learnable tokens, known as prompts, which steer the frozen pretrained ViTs.
Although VPT has demonstrated its applicability with supervised vision
transformers, it often underperforms with self-supervised ones. Through
empirical observations, we deduce that the effectiveness of VPT hinges largely
on the ViT blocks with which the prompt tokens interact. Specifically, VPT
shows improved performance on image classification tasks for MAE and MoCo v3
when the prompt tokens are inserted into later blocks rather than the first
block. These observations suggest that there exists an optimal location of
blocks for the insertion of prompt tokens. Unfortunately, identifying the
optimal blocks for prompts within each self-supervised ViT for diverse future
scenarios is a costly process. To mitigate this problem, we propose a simple
yet effective method that learns a gate for each ViT block to adjust its
intervention into the prompt tokens. With our method, prompt tokens are
selectively influenced by blocks that require steering for task adaptation. Our
method outperforms VPT variants in FGVC and VTAB image classification and
ADE20K semantic segmentation. The code is available at
https://github.com/ryongithub/GatedPromptTuning.
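To make the gating idea concrete, the following PyTorch sketch shows one minimal way to realize it. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): prompt tokens are assumed to be the first tokens in the sequence, and each frozen block carries a single learnable scalar gate that controls how much the block's output is allowed to update the prompts.

```python
# Minimal sketch of per-block gated prompt tuning. Assumptions (not from the paper's
# code): prompts are the first `num_prompts` tokens, and one scalar gate per block.
import torch
import torch.nn as nn


class GatedPromptBlock(nn.Module):
    """Wraps a frozen ViT block; a learnable gate scales the block's update to the prompts."""

    def __init__(self, vit_block: nn.Module, num_prompts: int):
        super().__init__()
        self.block = vit_block
        for p in self.block.parameters():   # pretrained backbone stays frozen
            p.requires_grad = False
        self.num_prompts = num_prompts
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_prompts + other_tokens, dim), prompt tokens first.
        out = self.block(x)
        g = torch.sigmoid(self.gate_logit)              # gate value in (0, 1)
        prompts_in = x[:, : self.num_prompts]
        prompts_out = out[:, : self.num_prompts]
        rest_out = out[:, self.num_prompts :]
        # A block with a small gate barely touches the prompts; a block with a large
        # gate steers them strongly. Patch/CLS tokens are updated as usual.
        prompts = (1.0 - g) * prompts_in + g * prompts_out
        return torch.cat([prompts, rest_out], dim=1)
```

During tuning, only the prompt embeddings, the gate logits, and the task head would be trained; a block whose gate is learned to be near zero effectively leaves the prompts untouched, mirroring the observation that not every block benefits from intervening on them.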
Related papers
- Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement [17.496082209866923]
We refine two key modules of ViTs: attention maps and token embeddings.
For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation.
We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our Forward Propagation Refinement (FPR) outperforms the current best (backward) surrogate refinement by up to 7.0% on average.
arXiv Detail & Related papers (2025-03-19T16:44:23Z)
- Selective Visual Prompting in Vision Mamba [35.86547398432339]
Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks.
Existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models.
We introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim.
arXiv Detail & Related papers (2024-12-12T05:24:06Z)
- Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking [11.361394596302334]
ABTrack is an adaptive computation framework that bypasses transformer blocks for efficient visual tracking.
We propose a Bypass Decision Module (BDM) to determine whether a transformer block should be bypassed (a minimal sketch follows this entry).
We introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block.
arXiv Detail & Related papers (2024-06-12T09:39:18Z)
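As a rough illustration only (the class name, the pooled decision head, and the hard threshold below are assumptions, not ABTrack's actual design), a per-block bypass gate could look like this:

```python
# Hypothetical sketch of a block-bypass decision; illustrative, not ABTrack's code.
import torch
import torch.nn as nn


class BypassableBlock(nn.Module):
    """Runs the wrapped transformer block only when a small decision head votes to keep it."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        self.decision = nn.Linear(dim, 1)   # assumed lightweight decision head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); pool tokens to get one bypass score per sample.
        score = torch.sigmoid(self.decision(x.mean(dim=1)))   # (batch, 1)
        keep = (score > 0.5).float().unsqueeze(-1)            # hard keep/bypass decision
        # For clarity the block is always evaluated here; a real implementation would
        # skip the computation entirely for bypassed samples to save FLOPs.
        return keep * self.block(x) + (1.0 - keep) * x
```

A hard threshold like this is not differentiable, so training such a gate typically requires a relaxation (e.g. Gumbel-Softmax or a straight-through estimator), which is omitted here for brevity.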
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improving both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning [36.843950725332476]
Visual Prompt Tuning (VPT) techniques adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
We introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning.
Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.
arXiv Detail & Related papers (2024-02-27T10:55:07Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompt tokens and patch tokens evolves over the course of successful training.
Inspired by the observation that prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes (a rough sketch follows this entry).
Our method significantly advances adaptation for self-supervised pretraining, achieving task performance gains of at least 10% and up to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
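The prototype-based initialization can be sketched roughly as below. The function name, the timm-style `forward_features` call, and the mean-pooling over patch tokens are assumptions made to keep the example runnable; they are not the paper's exact procedure.

```python
# Rough sketch: derive prompt initializations from downstream class prototypes.
# Assumption: `backbone.forward_features` returns patch tokens of shape (B, N, D).
import torch


@torch.no_grad()
def prompts_from_prototypes(backbone, loader, num_prompts: int, device: str = "cpu"):
    backbone.eval().to(device)
    sums, counts = {}, {}
    for images, labels in loader:
        tokens = backbone.forward_features(images.to(device))  # (B, N, D)
        feats = tokens.mean(dim=1)                              # pool over patch tokens
        for feat, y in zip(feats, labels.tolist()):
            sums[y] = sums.get(y, 0.0) + feat
            counts[y] = counts.get(y, 0) + 1
    # One prototype per class; keep the first `num_prompts` as initial prompt embeddings.
    protos = torch.stack([sums[y] / counts[y] for y in sorted(sums)])
    return protos[:num_prompts].clone()                         # (num_prompts, D)
```

The returned tensor would then be copied into the learnable prompt embeddings before tuning begins.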
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
ViTs are ill-suited to private inference with secure multi-party computation protocols because of their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy (the general idea is sketched after this entry).
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
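To make the "Taylorize" idea concrete, the sketch below swaps GELU for its second-order Taylor expansion around zero, the kind of polynomial substitution that secure multi-party computation can evaluate cheaply. It illustrates the generic substitution only; PriViT's contribution is learning which nonlinearities to replace, which is not shown here.

```python
# Generic sketch of "Taylorizing" a nonlinearity (not PriViT's selection scheme):
# GELU(x) = x * Phi(x) ~ 0.5 * x + x**2 / sqrt(2 * pi) near x = 0.
import math

import torch
import torch.nn as nn


class TaylorGELU(nn.Module):
    """Second-order polynomial stand-in for GELU, friendly to MPC-style evaluation."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return 0.5 * x + (x * x) / math.sqrt(2.0 * math.pi)


def taylorize_gelus(model: nn.Module) -> nn.Module:
    """Replace every nn.GELU in `model` with the polynomial approximation above."""
    for name, child in model.named_children():
        if isinstance(child, nn.GELU):
            setattr(model, name, TaylorGELU())
        else:
            taylorize_gelus(child)          # recurse into submodules
    return model
```

Replacing every GELU indiscriminately, as this helper does, is the crudest version of the idea; PriViT instead selects substitutions so that prediction accuracy is maintained.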
- MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers [14.787864686489032]
We introduce a conditional gating mechanism that selects the optimal token scale for every image region.
We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level.
In contrast to token pruning, MSViT does not discard information about the input and can therefore be readily applied to dense tasks.
arXiv Detail & Related papers (2023-07-05T14:22:31Z)
- Learning Expressive Prompting With Residuals for Vision Transformers [11.342913284654706]
We present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of Vision Transformers (ViTs).
We apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show that our method achieves state-of-the-art prompt tuning performance on 3/3 categories of the VTAB benchmark.
arXiv Detail & Related papers (2023-03-27T20:47:01Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision Transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.