Attention to the Burstiness in Visual Prompt Tuning!
- URL: http://arxiv.org/abs/2506.22908v2
- Date: Mon, 18 Aug 2025 02:11:06 GMT
- Title: Attention to the Burstiness in Visual Prompt Tuning!
- Authors: Yuzhu Wang, Manni Duan, Shu Kong
- Abstract summary: Visual Prompt Tuning (VPT) is a fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings. We propose whitening these data, de-correlating them and equalizing their variance to make them more Gaussian before learning prompts.
- Score: 10.857651069130979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings and the key and query projectors within the Transformer's self-attention module. Furthermore, the values of patch embeddings and of the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance to make them more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and the ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., by >25 accuracy points on the CUB dataset; interestingly, it learns "bursty prompts". Extending the bilinear model, which is known to introduce burstiness, we present a compact, low-rank version that learns two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
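The recipe the abstract sketches — derive a whitening matrix from feature statistics, then parameterize the prompts as a low-rank bilinear product mapped through it — can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: the paper derives the whitening matrix from the interaction of patch embeddings with the ViT's key and query projectors, whereas here `whitening_matrix` performs plain ZCA whitening over a sample of patch embeddings, and all names and hyperparameters are hypothetical.
```python
import torch
import torch.nn as nn

def whitening_matrix(feats: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # ZCA whitening: de-correlate the feature dimensions and equalize their
    # variance, pushing heavy-tailed (Laplacian-like) statistics toward Gaussian.
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = feats.T @ feats / (feats.shape[0] - 1)       # (d, d) covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)          # symmetric eigendecomposition
    return eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T

class BilinearPrompt(nn.Module):
    # Compact low-rank bilinear prompts: P = (A @ B) @ W, where W is a fixed
    # whitening matrix and A, B are the two small learnable factors.
    def __init__(self, num_prompts: int, dim: int, rank: int, W: torch.Tensor):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_prompts, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.register_buffer("W", W)                   # derived once, never trained

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        prompts = (self.A @ self.B) @ self.W           # (num_prompts, dim)
        prompts = prompts.unsqueeze(0).expand(patch_embeds.shape[0], -1, -1)
        return torch.cat([prompts, patch_embeds], dim=1)  # prepend prompt tokens

# usage sketch: W is computed once from a random sample of patch embeddings
# W = whitening_matrix(sample_patch_embeddings)        # sample: (n, dim)
# tokens = BilinearPrompt(10, 768, rank=4, W=W)(patch_embeds)
```
Note that optimization happens in whitened coordinates, while the effective prompt (A @ B) @ W lives in the original space; this is consistent with the abstract's observation that the method ends up learning "bursty prompts".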
Related papers
- DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers [13.964106147449051]
We leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. We propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token.
arXiv Detail & Related papers (2025-05-29T17:31:26Z)
- PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation [53.32478229070946]
We introduce adaptive distribution optimization (ADO) by tackling two key questions: (1) how to appropriately and formally define ADO, and (2) how to design an adaptive distribution strategy guided by this definition. We propose a new VPT framework termed PRO-VPT, which adaptively adjusts the prompt distribution via a nested optimization formulation. Our proposal can adaptively learn the optimal prompt distribution in a nested optimization-based manner, thereby unlocking the full potential of VPT.
arXiv Detail & Related papers (2025-03-10T04:07:43Z)
- On the Expressiveness of Visual Prompt Experts [27.283335463524576]
Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. We propose Visual Adaptive Prompt Tuning (VAPT), a novel method that endows prompt experts with enhanced expressiveness while preserving parameter efficiency.
arXiv Detail & Related papers (2025-01-31T07:41:06Z)
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
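As a rough illustration of this idea (not the VFPT implementation; the split ratio, the names, and the choice of a 2D FFT with only the real part kept are all assumptions), one can transform a fraction of the learnable prompts into the frequency domain before prepending them:
```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    # A fraction of the prompts is passed through a 2D FFT over the token and
    # channel axes, so both spatial- and frequency-domain prompt information
    # reaches the frozen backbone; taking the real part keeps dtypes compatible.
    def __init__(self, num_prompts: int, dim: int, fourier_frac: float = 0.5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.n_fourier = int(num_prompts * fourier_frac)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        freq = torch.fft.fft2(self.prompts[: self.n_fourier]).real
        spatial = self.prompts[self.n_fourier :]
        prompts = torch.cat([freq, spatial], dim=0)
        prompts = prompts.unsqueeze(0).expand(patch_embeds.shape[0], -1, -1)
        return torch.cat([prompts, patch_embeds], dim=1)
```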
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
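A minimal sketch of such a projection, assuming the previous-task subspace is approximated by the top right-singular vectors of the stacked feature matrix (the names and the rank are hypothetical, and this is not the paper's exact approximation):
```python
import torch

def nullspace_project(grad: torch.Tensor, prev_feats: torch.Tensor,
                      rank: int = 32) -> torch.Tensor:
    # Low-rank basis for the subspace spanned by previous tasks' features,
    # taken from the top right-singular vectors of the (n, d) feature matrix.
    _, _, Vh = torch.linalg.svd(prev_feats, full_matrices=False)
    U = Vh[:rank].T                                    # (d, rank) basis
    # Remove gradient components lying in that subspace, so the prompt update
    # moves (approximately) within its null space and avoids interference.
    return grad - (grad @ U) @ U.T                     # grad: (num_prompts, d)

# after loss.backward(), before optimizer.step():
# prompt.grad.copy_(nullspace_project(prompt.grad, prev_feats))
```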
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling [32.603558214472265]
We introduce Attention Prompt Tuning (APT) for video-based applications such as action recognition.
APT involves injecting a set of learnable prompts along with data tokens during fine-tuning while keeping the backbone frozen.
The proposed approach greatly reduces the number of FLOPs and latency while achieving a significant performance boost.
arXiv Detail & Related papers (2024-03-11T17:59:41Z)
- LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning [36.843950725332476]
Visual Prompt Tuning (VPT) techniques adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts.
We introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning.
Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.
arXiv Detail & Related papers (2024-02-27T10:55:07Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves during proficient training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
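A rough sketch of prototype-based prompt initialization follows; the `forward_features` call is an assumed timm-style API, and the chunked means stand in for proper clustering, so treat every name here as hypothetical:
```python
import torch

@torch.no_grad()
def prototype_prompt_init(vit, loader, num_prompts: int, device: str = "cuda"):
    # Collect patch tokens from the frozen backbone over a few batches.
    tokens = []
    for images, _ in loader:
        feats = vit.forward_features(images.to(device))  # assumed: (B, N, D)
        tokens.append(feats[:, 1:].flatten(0, 1))        # drop CLS token
        if sum(t.shape[0] for t in tokens) > 50_000:
            break
    tokens = torch.cat(tokens)
    # Group shuffled tokens into num_prompts chunks and average each one,
    # so every prompt starts near a downstream token prototype.
    perm = torch.randperm(tokens.shape[0], device=tokens.device)
    chunks = tokens[perm].chunk(num_prompts)
    return torch.stack([c.mean(dim=0) for c in chunks])  # (num_prompts, D)
```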
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
- Distribution-Aware Prompt Tuning for Vision-Language Models [20.02599087680773]
A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed.
Inspired by this observation, we propose distribution-aware prompt tuning (DAPT) for vision-language models.
Our experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability.
arXiv Detail & Related papers (2023-09-06T23:49:11Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Do We Really Need a Large Number of Visual Prompts? [23.85637456240694]
We analyze the impact of the number of prompts on fine-tuning performance and self-attention operation in a vision transformer architecture.
We propose a Prompt Condensation (PC) technique that aims to prevent performance degradation from using a small number of prompts.
arXiv Detail & Related papers (2023-05-26T19:31:57Z)
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z)
- Unified Vision and Language Prompt Learning [86.1530128487077]
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
arXiv Detail & Related papers (2022-10-13T17:50:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.