Visual Instance-aware Prompt Tuning
- URL: http://arxiv.org/abs/2507.07796v1
- Date: Thu, 10 Jul 2025 14:23:15 GMT
- Title: Visual Instance-aware Prompt Tuning
- Authors: Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, Min Xu
- Abstract summary: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers. We propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input. ViaPT overcomes limitations by balancing dataset-level and instance-level knowledge, while reducing the number of learnable parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to the high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two conceptual corner cases that fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the number of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
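The abstract does not specify exactly where PCA enters the pipeline, so the following PyTorch sketch is only one plausible reading: a hypothetical `instance_head` predicts per-input prompts from pooled patch embeddings, these are fused with a shared dataset-level prompt, and a truncated SVD (the standard way to compute a PCA projection) keeps the top `keep` principal components of the fused prompts. All names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViaPTPromptLayer(nn.Module):
    """Minimal sketch of instance-aware prompt fusion; names/shapes are assumptions."""

    def __init__(self, num_prompts: int = 8, dim: int = 768, keep: int = 4):
        super().__init__()
        # Shared dataset-level prompt, as in conventional VPT
        self.dataset_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Hypothetical generator mapping an image summary to per-instance prompts
        self.instance_head = nn.Linear(dim, num_prompts * dim)
        self.keep = keep

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) embeddings from the frozen backbone
        B, _, dim = patch_tokens.shape
        pooled = patch_tokens.mean(dim=1)                    # (B, dim)
        inst = self.instance_head(pooled).view(B, -1, dim)   # (B, P, dim)
        fused = self.dataset_prompt.unsqueeze(0) + inst      # fuse both prompt sources
        # PCA-style reduction: keep only the top-k principal components per instance
        mean = fused.mean(dim=1, keepdim=True)
        U, S, Vh = torch.linalg.svd(fused - mean, full_matrices=False)
        low_rank = (U[:, :, :self.keep] * S[:, None, :self.keep]) @ Vh[:, :self.keep, :]
        return low_rank + mean   # prompts to prepend to the token sequence
```

In use, these prompts would be concatenated with the patch tokens before the (frozen) transformer blocks, exactly as dataset-level prompts are in standard VPT.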
Related papers
- DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
We leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. We propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token.
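The summary does not give DA-VPT's actual objective; the sketch below is a generic triplet-style metric-learning loss on prompt features, with `class_prototypes` (e.g., running means of per-class [CLS] features) as an assumed construct, meant only to illustrate how such a loss can pull prompt representations toward class semantics.

```python
import torch
import torch.nn.functional as F

def prompt_alignment_loss(prompts, labels, class_prototypes, margin=0.5):
    """Hedged sketch of a metric-learning objective on prompts (not the paper's loss).

    prompts:          (B, P, D) per-sample prompt embeddings
    labels:           (B,) long tensor of class indices
    class_prototypes: (C, D) assumed per-class anchors, e.g. running [CLS] means
    """
    p = F.normalize(prompts.mean(dim=1), dim=-1)        # (B, D) pooled prompt feature
    protos = F.normalize(class_prototypes, dim=-1)      # (C, D)
    sims = p @ protos.T                                 # (B, C) cosine similarities
    pos = sims.gather(1, labels[:, None]).squeeze(1)    # similarity to own class
    neg = sims.scatter(1, labels[:, None], float('-inf')).max(dim=1).values
    return F.relu(margin - pos + neg).mean()            # hinge: pull own class, push rest
```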
arXiv Detail & Related papers (2025-05-29T17:31:26Z)
- Visual Variational Autoencoder Prompt Tuning
This paper introduces V²APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts. Experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods.
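Only the high-level idea (a VAE that produces input-dependent prompts) is stated, so this is a minimal sketch under assumed shapes: a per-image latent is sampled with the reparameterization trick and decoded into prompt tokens, with the usual KL term returned for the training loss.

```python
import torch
import torch.nn as nn

class VAEPromptGenerator(nn.Module):
    """Minimal sketch of VAE-style input-dependent prompts; architecture assumed."""

    def __init__(self, dim=768, latent=64, num_prompts=8):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)            # predicts mu and log-variance
        self.dec = nn.Linear(latent, num_prompts * dim)  # decodes latent into prompts
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, patch_tokens):
        pooled = patch_tokens.mean(dim=1)                      # (B, dim) image summary
        mu, logvar = self.enc(pooled).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        prompts = self.dec(z).view(-1, self.num_prompts, self.dim)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return prompts, kl                                     # add kl to the task loss
```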
arXiv Detail & Related papers (2025-03-22T04:59:51Z)
- On the Expressiveness of Visual Prompt Experts
Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. We propose Visual Adaptive Prompt Tuning (VAPT), a novel method that endows prompt experts with enhanced expressiveness while preserving parameter efficiency.
arXiv Detail & Related papers (2025-01-31T07:41:06Z)
- Visual Fourier Prompt Tuning
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
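The summary says VFPT injects the Fast Fourier Transform into prompt embeddings so that both spatial- and frequency-domain information are used; the exact recipe is not given here, so the sketch below simply routes half of the prompt tokens through a 2D FFT (the split ratio and shapes are assumptions).

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    """Sketch of Fourier-augmented prompts; the half/half split is an assumption."""

    def __init__(self, num_prompts=8, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, batch_size):
        half = self.prompts.shape[0] // 2
        spatial = self.prompts[:half]                    # untouched spatial-domain tokens
        # 2D FFT over the (prompt, dim) grid; keep the real part so dtypes match
        freq = torch.fft.fft2(self.prompts[half:]).real
        out = torch.cat([spatial, freq], dim=0)          # (P, dim) combined prompts
        return out.unsqueeze(0).expand(batch_size, -1, -1)
```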
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- CVPT: Cross Visual Prompt Tuning
Cross Visual Prompt Tuning (CVPT) introduces a cross-attention module to model interactions between prompts and image tokens. CVPT achieves over 4% higher average accuracy, rivaling leading adapter-based methods in both performance and efficiency. Our work confirms that prompt-based methods can achieve exceptional results in visual fine-tuning.
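Cross-attention between prompts and image tokens is a standard construction, sketched below with prompts as queries and image tokens as keys/values; the head count and dimensions are illustrative, not CVPT's actual configuration.

```python
import torch
import torch.nn as nn

class CrossPromptAttention(nn.Module):
    """Sketch of prompt-to-image cross-attention; hyperparameters are illustrative."""

    def __init__(self, dim=768, heads=8, num_prompts=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens):                     # image_tokens: (B, N, dim)
        q = self.prompts.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        updated, _ = self.attn(q, image_tokens, image_tokens)
        return updated                                   # prompts refined by image content
```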
arXiv Detail & Related papers (2024-08-27T11:07:19Z)
- Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?
Visual Prompt Tuning is a parameter-efficient transfer learning technique.
We conduct a comprehensive analysis across 19 distinct datasets and tasks.
Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization.
arXiv Detail & Related papers (2024-01-23T16:48:18Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Explicit Visual Prompting for Low-Level Structure Segmentations
We propose a new visual prompting model named Explicit Visual Prompting (EVP).
EVP significantly outperforms other parameter-efficient tuning protocols under the same amount of tunable parameters.
EVP also achieves state-of-the-art performances on diverse low-level structure segmentation tasks.
arXiv Detail & Related papers (2023-03-20T06:01:53Z)
- Diversity-Aware Meta Visual Prompting
We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient prompting method for transferring pre-trained models to downstream tasks with a frozen backbone.
We cluster the downstream dataset into small subsets in a diversity-adaptive way, with each subset having its own prompt optimized separately.
All the prompts are optimized with a meta-prompt, which is learned across several datasets.
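Given the described pipeline (cluster the dataset, give each cluster its own prompt, initialize all prompts from a meta-learned one), inference reduces to routing each image to its nearest cluster's prompt. The sketch below assumes `prototypes` come from k-means over frozen-backbone features and `prompt_bank` is a (K, P, D) parameter tensor; both names are hypothetical.

```python
import torch

def assign_prompts(features, prototypes, prompt_bank):
    """Sketch of diversity-aware prompt routing (details assumed, not DAM-VP's code).

    features:    (B, D) frozen-backbone embeddings of the input images
    prototypes:  (K, D) cluster centers, e.g. from k-means on the training set
    prompt_bank: (K, P, D) one learnable prompt set per cluster, meta-initialized
    """
    dists = torch.cdist(features, prototypes)   # (B, K) Euclidean distances
    cluster_ids = dists.argmin(dim=1)           # nearest cluster per image
    return prompt_bank[cluster_ids]             # (B, P, D) per-image prompts
```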
arXiv Detail & Related papers (2023-03-14T17:59:59Z)
- Unified Vision and Language Prompt Learning
We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning.
A major finding is that text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances.
To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities.
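UPT's "tiny neural network" that jointly optimizes prompts across modalities is not specified in this summary; the sketch below uses one shared set of learnable tokens with two small MLP branches, so gradients from the text and image encoders shape a single underlying prompt representation. Sizes and branch design are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedPromptGenerator(nn.Module):
    """Sketch of unified prompt tuning; the branch architecture is assumed."""

    def __init__(self, num_prompts=4, dim=512):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.to_text = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.to_visual = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self):
        # Both branches read the same learnable tokens, so text and visual
        # prompts are optimized jointly through one shared parameter set.
        return self.to_text(self.shared), self.to_visual(self.shared)
```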
arXiv Detail & Related papers (2022-10-13T17:50:24Z)