Facing the Elephant in the Room: Visual Prompt Tuning or Full
Finetuning?
- URL: http://arxiv.org/abs/2401.12902v1
- Date: Tue, 23 Jan 2024 16:48:18 GMT
- Title: Facing the Elephant in the Room: Visual Prompt Tuning or Full
Finetuning?
- Authors: Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan
Qi, Dongfang Liu
- Abstract summary: Visual Prompt Tuning is a parameter-efficient transfer learning technique.
We conduct a comprehensive analysis across 19 distinct datasets and tasks.
Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization.
- Score: 92.23438255540968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the scale of vision models continues to grow, the emergence of Visual
Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has
gained attention due to its superior performance compared to traditional
full-finetuning. However, the conditions favoring VPT (the "when") and the
underlying rationale (the "why") remain unclear. In this paper, we conduct a
comprehensive analysis across 19 distinct datasets and tasks. To understand the
"when" aspect, we identify the scenarios where VPT proves favorable by two
dimensions: task objectives and data distributions. We find that VPT is
preferable when there is 1) a substantial disparity between the original and
the downstream task objectives (e.g., transitioning from classification to
counting), or 2) a similarity in data distributions between the two tasks
(e.g., both involve natural images). In exploring the "why" dimension, our
results indicate VPT's success cannot be attributed solely to overfitting and
optimization considerations. The unique way VPT preserves original features and
adds parameters appears to be a pivotal factor. Our study provides insights
into VPT's mechanisms, and offers guidance for its optimal utilization.
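To make the "parameter-efficient" claim in the abstract concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the paper. It counts the parameters that would be trained when only prompt tokens and a linear head are learned and the backbone stays frozen. The function name, the backbone size (roughly that of a ViT-B/16), the prompt length of 50, the 100-class head, and the shallow/deep split (prompts at the first layer only versus at every layer) are all illustrative assumptions.

```python
# Illustrative arithmetic only: every number below is an assumption,
# not a value reported in the paper.

def vpt_trainable_params(num_prompts, embed_dim, num_layers, num_classes, deep=True):
    """Count learnable prompt tokens plus a linear head; the backbone is frozen."""
    # Deep prompting learns a separate set of prompts per layer,
    # shallow prompting learns prompts for the first layer only.
    prompted_layers = num_layers if deep else 1
    prompt_params = num_prompts * embed_dim * prompted_layers
    head_params = embed_dim * num_classes + num_classes  # weights + bias
    return prompt_params + head_params

BACKBONE_PARAMS = 86_000_000  # assumed rough size of a ViT-B/16 backbone

shallow = vpt_trainable_params(50, 768, 12, 100, deep=False)
deep = vpt_trainable_params(50, 768, 12, 100, deep=True)

for name, count in [("shallow", shallow), ("deep", deep)]:
    share = 100 * count / BACKBONE_PARAMS
    print(f"VPT-{name}: {count:,} trainable params ({share:.2f}% of backbone)")
```

Under these assumptions both variants train well under 1% of the backbone's parameter count, whereas full finetuning would update all of it.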
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task [15.642102189777072]
Cross Visual Prompt Tuning is a new type of visual fine-tuning.
CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them.
CVPT significantly improves VPT's performance and efficiency in visual tasks.
arXiv Detail & Related papers (2024-08-27T11:07:19Z)
- What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? [49.84679952948808]
Recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks.
We conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors.
Our work culminates in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks.
arXiv Detail & Related papers (2024-03-10T04:23:24Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pretrained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Visual Prompt Tuning [74.5309408185523]
This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision.
VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen.
arXiv Detail & Related papers (2022-03-23T01:17:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.