Related papers: CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task

URL: http://arxiv.org/abs/2408.14961v1
Date: Tue, 27 Aug 2024 11:07:19 GMT
Title: CVPT: Cross-Attention help Visual Prompt Tuning adapt visual task
Authors: Lingyun Huang, Jianxu Mao, Yaonan Wang, Junfei Yi, Ziming Tao,
Abstract summary: Cross Visual Prompt Tuning is a new type of visual fine-tuning. CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them. CVPT significantly improves VPT's performance and efficiency in visual tasks.
Score: 15.642102189777072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the rapid expansion of model sizes has led to large-scale pre-trained models demonstrating remarkable capabilities. Consequently, there has been a trend towards increasing the scale of models. However, this trend introduces significant challenges, including substantial computational costs of training and transfer to downstream tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter's relatively weaker performance and efficiency. Under the circumstances, we refine the widely-used Visual Prompt Tuning (VPT) method, proposing Cross Visual Prompt Tuning (CVPT). CVPT calculates cross-attention between the prompt tokens and the embedded tokens, which allows us to compute the semantic relationship between them and conduct the fine-tuning of models exactly to adapt visual tasks better. Furthermore, we introduce the weight-sharing mechanism to initialize the parameters of cross-attention, which avoids massive learnable parameters from cross-attention and enhances the representative capability of cross-attention. We conduct comprehensive testing across 25 datasets and the result indicates that CVPT significantly improves VPT's performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT over 4% in average accuracy, rivaling the advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning.

Related papers

Visual Instance-aware Prompt Tuning [21.538712755298413]
Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers.<n>We propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input.<n>ViaPT overcomes limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters.
arXiv Detail & Related papers (2025-07-10T14:23:15Z)
Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning [27.703316805290843]
Visual Prompt Tuning (VPT) has emerged as a powerful method for adapting pre-trained vision models to downstream tasks. We propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts that redefines prompts as adaptive functions of the input. Our theoretical analysis shows that VAPT achieves optimal sample efficiency.
arXiv Detail & Related papers (2025-01-31T07:41:06Z)
Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models. Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information. Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation. DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores the correlation evolvement between prompts and patch tokens during proficient training. Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes. Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning? [92.23438255540968]
Visual Prompt Tuning is a parameter-efficient transfer learning technique. We conduct a comprehensive analysis across 19 distinct datasets and tasks. Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization.
arXiv Detail & Related papers (2024-01-23T16:48:18Z)
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Visual Prompt Tuning for Test-time Domain Adaptation [48.16620171809511]
We propose a simple recipe called data-efficient prompt tuning (DePT) with two key ingredients. We find such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. With much fewer parameters, DePT demonstrates not only state-of-the-art performance on major adaptation benchmarks, but also superior data efficiency.
arXiv Detail & Related papers (2022-10-10T16:45:13Z)
Visual Prompt Tuning [74.5309408185523]
This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen.
arXiv Detail & Related papers (2022-03-23T01:17:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.