Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning
- URL: http://arxiv.org/abs/2309.06123v1
- Date: Tue, 12 Sep 2023 10:47:37 GMT
- Title: Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning
- Authors: Chunqing Ruan, Hongjian Wang
- Abstract summary: We propose a Dynamic Visual Prompt Tuning framework (DVPT), which can generate a dynamic instance-wise token for each image.
In this way, it can capture the unique visual features of each image, making it better suited to downstream visual tasks.
Experiments on a wide range of downstream recognition tasks show that DVPT outperforms other PETL methods.
- Score: 0.8430481660019451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parameter-efficient transfer learning (PETL) is an emerging
research area that aims to adapt large-scale pre-trained models to
downstream tasks. Recent advances have achieved great success in saving
storage and computation costs. However, these methods do not take
instance-specific visual cues into account for visual tasks. In this
paper, we propose a Dynamic Visual Prompt Tuning framework (DVPT), which
generates a dynamic instance-wise token for each image. In this way, it
can capture the unique visual features of each image, making it better
suited to downstream visual tasks. We design a Meta-Net module that
generates learnable prompts conditioned on each image, thereby capturing
dynamic instance-wise visual features. Extensive experiments on a wide
range of downstream recognition tasks show that DVPT outperforms other
PETL methods. More importantly, DVPT even outperforms full fine-tuning on
17 out of 19 downstream tasks while maintaining high parameter
efficiency. Our code will be released soon.
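The abstract only outlines the mechanism, so here is a minimal sketch of what an instance-wise prompt generator of this kind could look like. The module name, dimensions, and pooling choice are assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Hypothetical instance-wise prompt generator (a sketch, not the
    authors' code): maps a pooled image descriptor to prompt tokens."""
    def __init__(self, feat_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.feat_dim = feat_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 16),  # lightweight bottleneck
            nn.ReLU(),
            nn.Linear(feat_dim // 16, num_prompts * feat_dim),
        )

    def forward(self, patch_tokens):            # (B, N, D) from a frozen ViT
        pooled = patch_tokens.mean(dim=1)        # (B, D) global image summary
        prompts = self.net(pooled)               # (B, num_prompts * D)
        return prompts.view(-1, self.num_prompts, self.feat_dim)

# Usage: prepend the dynamic prompts to the frozen backbone's token
# sequence; only MetaNet's parameters receive gradients.
tokens = torch.randn(2, 196, 768)
prompts = MetaNet()(tokens)                      # (2, 4, 768)
augmented = torch.cat([prompts, tokens], dim=1)  # (2, 200, 768)
```

The key design point is that the prompt tokens are a function of the input image, unlike static prompt tuning, where the same learned tokens are prepended to every image, while the backbone itself stays frozen.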
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs.
Training requires only text-only instructions and image-caption data, which are used to separately learn task-solving and visual-perception abilities.
Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z)
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation [41.77434289193232]
We propose a novel visual prompt design that introduces low-rank matrix multiplication for visual prompting (LoR-VP).
LoR-VP enables shared and patch-specific information across rows and columns of image pixels.
Experiments demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods.
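A minimal sketch of how such a low-rank additive prompt could be parameterized (the shapes, names, and additive placement are assumptions based on the summary above, not the paper's code):

```python
import torch
import torch.nn as nn

class LowRankVisualPrompt(nn.Module):
    """Sketch of a low-rank additive visual prompt: the prompt is the
    product B @ A, so parameters grow with the rank r rather than H*W."""
    def __init__(self, channels=3, height=224, width=224, rank=4):
        super().__init__()
        self.B = nn.Parameter(torch.randn(channels, height, rank) * 0.01)
        self.A = nn.Parameter(torch.randn(channels, rank, width) * 0.01)

    def forward(self, images):                  # (N, C, H, W)
        prompt = torch.bmm(self.B, self.A)      # (C, H, W) low-rank pattern
        return images + prompt.unsqueeze(0)     # broadcast over the batch

x = torch.randn(8, 3, 224, 224)
prompted = LowRankVisualPrompt()(x)             # same shape, prompt added
```

With rank r the prompt costs C*(H+W)*r parameters instead of C*H*W, and the B @ A product lets each pixel's perturbation mix row-wise and column-wise factors, matching the "shared and patch-specific information across rows and columns" described above.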
arXiv Detail & Related papers (2025-02-02T20:10:48Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
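The summary does not specify the sampling algorithm; the following is only a plausible approximation of "gradient-aware adaptive sampling", using a direction-change (curvature) proxy to place more sequence points where a mask boundary bends:

```python
import numpy as np

def adaptive_contour_sample(contour, n_points=16):
    """Curvature-weighted sampling of a mask boundary into a point
    sequence. This is an approximation of the idea, not VistaLLM's
    algorithm: sharp bends receive more of the point budget."""
    prev_pt = np.roll(contour, 1, axis=0)       # contour: (M, 2) ordered pts
    next_pt = np.roll(contour, -1, axis=0)
    bend = np.linalg.norm(next_pt - 2 * contour + prev_pt, axis=1)
    weights = bend + 1e-3                       # keep flat segments sampled
    cdf = np.cumsum(weights) / weights.sum()
    targets = (np.arange(n_points) + 0.5) / n_points
    idx = np.searchsorted(cdf, targets)         # dense where curvature is high
    return contour[np.clip(idx, 0, len(contour) - 1)]

# Usage: boundary of a 10x10 square; the four corners attract more samples.
e = np.arange(10, dtype=float)
square = np.concatenate([
    np.stack([e, np.zeros(10)], 1),             # bottom edge
    np.stack([np.full(10, 9.0), e], 1),         # right edge
    np.stack([e[::-1], np.full(10, 9.0)], 1),   # top edge
    np.stack([np.zeros(10), e[::-1]], 1),       # left edge
])
pts = adaptive_contour_sample(square, n_points=12)
```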
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification [15.780372479483235]
PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models.
We propose the Meta Visual Prompt Tuning (MVP) method, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen.
We introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes.
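As a rough illustration of patch-embedding recombination (one reading of the one-line description above, not the paper's exact augmentation), an augmented sample can be built by randomly mixing the patch embeddings of two same-class images:

```python
import torch

def recombine_patch_embeddings(emb_a, emb_b, keep_ratio=0.5):
    """Sketch of patch-embedding recombination: take each patch embedding
    from one of two same-class images to increase scene diversity."""
    n = emb_a.shape[0]                          # emb_a, emb_b: (N, D)
    mask = (torch.rand(n, 1) < keep_ratio).to(emb_a.dtype)
    return mask * emb_a + (1.0 - mask) * emb_b

a, b = torch.randn(196, 768), torch.randn(196, 768)
augmented = recombine_patch_embeddings(a, b)    # (196, 768) mixed sample
```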
arXiv Detail & Related papers (2023-09-17T13:51:05Z)
- Explicit Visual Prompting for Universal Foreground Segmentations [55.51869354956533]
We present a unified framework for a number of foreground segmentation tasks without any task-specific designs.
We take inspiration from the pre-training followed by prompt tuning protocols widely used in NLP.
Our method freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters.
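This freeze-then-prompt protocol is common to several entries in this list; a generic sketch of it (not EVP's specific prompt design) looks like:

```python
import torch
import torch.nn as nn

def freeze_and_add_prompts(backbone: nn.Module, num_prompts=8, dim=768):
    """Generic freeze-then-prompt recipe shared by these PETL methods:
    all pre-trained weights are frozen and only a handful of new prompt
    parameters receive gradients."""
    for p in backbone.parameters():
        p.requires_grad = False                 # pre-trained weights fixed
    prompts = nn.Parameter(torch.zeros(num_prompts, dim))
    return prompts                              # the only trainable tensor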
arXiv Detail & Related papers (2023-05-29T11:05:01Z)
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
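One plausible reading of "dynamically determining prompt factors" is an instance-conditioned mixture over a pool of candidate prompts; the sketch below is an illustration under that assumption, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class DynamicPromptSelector(nn.Module):
    """A lightweight scorer mixes a pool of candidate prompts per input
    instead of using a single fixed prompt (assumed design, for
    illustration only)."""
    def __init__(self, feat_dim=768, pool_size=4, prompt_len=8):
        super().__init__()
        self.pool = nn.Parameter(
            torch.randn(pool_size, prompt_len, feat_dim) * 0.02)
        self.scorer = nn.Linear(feat_dim, pool_size)

    def forward(self, pooled_feat):             # (B, D) per-instance summary
        w = self.scorer(pooled_feat).softmax(dim=-1)       # (B, P)
        return torch.einsum("bp,pld->bld", w, self.pool)   # (B, L, D)
```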
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view, called visual-PETL (V-PETL), to investigate the different aspects affecting the trade-off between performance and parameter efficiency.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach for leveraging pre-trained vision models to perform downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
- Parameter-Efficient Image-to-Video Transfer Learning [66.82811235484607]
Large pre-trained models have recently emerged with promising performance on various downstream tasks of interest.
Due to ever-growing model sizes, the standard full fine-tuning strategy for task adaptation becomes costly in terms of model training and storage.
We propose a new Spatio-Adapter for parameter-efficient fine-tuning on each video task.
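The summary gives no details of the Spatio-Adapter itself; the standard bottleneck adapter pattern that such modules typically build on looks like this (a generic sketch, not the paper's architecture):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-projection, nonlinearity, and
    up-projection with a residual connection around the whole block."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                       # (B, N, D) frozen-layer output
        return x + self.up(torch.relu(self.down(x)))
```

Zero-initializing the up-projection makes the adapter an identity function at the start of training, so inserting it does not perturb the frozen pre-trained model's behavior.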
arXiv Detail & Related papers (2022-06-27T18:02:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.