Selective Visual Prompting in Vision Mamba
- URL: http://arxiv.org/abs/2412.08947v1
- Date: Thu, 12 Dec 2024 05:24:06 GMT
- Title: Selective Visual Prompting in Vision Mamba
- Authors: Yifeng Yao, Zichen Liu, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou
- Abstract summary: Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks.
Existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models.
We introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim.
- Score: 35.86547398432339
- Abstract: Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at https://github.com/zhoujiahuan1991/AAAI2025-SVP.
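To make the mechanism described in the abstract more concrete, the following is a minimal PyTorch sketch of token-wise selective prompting with a dual Cross-/Inner-Prompting path. It is an illustration based only on the abstract, not the authors' released code: the module names (SelectivePrompter, DualPathPrompting), the bottleneck size, and the simplification of adding the generated prompts to the token stream entering each frozen Vim block (rather than injecting them into the update and forget gate projections directly) are all assumptions.

```python
import torch
import torch.nn as nn

class SelectivePrompter(nn.Module):
    """Lightweight prompter (hypothetical bottleneck-MLP design): generates an
    additive prompt for every token, so gating inside the Mamba block can be
    activated adaptively along the whole sequence instead of only by a few
    prefixed prompt tokens."""
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # start with a zero prompt
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) -> prompts of the same shape
        return self.up(torch.relu(self.down(tokens)))

class DualPathPrompting(nn.Module):
    """Dual-path structure: one prompter shared by all layers (Cross-Prompting)
    plus one prompter per layer (Inner-Prompting); their outputs are summed and
    added to the tokens fed into the frozen Vim/Mamba block."""
    def __init__(self, d_model: int, num_layers: int, bottleneck: int = 16):
        super().__init__()
        self.cross_prompter = SelectivePrompter(d_model, bottleneck)
        self.inner_prompters = nn.ModuleList(
            SelectivePrompter(d_model, bottleneck) for _ in range(num_layers)
        )

    def forward(self, tokens: torch.Tensor, layer_idx: int) -> torch.Tensor:
        prompt = self.cross_prompter(tokens) + self.inner_prompters[layer_idx](tokens)
        return tokens + prompt
```

In such a parameter-efficient setup, only the prompters (and typically a task head) would be trained while the pre-trained Vim backbone stays frozen; the shared cross_prompter carries cross-layer information and each per-layer inner_prompter captures layer-specific information.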
Related papers
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation [41.77434289193232]
We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP)
LoR-VP enables shared and patch-specific information across rows and columns of image pixels.
Experiments demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods.
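As a rough illustration of the low-rank idea summarized above (and not the authors' implementation), the sketch below builds a pixel-space prompt per channel as the product of two learnable low-rank factors, so information is shared across rows and columns of the image; all names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class LowRankVisualPrompt(nn.Module):
    """Pixel-space visual prompt built from low-rank factors (illustrative).
    A per-channel prompt of shape (H, W) is formed as B @ A, where B is
    (H, rank) and A is (rank, W), so rows share information through A and
    columns through B, with far fewer parameters than a full H x W prompt."""
    def __init__(self, channels: int = 3, height: int = 224, width: int = 224, rank: int = 4):
        super().__init__()
        self.B = nn.Parameter(torch.zeros(channels, height, rank))
        self.A = nn.Parameter(torch.zeros(channels, rank, width))
        nn.init.normal_(self.B, std=0.02)
        nn.init.normal_(self.A, std=0.02)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, C, H, W); prompt: (C, H, W) broadcast over the batch
        prompt = torch.bmm(self.B, self.A)  # (C, H, rank) @ (C, rank, W) -> (C, H, W)
        return images + prompt.unsqueeze(0)
```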
arXiv Detail & Related papers (2025-02-02T20:10:48Z)
- Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning [29.19130646630545]
We introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network in our work.
MIP is designed to effectively utilize both invariant and specific information for identification.
Our proposed MIP performs better than most state-of-the-art methods.
arXiv Detail & Related papers (2024-06-18T06:39:03Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of the FFN for visual knowledge injection.
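The summary above describes injecting visual prompts into the FFN weights rather than into the token sequence; the hedged sketch below shows one way such memory-space injection could look, treating projected visual features as extra key/value rows read out by the FFN's existing matrix multiplications. The module and parameter names are hypothetical and the design is a simplification, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNWithVisualMemory(nn.Module):
    """Frozen FFN augmented with visual 'memory' rows (illustrative only).
    Projected visual features act as additional key/value rows, so visual
    knowledge is read out by the same matmuls the FFN already performs and
    no extra prompt tokens are appended to the sequence."""
    def __init__(self, d_model: int, d_ffn: int, d_visual: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn)            # pre-trained, kept frozen
        self.w2 = nn.Linear(d_ffn, d_model)            # pre-trained, kept frozen
        self.to_keys = nn.Linear(d_visual, d_model)    # trainable projection
        self.to_values = nn.Linear(d_visual, d_model)  # trainable projection

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); visual: (batch, n_visual, d_visual)
        keys = self.to_keys(visual)                # (batch, n_visual, d_model)
        values = self.to_values(visual)            # (batch, n_visual, d_model)
        hidden = F.gelu(self.w1(x))                # standard FFN path
        scores = F.gelu(x @ keys.transpose(1, 2))  # (batch, seq, n_visual)
        return self.w2(hidden) + scores @ values   # FFN output + visual read-out
```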
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- Progressive Multi-modal Conditional Prompt Tuning [92.50645776024624]
Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting.
We propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT)
ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information.
arXiv Detail & Related papers (2024-04-18T02:40:31Z)
- iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection [34.20778042463112]
We propose a novel visual prompt tuning (VPT) approach, iVPT.
It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information.
Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens.
arXiv Detail & Related papers (2024-04-08T05:23:12Z)
- Distribution-Aware Prompt Tuning for Vision-Language Models [20.02599087680773]
A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed.
Inspired by this observation, we propose distribution-aware prompt tuning (DAPT) for vision-language models.
Our experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability.
arXiv Detail & Related papers (2023-09-06T23:49:11Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Progressive Visual Prompt Learning with Contrastive Feature Re-formation [15.385630262368661]
We propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers.
Our ProVP can effectively propagate the image embeddings to deep layers and behaves partially like an instance-adaptive prompting method.
To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models outperform previous prompt-based methods on downstream tasks.
arXiv Detail & Related papers (2023-04-17T15:54:10Z)
- Explicit Visual Prompting for Low-Level Structure Segmentations [55.51869354956533]
We propose a new visual prompting model, named Explicit Visual Prompting (EVP)
EVP significantly outperforms other parameter-efficient tuning protocols with the same amount of tunable parameters.
EVP also achieves state-of-the-art performance on diverse low-level structure segmentation tasks.
arXiv Detail & Related papers (2023-03-20T06:01:53Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)