iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection
- URL: http://arxiv.org/abs/2404.05207v1
- Date: Mon, 8 Apr 2024 05:23:12 GMT
- Title: iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection
- Authors: Nan Zhou, Jiaxin Chen, Di Huang
- Abstract summary: We propose a novel visual prompt tuning (VPT) approach, iVPT.
It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information.
Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens.
- Score: 34.20778042463112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, iVPT. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens, which are further enhanced by prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT over state-of-the-art counterparts.
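The abstract names three components: CDC mixes prompt tokens across adjacent layers, DA gates how much information flows between them, and AR additively boosts salient image tokens with prompt information. Below is a minimal PyTorch sketch of what such a design could look like, inferred from the abstract alone; all names (CDCPromptLayer, attentive_reinforcement, gate, top_k) and the saliency criterion are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CDCPromptLayer(nn.Module):
    """Hypothetical sketch of a cross-layer dynamic connection (CDC)
    with dynamic aggregation (DA), inferred from the abstract only."""

    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        # Fresh learnable prompt tokens for this transformer layer.
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        # DA: a learned gate scoring how much of each prompt token
        # from the previous layer should be carried forward.
        self.gate = nn.Linear(dim, 1)

    def forward(self, prev_prompt_out: torch.Tensor) -> torch.Tensor:
        # prev_prompt_out: (B, P, D), the previous layer's output at
        # the prompt positions.
        alpha = torch.sigmoid(self.gate(prev_prompt_out))  # (B, P, 1)
        # CDC: dynamically mix carried-over prompt information with
        # this layer's own prompts before prepending them to the input.
        return alpha * prev_prompt_out + (1.0 - alpha) * self.prompts


def attentive_reinforcement(image_tokens: torch.Tensor,
                            prompt_tokens: torch.Tensor,
                            cls_attn: torch.Tensor,
                            top_k: int = 16) -> torch.Tensor:
    """AR sketch: additively boost the top-k image tokens, ranked by
    the attention they receive from the [CLS] token, with the mean
    prompt token. Ranking by CLS attention is an assumption."""
    # image_tokens: (B, N, D); prompt_tokens: (B, P, D); cls_attn: (B, N)
    idx = cls_attn.topk(top_k, dim=-1).indices       # (B, k) salient positions
    boost = prompt_tokens.mean(dim=1, keepdim=True)  # (B, 1, D)
    mask = torch.zeros_like(cls_attn)                # (B, N)
    mask.scatter_(1, idx, 1.0)                       # mark salient tokens
    # AR: enhance only the salient image tokens, additively.
    return image_tokens + mask.unsqueeze(-1) * boost
```

In a full model, a per-layer loop would run each block's prompt outputs through its CDCPromptLayer and prepend the result to the next block's input; the top_k rule above merely stands in for whatever saliency criterion the paper actually uses.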
Related papers
- Selective Visual Prompting in Vision Mamba [35.86547398432339]
Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks.
Existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models.
We introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim.
arXiv Detail & Related papers (2024-12-12T05:24:06Z)
- KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples by using textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
- Enhancing Graph Contrastive Learning with Reliable and Informative Augmentation for Recommendation [84.45144851024257]
We propose a novel framework that aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes.
The core idea is to map users and items into discrete codes rich in collaborative information for reliable and informative contrastive view generation.
arXiv Detail & Related papers (2024-09-09T14:04:17Z)
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- ECAFormer: Low-light Image Enhancement using Cross Attention [11.554554006307836]
Low-light image enhancement (LLIE) is critical in computer vision.
We design a hierarchical mutual enhancement network using a cross-attention transformer (ECAFormer).
We show that ECAFormer reaches competitive performance across multiple benchmarks, yielding nearly a 3% improvement in PSNR over the second-best method.
arXiv Detail & Related papers (2024-06-19T07:21:31Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed.
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval [55.29233996427243]
Low-shot sketch-based image retrieval is an emerging task in computer vision.
In this paper, we address any-shot, i.e., zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks.
To solve these tasks, we propose a semantically aligned cycle-consistent generative adversarial network (SEM-PCYC).
Our results demonstrate a significant boost in any-shot performance over the state-of-the-art on the extended version of the Sketchy, TU-Berlin and QuickDraw datasets.
arXiv Detail & Related papers (2020-06-20T22:43:53Z)
- Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.