LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
- URL: http://arxiv.org/abs/2402.17406v1
- Date: Tue, 27 Feb 2024 10:55:07 GMT
- Title: LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
- Authors: Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li
- Abstract summary: Visual Prompt Tuning (VPT) techniques adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
We introduce Long-term Spatial Prompt Tuning (LSPT), a new approach to visual representation learning.
Our empirical findings underscore the superiority of LSPT, which sets a new state of the art in visual prompt tuning performance.
- Score: 36.843950725332476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts. Contemporary VPT methods, especially when applied to self-supervised vision transformers, typically introduce new learnable prompts or gated prompt tokens sourced mainly from the model's immediately preceding block. A key oversight in such approaches is that they fail to exploit long-range previous blocks as sources of prompts within each self-supervised ViT. To bridge this gap, we introduce Long-term Spatial Prompt Tuning (LSPT), a new approach to visual representation learning. Inspired by the human brain, LSPT incorporates long-term gated prompts that act as temporal coding, reducing the risk of forgetting parameters learned in earlier blocks. LSPT further uses patch tokens as spatial coding, continually accumulating class-aware features that strengthen the model's ability to distinguish visual categories. To validate the proposed method, we conduct extensive experiments on 5 FGVC and 19 VTAB-1K benchmarks. The results underscore the superiority of LSPT, which sets a new state of the art in visual prompt tuning performance.
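The abstract describes two ingredients: prompts gated over long-range earlier blocks (temporal coding) and patch tokens pooled into the prompts (spatial coding). The PyTorch sketch below illustrates one way such a prompt module could be wired up; the class name, the sigmoid gating, and the mean-pooling of patch tokens are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LongTermGatedPrompts(nn.Module):
    """Prompts for a ViT stack, gated over all earlier blocks' prompts
    (temporal coding) and enriched with pooled patch tokens (spatial coding)."""

    def __init__(self, num_blocks: int, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        # Fresh learnable prompts for every block.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.zeros(num_prompts, dim)) for _ in range(num_blocks)]
        )
        # One scalar gate per (current block, earlier block) pair.
        self.gates = nn.Parameter(torch.zeros(num_blocks, num_blocks))
        # Projects mean-pooled patch tokens into the prompt space.
        self.spatial_proj = nn.Linear(dim, dim)

    def forward(self, block_idx, prompt_history, patch_tokens):
        # prompt_history: list of (B, P, D) prompts used by earlier blocks
        # patch_tokens:   (B, N, D) patch tokens entering the current block
        batch = patch_tokens.size(0)
        p = self.prompts[block_idx].unsqueeze(0).expand(batch, -1, -1)
        # Temporal coding: gated contributions from long-range previous blocks,
        # not only the immediately preceding one.
        for j, past in enumerate(prompt_history):
            p = p + torch.sigmoid(self.gates[block_idx, j]) * past
        # Spatial coding: inject pooled patch-token context into the prompts.
        p = p + self.spatial_proj(patch_tokens.mean(dim=1, keepdim=True))
        return p
```

In a full model, the prompts returned for block `block_idx` would be prepended to the patch tokens before that (frozen) block, and the prompts actually used would be appended to `prompt_history` for later blocks.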
Related papers
- Mixture of Experts Meets Prompt-Based Continual Learning [23.376460019465235]
This paper conducts a theoretical analysis of how prompts confer advantages in continual learning.
We provide a novel view of prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism, NoRGa.
The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms.
arXiv Detail & Related papers (2024-05-23T02:49:57Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves over the course of training.
Inspired by the observation that prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances adaptation for self-supervised pretraining, achieving task performance gains of 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach for efficient vision-language (VL) transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Improving Visual Prompt Tuning for Self-supervised Vision Transformers [29.930641613984438]
Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks.
We propose a method that learns a gate for each ViT block to adjust its intervention into the prompt tokens (a minimal sketch of this gating idea appears after this list).
Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation.
arXiv Detail & Related papers (2023-06-08T09:31:28Z)
- Progressive Visual Prompt Learning with Contrastive Feature Re-formation [15.385630262368661]
We propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers.
ProVP effectively propagates image embeddings to deep layers and behaves in part like an instance-adaptive prompting method.
To the best of our knowledge, we are the first to demonstrate that visual prompts in vision-language (V-L) models outperform previous prompt-based methods on downstream tasks.
arXiv Detail & Related papers (2023-04-17T15:54:10Z)
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM), which reformulates downstream visual classification as the pre-trained masked visual token prediction task.
VPTM is the first visual prompt method built on a generative pre-trained visual model, achieving consistency between pre-training and downstream visual classification through task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce TSViT, a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
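As referenced in the entry on Improving Visual Prompt Tuning for Self-supervised Vision Transformers, below is a minimal sketch of per-block gated prompt intervention. It assumes a frozen ViT block mapping token sequences of shape (B, N, D) to the same shape; the tanh gate and its zero initialization are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class GatedPromptBlock(nn.Module):
    """Wraps a frozen ViT block and learns a scalar gate controlling how much
    the block is allowed to modify the prompt tokens."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block                        # frozen, pre-trained ViT block
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: no intervention at start

    def forward(self, prompts: torch.Tensor, patches: torch.Tensor):
        num_prompts = prompts.size(1)
        out = self.block(torch.cat([prompts, patches], dim=1))
        new_prompts, new_patches = out[:, :num_prompts], out[:, num_prompts:]
        # Blend old and updated prompts according to the learned gate.
        gate = torch.tanh(self.gate)
        return prompts + gate * (new_prompts - prompts), new_patches
```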
This list is automatically generated from the titles and abstracts of the papers on this site.