SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt
- URL: http://arxiv.org/abs/2312.10376v1
- Date: Sat, 16 Dec 2023 08:23:43 GMT
- Title: SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt
- Authors: Wenjie Pei, Tongqi Xia, Fanglin Chen, Jinsong Li, Jiandong Tian,
Guangming Lu
- Abstract summary: Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
- Score: 59.280491260635266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a prominent parameter-efficient fine-tuning technique in NLP, prompt
tuning is now being explored for its potential in computer vision. Typical methods for
visual prompt tuning follow the sequential modeling paradigm stemming from NLP,
which represents an input image as a flattened sequence of token embeddings and
then learns a set of unordered parameterized tokens prefixed to the sequence
representation as the visual prompts for task adaptation of large vision
models. While such sequential modeling paradigm of visual prompt has shown
great promise, there are two potential limitations. First, the learned visual
prompts cannot model the underlying spatial relations in the input image, which
is crucial for image encoding. Second, since all prompt tokens play the same
role of prompting for all image tokens without distinction, it lacks the
fine-grained prompting capability, i.e., individual prompting for different
image tokens. In this work, we propose the \emph{SA$^2$VP} model,
which learns a two-dimensional prompt token map of equal (or scaled) size
relative to the image token map, and is thereby able to spatially align with the image map.
Each prompt token is designated to prompt knowledge only for the spatially
corresponding image tokens. As a result, our model can conduct individual
prompting for different image tokens in a fine-grained manner. Moreover,
benefiting from the capability of preserving the spatial structure by the
learned prompt token map, our \emph{SA$^2$VP} is able to model the spatial
relations in the input image, leading to more effective prompting. Extensive
experiments on three challenging benchmarks for image classification
demonstrate the superiority of our model over other state-of-the-art methods
for visual prompt tuning. Code is available at
\emph{https://github.com/tommy-xq/SA2VP}.
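
The core idea stated in the abstract, a two-dimensional prompt token map spatially aligned with the image token map so that each prompt token prompts only its spatially corresponding image tokens, can be illustrated with a minimal sketch. The module name, the additive prompting operation, and the tensor shapes below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a spatially aligned 2D prompt map for a ViT backbone.
# Each prompt token influences only the image token at the same spatial
# location (here via simple addition); names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAlignedPrompt(nn.Module):
    def __init__(self, embed_dim: int, prompt_hw: tuple = (14, 14)):
        super().__init__()
        h, w = prompt_hw
        # Learnable two-dimensional prompt token map of shape (1, C, H_p, W_p).
        self.prompt_map = nn.Parameter(torch.zeros(1, embed_dim, h, w))
        nn.init.trunc_normal_(self.prompt_map, std=0.02)

    def forward(self, image_tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        """image_tokens: (B, H*W, C) patch embeddings; grid_hw: token grid size."""
        B, N, C = image_tokens.shape
        H, W = grid_hw
        assert N == H * W
        # Resize the prompt map so it aligns one-to-one with the token grid.
        prompts = F.interpolate(self.prompt_map, size=(H, W),
                                mode="bilinear", align_corners=False)
        prompts = prompts.flatten(2).transpose(1, 2)  # (1, H*W, C)
        # Fine-grained prompting: each image token receives only the prompt
        # token at its own spatial position.
        return image_tokens + prompts

# Usage: prompt the 14x14 token grid of a ViT-B/16 on 224x224 inputs.
prompter = SpatiallyAlignedPrompt(embed_dim=768, prompt_hw=(14, 14))
tokens = torch.randn(2, 14 * 14, 768)
out = prompter(tokens, grid_hw=(14, 14))
print(out.shape)  # torch.Size([2, 196, 768])
```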
Related papers
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages an existing CLIP model to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens with the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder (see the sketch after this list).
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z)
- Rejuvenating image-GPT as Strong Visual Representation Learners [28.77567067712619]
This paper enhances image-GPT, one of the pioneering works that introduced autoregressive pretraining to predict next pixels.
We shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content.
Experiments showcase that D-iGPT excels as a strong learner of visual representations.
arXiv Detail & Related papers (2023-12-04T18:59:20Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method that adaptively assigns a token length to each image at test time to accelerate inference.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is preserving the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
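
As referenced in the Unmasked Token Alignment (UTA) entry above, the alignment step can be sketched as a token-level consistency loss between a trainable ViT and a frozen CLIP vision encoder, applied only at unmasked positions. The function name, the cosine-similarity loss form, and the toy tensors are illustrative assumptions, not the UTA authors' implementation.

```python
# Hedged sketch of unmasked-token alignment: the trainable ViT's patch tokens
# are pulled toward the corresponding tokens of a frozen CLIP vision encoder
# at unmasked positions, using a cosine-similarity loss (an assumption here).
import torch
import torch.nn.functional as F

def unmasked_token_alignment_loss(student_tokens: torch.Tensor,
                                  teacher_tokens: torch.Tensor,
                                  unmasked: torch.Tensor) -> torch.Tensor:
    """student_tokens, teacher_tokens: (B, N, C); unmasked: (B, N) boolean mask."""
    # Compare tokens only at unmasked positions; the teacher is frozen (detached).
    s = F.normalize(student_tokens[unmasked], dim=-1)
    t = F.normalize(teacher_tokens[unmasked], dim=-1).detach()
    # 1 - cosine similarity, averaged over the selected tokens.
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for ViT and frozen-CLIP tokens.
B, N, C = 2, 196, 768
student = torch.randn(B, N, C, requires_grad=True)
teacher = torch.randn(B, N, C)
unmasked = torch.rand(B, N) > 0.5
loss = unmasked_token_alignment_loss(student, teacher, unmasked)
loss.backward()
```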