SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt
- URL: http://arxiv.org/abs/2312.10376v1
- Date: Sat, 16 Dec 2023 08:23:43 GMT
- Title: SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt
- Authors: Wenjie Pei, Tongqi Xia, Fanglin Chen, Jinsong Li, Jiandong Tian,
Guangming Lu
- Abstract summary: Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
- Score: 59.280491260635266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a prominent parameter-efficient fine-tuning technique in NLP, prompt
tuning is now being explored for its potential in computer vision. Typical methods for
visual prompt tuning follow the sequential modeling paradigm stemming from NLP,
which represents an input image as a flattened sequence of token embeddings and
then learns a set of unordered parameterized tokens prefixed to the sequence
representation as the visual prompts for task adaptation of large vision
models. While such sequential modeling paradigm of visual prompt has shown
great promise, there are two potential limitations. First, the learned visual
prompts cannot model the underlying spatial relations in the input image, which
is crucial for image encoding. Second, since all prompt tokens play the same
role of prompting for all image tokens without distinction, it lacks the
fine-grained prompting capability, i.e., individual prompting for different
image tokens. In this work, we propose the \emph{SA$^2$VP} model,
which learns a two-dimensional prompt token map of equal (or scaled) size
relative to the image token map, and is thereby able to spatially align with the image map.
Each prompt token is designated to prompt knowledge only for the spatially
corresponding image tokens. As a result, our model can conduct individual
prompting for different image tokens in a fine-grained manner. Moreover,
benefiting from the capability of preserving the spatial structure by the
learned prompt token map, our \emph{SA$^2$VP} is able to model the spatial
relations in the input image, leading to more effective prompting. Extensive
experiments on three challenging benchmarks for image classification
demonstrate the superiority of our model over other state-of-the-art methods
for visual prompt tuning. Code is available at
\emph{https://github.com/tommy-xq/SA2VP}.
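
The core idea stated in the abstract, a two-dimensional prompt token map spatially aligned with the image token map so that each prompt token prompts only its spatially corresponding image tokens, can be illustrated with a minimal sketch. The module name, the additive prompting operation, and the tensor shapes below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a spatially aligned 2D prompt map for a ViT backbone.
# Each prompt token influences only the image token at the same spatial
# location (here via simple addition); names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAlignedPrompt(nn.Module):
    def __init__(self, embed_dim: int, prompt_hw: tuple = (14, 14)):
        super().__init__()
        h, w = prompt_hw
        # Learnable two-dimensional prompt token map of shape (1, C, H_p, W_p).
        self.prompt_map = nn.Parameter(torch.zeros(1, embed_dim, h, w))
        nn.init.trunc_normal_(self.prompt_map, std=0.02)

    def forward(self, image_tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        """image_tokens: (B, H*W, C) patch embeddings; grid_hw: token grid size."""
        B, N, C = image_tokens.shape
        H, W = grid_hw
        assert N == H * W
        # Resize the prompt map so it aligns one-to-one with the token grid.
        prompts = F.interpolate(self.prompt_map, size=(H, W),
                                mode="bilinear", align_corners=False)
        prompts = prompts.flatten(2).transpose(1, 2)  # (1, H*W, C)
        # Fine-grained prompting: each image token receives only the prompt
        # token at its own spatial position.
        return image_tokens + prompts

# Usage: prompt the 14x14 token grid of a ViT-B/16 on 224x224 inputs.
prompter = SpatiallyAlignedPrompt(embed_dim=768, prompt_hw=(14, 14))
tokens = torch.randn(2, 14 * 14, 768)
out = prompter(tokens, grid_hw=(14, 14))
print(out.shape)  # torch.Size([2, 196, 768])
```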
Related papers
- Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content.
Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages an existing CLIP model to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens with the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder (see the sketch after this list).
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z)
- Rejuvenating image-GPT as Strong Visual Representation Learners [28.77567067712619]
This paper enhances image-GPT, one of the pioneering works that introduced autoregressive pretraining to predict next pixels.
We shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content.
Experiments showcase that D-iGPT excels as a strong learner of visual representations.
arXiv Detail & Related papers (2023-12-04T18:59:20Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method that adaptively assigns a token length to each image at test time to accelerate inference.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is preserving the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
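
As referenced in the Unmasked Token Alignment (UTA) entry above, the alignment step can be sketched as a token-level consistency loss between a trainable ViT and a frozen CLIP vision encoder, applied only at unmasked positions. The function name, the cosine-similarity loss form, and the toy tensors are illustrative assumptions, not the UTA authors' implementation.

```python
# Hedged sketch of unmasked-token alignment: the trainable ViT's patch tokens
# are pulled toward the corresponding tokens of a frozen CLIP vision encoder
# at unmasked positions, using a cosine-similarity loss (an assumption here).
import torch
import torch.nn.functional as F

def unmasked_token_alignment_loss(student_tokens: torch.Tensor,
                                  teacher_tokens: torch.Tensor,
                                  unmasked: torch.Tensor) -> torch.Tensor:
    """student_tokens, teacher_tokens: (B, N, C); unmasked: (B, N) boolean mask."""
    # Compare tokens only at unmasked positions; the teacher is frozen (detached).
    s = F.normalize(student_tokens[unmasked], dim=-1)
    t = F.normalize(teacher_tokens[unmasked], dim=-1).detach()
    # 1 - cosine similarity, averaged over the selected tokens.
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for ViT and frozen-CLIP tokens.
B, N, C = 2, 196, 768
student = torch.randn(B, N, C, requires_grad=True)
teacher = torch.randn(B, N, C)
unmasked = torch.rand(B, N) > 0.5
loss = unmasked_token_alignment_loss(student, teacher, unmasked)
loss.backward()
```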