KNN Transformer with Pyramid Prompts for Few-Shot Learning
- URL: http://arxiv.org/abs/2410.10227v1
- Date: Mon, 14 Oct 2024 07:39:30 GMT
- Title: KNN Transformer with Pyramid Prompts for Few-Shot Learning
- Authors: Wenhao Li, Qiangchang Wang, Peng Zhao, Yilong Yin,
- Abstract summary: Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
- Score: 52.735070934075736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL due to the confusion of numerous irrelevant tokens during interaction. To address these aforementioned issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA only selects the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as the context prompt to provide the global context in three cascaded stages. As a result, irrelevant tokens can be progressively suppressed. Secondly, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, augmented visual features and class-aware prompts are interacted via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representation in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.
Related papers
- Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506]
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets.
ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification.
We introduce two modifications to the traditional self-attention framework; a novel aggressive convolution pooling strategy for local feature mixing, and a new conceptual attention transformation.
arXiv Detail & Related papers (2024-12-25T04:53:19Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment.
AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z) - Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z) - Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT)
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z) - Dissecting Query-Key Interaction in Vision Transformers [4.743574336827573]
Self-attention in vision transformers is often thought to perform perceptual grouping.
We analyze the query-key interaction by the singular value decomposition of the interaction matrix.
arXiv Detail & Related papers (2024-04-04T20:06:07Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch
and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach, such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.