Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
- URL: http://arxiv.org/abs/2509.03895v1
- Date: Thu, 04 Sep 2025 05:42:02 GMT
- Title: Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
- Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
- Abstract summary: Attn-Adapter is a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
- Score: 2.2099003320482393
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
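The abstract describes the two adapters but not their implementation. The following is a minimal sketch, assuming a frozen CLIP backbone that exposes a global image embedding, local patch tokens, per-class text embeddings, and encoded support examples; module names, shapes, and the residual/normalization choices are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class MemoryAttnAdapter(nn.Module):
    """Sketch: refine class (text) embeddings by cross-attending over support features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, class_emb: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # class_emb: (num_classes, dim), support_feats: (num_support, dim)
        q = class_emb.unsqueeze(0)          # (1, num_classes, dim)
        kv = support_feats.unsqueeze(0)     # (1, num_support, dim)
        refined, _ = self.attn(q, kv, kv)   # each class attends to the support set
        return self.norm(class_emb + refined.squeeze(0))


class LocalGlobalAttnAdapter(nn.Module):
    """Sketch: enrich the global image embedding by attending over local patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_emb: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # global_emb: (batch, dim), local_tokens: (batch, num_patches, dim)
        q = global_emb.unsqueeze(1)                              # (batch, 1, dim)
        enriched, _ = self.attn(q, local_tokens, local_tokens)
        return self.norm(global_emb + enriched.squeeze(1))


def classify(global_emb, local_tokens, class_emb, support_feats,
             mem_adapter, lg_adapter, temperature: float = 0.01):
    """Cosine-similarity logits between the adapted image and class embeddings."""
    img = lg_adapter(global_emb, local_tokens)
    cls = mem_adapter(class_emb, support_feats)
    img = img / img.norm(dim=-1, keepdim=True)
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return img @ cls.t() / temperature
```

Under this reading, only the two small attention modules adapt to a new support set, which is consistent with the claim of dynamic adaptation without retraining the base model.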
Related papers
- A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification [0.7746379804154433]
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction.
We propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck equipped with multi-head self-attention to model inter-patch dependencies.
With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost.
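The summary only specifies a bottleneck equipped with multi-head self-attention over patch tokens; a minimal sketch under that reading follows, with the dimensions, residual placement, and normalization assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn


class MHAdapter(nn.Module):
    """Sketch: bottleneck adapter with multi-head self-attention over patch tokens."""

    def __init__(self, dim: int = 768, bottleneck: int = 128, num_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from a frozen CLIP image encoder
        h = self.down(patch_tokens)
        h, _ = self.attn(h, h, h)                     # model inter-patch dependencies
        return self.norm(patch_tokens + self.up(h))   # residual back into CLIP's space
```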
arXiv Detail & Related papers (2026-02-18T16:41:32Z)
- Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning [21.093665370734684]
We develop a novel adapter for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks.
The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data.
The proposed LatHAdapter consistently outperforms many other fine-tuning approaches.
arXiv Detail & Related papers (2025-08-15T03:02:36Z)
- AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection [39.72202031440292]
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning.
Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images.
We present a simple yet effective method called AdaptCLIP based on two key insights.
arXiv Detail & Related papers (2025-05-15T03:24:28Z)
- HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter [19.557300178619382]
We propose a novel Heterogeneous Graph Adapter to achieve tuning VLMs for the downstream tasks.
We employ a specific Heterogeneous Graph Neural Network to excavate multi-modality structure knowledge for the downstream tasks.
Experimental results on 11 benchmark datasets demonstrate the effectiveness and benefits of the proposed HeGraphAdapter.
arXiv Detail & Related papers (2024-10-10T12:20:58Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
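The summary names MoE adapters and a distribution-based selector without their internals; the sketch below illustrates a generic mixture-of-experts adapter with a learned soft router, purely as an assumed illustration (the expert design, routing rule, and selector are not taken from the paper).

```python
import torch
import torch.nn as nn


class BottleneckExpert(nn.Module):
    """One low-rank adapter expert (assumed design)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEAdapter(nn.Module):
    """Sketch: blend several adapter experts with a learned router, added residually."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(BottleneckExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) features from a frozen CLIP encoder
        gate = self.router(x).softmax(dim=-1)                          # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        mixed = (gate.unsqueeze(-1) * expert_out).sum(dim=1)           # soft mixture
        return x + mixed  # residual adaptation; sparse top-k routing is a common variant
```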
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [63.81641578763094]
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs).
We propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge.
In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively.
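The summary states that the graph's nodes are classes and its edges encode their correlations in each modality, but not how the graph is built or fused; the sketch below uses cosine similarity between class prototypes for both sub-graphs and one round of propagation, as a generic, assumed illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def class_correlation_graph(prototypes: torch.Tensor) -> torch.Tensor:
    """Adjacency from cosine similarity between class prototypes (one modality)."""
    p = F.normalize(prototypes, dim=-1)      # (num_classes, dim)
    adj = p @ p.t()                          # class-class correlations
    return adj.softmax(dim=-1)               # row-normalized edge weights


class DualGraphAdapter(nn.Module):
    """Sketch: smooth class embeddings over textual and visual class sub-graphs."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.alpha = alpha                   # blend between the two sub-graphs

    def forward(self, text_emb, text_protos, visual_protos):
        # text_emb: (num_classes, dim) class embeddings to be adapted
        a_text = class_correlation_graph(text_protos)
        a_vis = class_correlation_graph(visual_protos)
        adj = self.alpha * a_text + (1.0 - self.alpha) * a_vis
        propagated = adj @ self.proj(text_emb)   # one round of graph propagation
        return F.normalize(text_emb + propagated, dim=-1)
```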
arXiv Detail & Related papers (2023-09-24T12:56:40Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
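The exact forms of the three training signals are not given here; the sketch below combines a generic InfoNCE-style vision contrastive loss, a cross-modal contrastive loss, and a soft-label distillation term, with the formulations and weights assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07):
    """Generic InfoNCE: the i-th query should match the i-th key."""
    logits = F.normalize(queries, dim=-1) @ F.normalize(keys, dim=-1).t() / temperature
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def sgva_style_loss(visual_feats, visual_feats_aug, text_feats,
                    student_logits, teacher_logits,
                    w_vis=1.0, w_cross=1.0, w_kd=1.0, kd_temp=4.0):
    """Sketch: vision-specific contrastive + cross-modal contrastive + distillation."""
    l_vis = info_nce(visual_feats, visual_feats_aug)    # two views of the same images
    l_cross = info_nce(visual_feats, text_feats)        # image-text alignment
    l_kd = F.kl_div(                                     # soften and match logits
        F.log_softmax(student_logits / kd_temp, dim=-1),
        F.softmax(teacher_logits / kd_temp, dim=-1),
        reduction="batchmean",
    ) * (kd_temp ** 2)
    return w_vis * l_vis + w_cross * l_cross + w_kd * l_kd
```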
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
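The described design is often written as blending a bottleneck MLP's output with the original frozen CLIP feature; a minimal sketch of that reading follows, with the bottleneck width and mixing ratio chosen for illustration.

```python
import torch
import torch.nn as nn


class CLIPAdapterHead(nn.Module):
    """Sketch: bottleneck MLP on frozen CLIP features with residual-style blending."""

    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio                # how much of the new feature to blend in

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        # clip_feat: (batch, dim) image (or text) feature from the frozen encoder
        new_feat = self.adapter(clip_feat)
        return self.ratio * new_feat + (1.0 - self.ratio) * clip_feat
```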
arXiv Detail & Related papers (2021-10-09T11:39:30Z)