Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
- URL: http://arxiv.org/abs/2509.03895v1
- Date: Thu, 04 Sep 2025 05:42:02 GMT
- Title: Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
- Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
- Abstract summary: Attn-Adapter is a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
- Score: 2.2099003320482393
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
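The abstract describes the two adapters but not their implementation. The following is a minimal sketch, assuming a frozen CLIP backbone that exposes a global image embedding, local patch tokens, per-class text embeddings, and encoded support examples; module names, shapes, and the residual/normalization choices are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class MemoryAttnAdapter(nn.Module):
    """Sketch: refine class (text) embeddings by cross-attending over support features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, class_emb: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        # class_emb: (num_classes, dim), support_feats: (num_support, dim)
        q = class_emb.unsqueeze(0)          # (1, num_classes, dim)
        kv = support_feats.unsqueeze(0)     # (1, num_support, dim)
        refined, _ = self.attn(q, kv, kv)   # each class attends to the support set
        return self.norm(class_emb + refined.squeeze(0))


class LocalGlobalAttnAdapter(nn.Module):
    """Sketch: enrich the global image embedding by attending over local patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_emb: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # global_emb: (batch, dim), local_tokens: (batch, num_patches, dim)
        q = global_emb.unsqueeze(1)                              # (batch, 1, dim)
        enriched, _ = self.attn(q, local_tokens, local_tokens)
        return self.norm(global_emb + enriched.squeeze(1))


def classify(global_emb, local_tokens, class_emb, support_feats,
             mem_adapter, lg_adapter, temperature: float = 0.01):
    """Cosine-similarity logits between the adapted image and class embeddings."""
    img = lg_adapter(global_emb, local_tokens)
    cls = mem_adapter(class_emb, support_feats)
    img = img / img.norm(dim=-1, keepdim=True)
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return img @ cls.t() / temperature
```

Under this reading, only the two small attention modules adapt to a new support set, which is consistent with the claim of dynamic adaptation without retraining the base model.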
Related papers
- A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification [0.7746379804154433]
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction.
We propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck equipped with multi-head self-attention to model inter-patch dependencies.
With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost.
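The summary only specifies a bottleneck equipped with multi-head self-attention over patch tokens; a minimal sketch under that reading follows, with the dimensions, residual placement, and normalization assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn


class MHAdapter(nn.Module):
    """Sketch: bottleneck adapter with multi-head self-attention over patch tokens."""

    def __init__(self, dim: int = 768, bottleneck: int = 128, num_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from a frozen CLIP image encoder
        h = self.down(patch_tokens)
        h, _ = self.attn(h, h, h)                     # model inter-patch dependencies
        return self.norm(patch_tokens + self.up(h))   # residual back into CLIP's space
```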
arXiv Detail & Related papers (2026-02-18T16:41:32Z)
- Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning [21.093665370734684]
We develop a novel adapter for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks.
The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data.
The proposed LatHAdapter consistently outperforms many other fine-tuning approaches.
arXiv Detail & Related papers (2025-08-15T03:02:36Z)
- AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection [39.72202031440292]
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning.
Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images.
We present a simple yet effective method called AdaptCLIP based on two key insights.
arXiv Detail & Related papers (2025-05-15T03:24:28Z)
- HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter [19.557300178619382]
We propose a novel Heterogeneous Graph Adapter to achieve tuning VLMs for the downstream tasks.
We employ a specific Heterogeneous Graph Neural Network to excavate multi-modality structure knowledge for the downstream tasks.
Experimental results on 11 benchmark datasets demonstrate the effectiveness and benefits of the proposed HeGraphAdapter.
arXiv Detail & Related papers (2024-10-10T12:20:58Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
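The summary names MoE adapters and a distribution-based selector without their internals; the sketch below illustrates a generic mixture-of-experts adapter with a learned soft router, purely as an assumed illustration (the expert design, routing rule, and selector are not taken from the paper).

```python
import torch
import torch.nn as nn


class BottleneckExpert(nn.Module):
    """One low-rank adapter expert (assumed design)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEAdapter(nn.Module):
    """Sketch: blend several adapter experts with a learned router, added residually."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(BottleneckExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) features from a frozen CLIP encoder
        gate = self.router(x).softmax(dim=-1)                          # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        mixed = (gate.unsqueeze(-1) * expert_out).sum(dim=1)           # soft mixture
        return x + mixed  # residual adaptation; sparse top-k routing is a common variant
```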
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [63.81641578763094]
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs).
We propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge.
In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively.
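The summary states that the graph's nodes are classes and its edges encode their correlations in each modality, but not how the graph is built or fused; the sketch below uses cosine similarity between class prototypes for both sub-graphs and one round of propagation, as a generic, assumed illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def class_correlation_graph(prototypes: torch.Tensor) -> torch.Tensor:
    """Adjacency from cosine similarity between class prototypes (one modality)."""
    p = F.normalize(prototypes, dim=-1)      # (num_classes, dim)
    adj = p @ p.t()                          # class-class correlations
    return adj.softmax(dim=-1)               # row-normalized edge weights


class DualGraphAdapter(nn.Module):
    """Sketch: smooth class embeddings over textual and visual class sub-graphs."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.alpha = alpha                   # blend between the two sub-graphs

    def forward(self, text_emb, text_protos, visual_protos):
        # text_emb: (num_classes, dim) class embeddings to be adapted
        a_text = class_correlation_graph(text_protos)
        a_vis = class_correlation_graph(visual_protos)
        adj = self.alpha * a_text + (1.0 - self.alpha) * a_vis
        propagated = adj @ self.proj(text_emb)   # one round of graph propagation
        return F.normalize(text_emb + propagated, dim=-1)
```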
arXiv Detail & Related papers (2023-09-24T12:56:40Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
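The exact forms of the three training signals are not given here; the sketch below combines a generic InfoNCE-style vision contrastive loss, a cross-modal contrastive loss, and a soft-label distillation term, with the formulations and weights assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07):
    """Generic InfoNCE: the i-th query should match the i-th key."""
    logits = F.normalize(queries, dim=-1) @ F.normalize(keys, dim=-1).t() / temperature
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def sgva_style_loss(visual_feats, visual_feats_aug, text_feats,
                    student_logits, teacher_logits,
                    w_vis=1.0, w_cross=1.0, w_kd=1.0, kd_temp=4.0):
    """Sketch: vision-specific contrastive + cross-modal contrastive + distillation."""
    l_vis = info_nce(visual_feats, visual_feats_aug)    # two views of the same images
    l_cross = info_nce(visual_feats, text_feats)        # image-text alignment
    l_kd = F.kl_div(                                     # soften and match logits
        F.log_softmax(student_logits / kd_temp, dim=-1),
        F.softmax(teacher_logits / kd_temp, dim=-1),
        reduction="batchmean",
    ) * (kd_temp ** 2)
    return w_vis * l_vis + w_cross * l_cross + w_kd * l_kd
```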
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
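The described design is often written as blending a bottleneck MLP's output with the original frozen CLIP feature; a minimal sketch of that reading follows, with the bottleneck width and mixing ratio chosen for illustration.

```python
import torch
import torch.nn as nn


class CLIPAdapterHead(nn.Module):
    """Sketch: bottleneck MLP on frozen CLIP features with residual-style blending."""

    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio                # how much of the new feature to blend in

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        # clip_feat: (batch, dim) image (or text) feature from the frozen encoder
        new_feat = self.adapter(clip_feat)
        return self.ratio * new_feat + (1.0 - self.ratio) * clip_feat
```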
arXiv Detail & Related papers (2021-10-09T11:39:30Z)