A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
- URL: http://arxiv.org/abs/2602.16590v1
- Date: Wed, 18 Feb 2026 16:41:32 GMT
- Title: A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
- Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
- Abstract summary: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. We propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost.
- Score: 0.7746379804154433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
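The core architectural idea is compact enough to sketch. Below is a minimal, hypothetical PyTorch illustration of a bottleneck adapter with multi-head self-attention over frozen CLIP patch tokens; the module names, dimensions, pooling choice, and classification head are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MHAdapterSketch(nn.Module):
    """Illustrative adapter: bottleneck MLP plus multi-head self-attention
    over frozen CLIP patch tokens, with a linear head for one attribute task."""

    def __init__(self, clip_dim=768, bottleneck_dim=128, num_heads=4, num_classes=7):
        super().__init__()
        self.down = nn.Linear(clip_dim, bottleneck_dim)    # bottleneck down-projection
        self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck_dim, clip_dim)      # back to CLIP width
        self.norm = nn.LayerNorm(clip_dim)
        self.head = nn.Linear(clip_dim, num_classes)       # attribute classifier

    def forward(self, patch_tokens):                       # (B, N, clip_dim) from a frozen CLIP encoder
        z = torch.relu(self.down(patch_tokens))
        z, _ = self.attn(z, z, z)                          # model inter-patch dependencies
        adapted = self.norm(patch_tokens + self.up(z))     # residual adaptation
        return self.head(adapted.mean(dim=1))              # mean pooling is an assumption
```

Only the adapter and the linear head would be trained while the CLIP backbone stays frozen; the roughly 1.4 million trainable parameters reported in the abstract depend on dimensions not specified here.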
Related papers
- Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model [2.2099003320482393]
Attn-Adapter is a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
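As a loose illustration of the dual attention idea described above, the sketch below uses two generic cross-attention blocks: one lets class embeddings attend to support-set features (the "memory" role), the other lets the global image embedding attend to local patch features. Names, shapes, and the residual/cosine-logit choices are assumptions; the actual Attn-Adapter design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttnSketch(nn.Module):
    """Loose sketch of a dual attention adapter: support examples refine class
    embeddings; local patch features enrich the global image embedding."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.memory_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, class_emb, support_feats, global_emb, patch_feats):
        # class_emb (B, C, D) attends to support_feats (B, S, D)
        refined, _ = self.memory_attn(class_emb, support_feats, support_feats)
        class_emb = class_emb + refined
        # global_emb (B, 1, D) attends to patch_feats (B, N, D)
        enriched, _ = self.local_global_attn(global_emb, patch_feats, patch_feats)
        global_emb = global_emb + enriched
        # CLIP-style cosine-similarity logits
        img = F.normalize(global_emb, dim=-1)
        cls = F.normalize(class_emb, dim=-1)
        return torch.einsum("bqd,bcd->bqc", img, cls).squeeze(1)   # (B, C)
```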
arXiv Detail & Related papers (2025-09-04T05:42:02Z)
- Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning [21.093665370734684]
We develop LatHAdapter, a novel adapter for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of the downstream training data. The proposed LatHAdapter consistently outperforms many other fine-tuning approaches.
arXiv Detail & Related papers (2025-08-15T03:02:36Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods propose fine-tuning additional learnable modules.
We propose Tip-Adapter, a training-free method that adapts CLIP for few-shot classification.
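Tip-Adapter's training-free recipe is easy to reconstruct in outline: cache the few-shot CLIP image features as keys and their one-hot labels as values, then blend cache-based logits with CLIP's zero-shot logits. The snippet below is an illustrative sketch following that description; the variable names and the alpha/beta hyperparameters are assumptions.

```python
import torch

def tip_adapter_logits(test_feat, cache_keys, cache_values, clip_text_emb,
                       alpha=1.0, beta=5.5):
    """Training-free few-shot classification in the style of Tip-Adapter.

    test_feat:      (B, D)  L2-normalised CLIP image features of test images
    cache_keys:     (NK, D) L2-normalised CLIP features of the few-shot images
    cache_values:   (NK, C) one-hot labels of the few-shot images
    clip_text_emb:  (C, D)  L2-normalised CLIP text embeddings of class prompts
    """
    # Zero-shot CLIP logits from image-text similarity.
    zero_shot = 100.0 * test_feat @ clip_text_emb.t()
    # Affinity between test features and the cached few-shot keys.
    affinity = test_feat @ cache_keys.t()
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    # Residual blend of cached few-shot knowledge and the zero-shot prediction.
    return zero_shot + alpha * cache_logits
```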
arXiv Detail & Related papers (2022-07-19T19:12:11Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
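In spirit, the two supervision sources can be combined in a single training step: CLIP's zero-shot predictions supply confidence-filtered pseudo-labels, while masked raw images supply a self-supervised signal. The sketch below is a heavily simplified assumption of such a step (the student's API here is invented), not the exact MUST objective.

```python
import torch
import torch.nn.functional as F

def must_style_step(student, clip_zero_shot, images, mask_ratio=0.5, conf_thresh=0.7):
    """One illustrative step mixing CLIP pseudo-labels with a masked-image loss."""
    with torch.no_grad():
        probs = clip_zero_shot(images).softmax(dim=-1)   # zero-shot CLIP class probabilities
        conf, pseudo = probs.max(dim=-1)
        keep = conf > conf_thresh                        # keep only confident pseudo-labels

    # Assumed student API: returns class logits and a masked-image reconstruction loss.
    logits, mim_loss = student(images, mask_ratio=mask_ratio)
    cls_loss = F.cross_entropy(logits[keep], pseudo[keep]) if keep.any() else logits.new_zeros(())
    return cls_loss + mim_loss
```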
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [78.62723847797382]
We propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to, or even better than, CLIP-Adapter.
We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter.
arXiv Detail & Related papers (2021-11-06T18:09:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.