Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- URL: http://arxiv.org/abs/2207.09519v1
- Date: Tue, 19 Jul 2022 19:12:11 GMT
- Title: Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- Authors: Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
- Abstract summary: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods propose fine-tuning additional learnable modules.
We propose a training-free adaptation method for CLIP for few-shot classification, termed Tip-Adapter.
- Score: 58.06983806317233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text
pairs. It shows impressive performance on downstream tasks by zero-shot
knowledge transfer. To further enhance CLIP's adaptation capability, existing
methods propose fine-tuning additional learnable modules, which significantly
improves few-shot performance but introduces extra training time and
computational cost. In this paper, we propose a training-free adaptation
method for CLIP for few-shot classification, termed Tip-Adapter,
which not only inherits the training-free advantage of zero-shot CLIP but also
performs comparably to those training-required approaches. Tip-Adapter
constructs the adapter via a key-value cache model from the few-shot training
set, and updates the prior knowledge encoded in CLIP by feature retrieval. On
top of that, the performance of Tip-Adapter can be further boosted to be
state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$
fewer epochs than existing methods, which is both effective and efficient. We
conduct extensive few-shot classification experiments on 11 datasets to
demonstrate the superiority of the proposed methods. Code is released at
https://github.com/gaopengcuhk/Tip-Adapter.
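The cache-model inference described in the abstract fits in a few lines. Below is a minimal PyTorch sketch consistent with that description: cached keys are the L2-normalized CLIP features of the few-shot images, cached values are their one-hot labels, and `alpha` (residual blending) and `beta` (affinity sharpness) follow the paper's hyperparameters. Variable names and default values here are illustrative; the linked repository is the authoritative implementation.

```python
import torch

def tip_adapter_logits(image_feat, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Training-free Tip-Adapter inference (illustrative sketch).

    image_feat:   (N, D) L2-normalized CLIP features of test images
    cache_keys:   (K, D) L2-normalized CLIP features of the few-shot training set
    cache_values: (K, C) one-hot labels of the few-shot training set
    clip_weights: (D, C) L2-normalized CLIP text features, one column per class
    """
    # Feature retrieval: cosine affinities between queries and cached keys.
    affinity = image_feat @ cache_keys.t()                              # (N, K)
    # Sharpen affinities and aggregate the cached one-hot labels (the values).
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values  # (N, C)
    # Zero-shot CLIP prediction from the frozen text classifier.
    clip_logits = 100.0 * image_feat @ clip_weights                    # (N, C)
    # Residual blend: few-shot knowledge updates CLIP's prior.
    return clip_logits + alpha * cache_logits
```

The fine-tuned variant, Tip-Adapter-F, keeps this exact computation but makes `cache_keys` a learnable parameter, which is why it converges in far fewer epochs than training an adapter from scratch.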
Related papers
- ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models [8.66217922377209]
Contrastive Language-Image Pretraining (CLIP) has seen widespread application in various visual downstream tasks.
In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters.
We propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space using CLIP as a base learner (see the kernel sketch after this entry).
arXiv Detail & Related papers (2025-01-19T21:25:53Z)
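To illustrate the kernel view this entry mentions, here is a generic proximal kernel ridge regression sketch that corrects a zero-shot base predictor with a kernel-smoothed residual fitted on the few-shot set. It is a textbook construction under assumed choices (Gaussian kernel, ridge regularizer, `zero_shot_fn` returning class probabilities), not ProKeR's exact estimator; `lam` and `gamma` are illustrative.

```python
import torch

def proximal_kernel_predict(test_feat, train_feat, train_labels_onehot,
                            zero_shot_fn, lam=1.0, gamma=10.0):
    """Proximal kernel ridge regression toward a zero-shot base learner
    (generic sketch of the kernel view, not ProKeR's exact estimator)."""
    def rbf(a, b):
        # Gaussian kernel on L2-normalized CLIP features.
        return torch.exp(-gamma * torch.cdist(a, b).pow(2))

    # Fit a kernel-smoothed correction to the zero-shot residuals.
    gram = rbf(train_feat, train_feat)                            # (K, K)
    residual = train_labels_onehot - zero_shot_fn(train_feat)     # (K, C)
    coef = torch.linalg.solve(gram + lam * torch.eye(gram.shape[0]), residual)
    # Prediction = zero-shot prior + correction anchored on the few-shot set.
    return zero_shot_fn(test_feat) + rbf(test_feat, train_feat) @ coef
```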
- IDEA: Image Description Enhanced CLIP-Adapter [23.446016867479138]
We propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks.
IDEA captures fine-grained features by leveraging both visual features and textual descriptions of images.
As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets.
arXiv Detail & Related papers (2025-01-15T14:12:59Z)
- Adapter-Enhanced Semantic Prompting for Continual Learning [91.63494614012362]
Continual learning (CL) enables models to adapt to evolving data streams.
Traditional methods usually retain past data for replay or add extra branches to the model to learn new knowledge.
We propose a novel lightweight CL framework, which integrates prompt tuning and adapter techniques.
arXiv Detail & Related papers (2024-12-15T06:14:55Z)
- Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia [45.93202559299953]
This paper introduces an alternative way to adapt CLIP without adding 'external' parameters to optimize.
We find that simply fine-tuning the last projection matrix of the vision encoder outperforms all baselines.
This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
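The recipe above is easy to state in code. A minimal sketch for the ViT variants of the OpenAI CLIP package follows, where `model.visual.proj` is the final visual projection matrix; the model choice and learning rate are illustrative assumptions, not the paper's exact configuration.

```python
import clip   # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch

model, preprocess = clip.load("ViT-B/16", device="cpu")

# Freeze the entire model, then unfreeze only the last visual projection matrix.
for p in model.parameters():
    p.requires_grad = False
model.visual.proj.requires_grad = True   # the ViT's final (width, embed_dim) projection

# A standard few-shot fine-tuning loop then updates this single matrix,
# e.g. with cross-entropy against the frozen text classifier.
optimizer = torch.optim.AdamW([model.visual.proj], lr=1e-4)
```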
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [78.62723847797382]
We propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter.
We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter.
arXiv Detail & Related papers (2021-11-06T18:09:22Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to better vision-language models than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch (see the adapter sketch after this entry).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
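As a companion to the Tip-Adapter sketch above, here is a minimal sketch of the residual feature adapter that CLIP-Adapter describes: a small bottleneck MLP on the frozen CLIP features, blended back with a residual ratio. The bottleneck width and `ratio` value here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ClipAdapter(nn.Module):
    """Residual feature adapter in the style of CLIP-Adapter (sketch)."""

    def __init__(self, dim: int = 512, bottleneck: int = 128, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        # Small bottleneck MLP applied to frozen CLIP features.
        self.fc = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps most of the original CLIP feature.
        return self.ratio * self.fc(feat) + (1.0 - self.ratio) * feat
```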