Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- URL: http://arxiv.org/abs/2207.09519v1
- Date: Tue, 19 Jul 2022 19:12:11 GMT
- Title: Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- Authors: Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
- Abstract summary: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaption capability, existing methods proposed to fine-tune additional learnable modules.
We propose a training-free adaption method for CLIP to conduct few-shot classification, termed as Tip-Adapter.
- Score: 58.06983806317233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text
pairs. It shows impressive performance on downstream tasks by zero-shot
knowledge transfer. To further enhance CLIP's adaption capability, existing
methods proposed to fine-tune additional learnable modules, which significantly
improves the few-shot performance but introduces extra training time and
computational resources. In this paper, we propose a training-free adaption
method for CLIP to conduct few-shot classification, termed as Tip-Adapter,
which not only inherits the training-free advantage of zero-shot CLIP but also
performs comparably to those training-required approaches. Tip-Adapter
constructs the adapter via a key-value cache model from the few-shot training
set, and updates the prior knowledge encoded in CLIP by feature retrieval. On
top of that, the performance of Tip-Adapter can be further boosted to be
state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$
fewer epochs than existing methods, which is both effective and efficient. We
conduct extensive experiments of few-shot classification on 11 datasets to
demonstrate the superiority of our proposed methods. Code is released at
https://github.com/gaopengcuhk/Tip-Adapter.
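As a concrete illustration of the key-value cache and feature-retrieval idea described in the abstract, the sketch below blends CLIP's zero-shot logits with logits retrieved from a cache built on few-shot features. It is a minimal sketch, assuming pre-extracted, L2-normalized features; the blend weight `alpha`, sharpness `beta`, and toy tensor shapes are illustrative assumptions rather than the paper's exact recipe (see the released code for that).
```python
# Minimal sketch of the key-value cache idea described in the abstract.
# `alpha`, `beta`, and all shapes below are illustrative assumptions.
import torch

def tip_adapter_logits(test_feats, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Combine CLIP's zero-shot prediction with few-shot cache retrieval.

    test_feats:   (B, D)   L2-normalized image features of test samples
    cache_keys:   (N*K, D) L2-normalized features of the few-shot training set
    cache_values: (N*K, C) one-hot labels of the few-shot training set
    clip_weights: (D, C)   text-derived classifier weights (one prompt per class)
    """
    # Zero-shot CLIP logits from image-text feature similarity.
    clip_logits = 100.0 * test_feats @ clip_weights
    # Feature retrieval: affinities between test features and cached keys,
    # mapped to non-negative weights with an exponential activation.
    affinity = test_feats @ cache_keys.t()                      # (B, N*K)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    # Residual blend of few-shot knowledge and CLIP's prior knowledge.
    return clip_logits + alpha * cache_logits

if __name__ == "__main__":
    B, D, N, K, C = 4, 512, 10, 16, 10   # toy sizes: N classes, K shots each
    norm = lambda x: x / x.norm(dim=-1, keepdim=True)
    test_feats = norm(torch.randn(B, D))
    cache_keys = norm(torch.randn(N * K, D))
    cache_values = torch.eye(C).repeat_interleave(K, dim=0)    # (N*K, C)
    clip_weights = norm(torch.randn(C, D)).t()                 # (D, C)
    print(tip_adapter_logits(test_feats, cache_keys, cache_values,
                             clip_weights).shape)              # (B, C)
```
Because the cache keys and values are built directly from the few-shot features and labels, this classifier requires no training; fine-tuning, as described above, would treat the cache keys as learnable parameters.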
Related papers
- Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia [45.93202559299953]
This paper introduces an alternative way for CLIP adaptation without adding 'external' parameters to optimize.
We find that simply fine-tuning the last projection matrix of the vision encoder leads to strong performance compared to existing baselines.
Perhaps surprisingly, this approach, coined ProLIP, yields performance on par with or better than the state of the art on 11 few-shot classification benchmarks.
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [24.108008515395458]
We propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency.
For average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results and outperform the second-best by +1.59% and +1.99%, respectively, under 16 shots with 30x fewer learnable parameters.
arXiv Detail & Related papers (2023-04-03T17:58:54Z)
- Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [78.62723847797382]
We propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably or even better than CLIP-Adapter.
We conduct extensive experiments of few-shot classification on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter.
arXiv Detail & Related papers (2021-11-06T18:09:22Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)