Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- URL: http://arxiv.org/abs/2207.09519v1
- Date: Tue, 19 Jul 2022 19:12:11 GMT
- Title: Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification
- Authors: Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
- Abstract summary: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods propose fine-tuning additional learnable modules.
We propose a training-free adaptation method for CLIP for few-shot classification, termed Tip-Adapter.
- Score: 58.06983806317233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text
pairs. It shows impressive performance on downstream tasks by zero-shot
knowledge transfer. To further enhance CLIP's adaptation capability, existing
methods propose fine-tuning additional learnable modules, which significantly
improves few-shot performance but introduces extra training time and
computational cost. In this paper, we propose a training-free adaptation
method for CLIP for few-shot classification, termed Tip-Adapter,
which not only inherits the training-free advantage of zero-shot CLIP but also
performs comparably to those training-required approaches. Tip-Adapter
constructs the adapter via a key-value cache model from the few-shot training
set, and updates the prior knowledge encoded in CLIP by feature retrieval. On
top of that, the performance of Tip-Adapter can be further boosted to be
state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$
fewer epochs than existing methods, which is both effective and efficient. We
conduct extensive few-shot classification experiments on 11 datasets to
demonstrate the superiority of the proposed methods. Code is released at
https://github.com/gaopengcuhk/Tip-Adapter.
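The cache-model inference described in the abstract fits in a few lines. Below is a minimal PyTorch sketch consistent with that description: cached keys are the L2-normalized CLIP features of the few-shot images, cached values are their one-hot labels, and `alpha` (residual blending) and `beta` (affinity sharpness) follow the paper's hyperparameters. Variable names and default values here are illustrative; the linked repository is the authoritative implementation.

```python
import torch

def tip_adapter_logits(image_feat, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Training-free Tip-Adapter inference (illustrative sketch).

    image_feat:   (N, D) L2-normalized CLIP features of test images
    cache_keys:   (K, D) L2-normalized CLIP features of the few-shot training set
    cache_values: (K, C) one-hot labels of the few-shot training set
    clip_weights: (D, C) L2-normalized CLIP text features, one column per class
    """
    # Feature retrieval: cosine affinities between queries and cached keys.
    affinity = image_feat @ cache_keys.t()                              # (N, K)
    # Sharpen affinities and aggregate the cached one-hot labels (the values).
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values  # (N, C)
    # Zero-shot CLIP prediction from the frozen text classifier.
    clip_logits = 100.0 * image_feat @ clip_weights                    # (N, C)
    # Residual blend: few-shot knowledge updates CLIP's prior.
    return clip_logits + alpha * cache_logits
```

The fine-tuned variant, Tip-Adapter-F, keeps this exact computation but makes `cache_keys` a learnable parameter, which is why it converges in far fewer epochs than training an adapter from scratch.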
Related papers
- ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models [8.66217922377209]
Contrastive Language-Image Pretraining (CLIP) has seen widespread application in various visual downstream tasks.
In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters.
We propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space using CLIP as a base learner (see the kernel sketch after this entry).
arXiv Detail & Related papers (2025-01-19T21:25:53Z)
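To illustrate the kernel view this entry mentions, here is a generic proximal kernel ridge regression sketch that corrects a zero-shot base predictor with a kernel-smoothed residual fitted on the few-shot set. It is a textbook construction under assumed choices (Gaussian kernel, ridge regularizer, `zero_shot_fn` returning class probabilities), not ProKeR's exact estimator; `lam` and `gamma` are illustrative.

```python
import torch

def proximal_kernel_predict(test_feat, train_feat, train_labels_onehot,
                            zero_shot_fn, lam=1.0, gamma=10.0):
    """Proximal kernel ridge regression toward a zero-shot base learner
    (generic sketch of the kernel view, not ProKeR's exact estimator)."""
    def rbf(a, b):
        # Gaussian kernel on L2-normalized CLIP features.
        return torch.exp(-gamma * torch.cdist(a, b).pow(2))

    # Fit a kernel-smoothed correction to the zero-shot residuals.
    gram = rbf(train_feat, train_feat)                            # (K, K)
    residual = train_labels_onehot - zero_shot_fn(train_feat)     # (K, C)
    coef = torch.linalg.solve(gram + lam * torch.eye(gram.shape[0]), residual)
    # Prediction = zero-shot prior + correction anchored on the few-shot set.
    return zero_shot_fn(test_feat) + rbf(test_feat, train_feat) @ coef
```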
- IDEA: Image Description Enhanced CLIP-Adapter [23.446016867479138]
We propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks.
IDEA captures fine-grained features by leveraging both visual features and textual descriptions of images.
As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets.
arXiv Detail & Related papers (2025-01-15T14:12:59Z)
- Adapter-Enhanced Semantic Prompting for Continual Learning [91.63494614012362]
Continual learning (CL) enables models to adapt to evolving data streams.
Traditional methods usually retain past data for replay or add extra branches to the model to learn new knowledge.
We propose a novel lightweight CL framework, which integrates prompt tuning and adapter techniques.
arXiv Detail & Related papers (2024-12-15T06:14:55Z)
- Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia [45.93202559299953]
This paper introduces an alternative way to adapt CLIP without adding 'external' parameters to optimize.
We find that simply fine-tuning the last projection matrix of the vision encoder outperforms all baselines.
This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
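The recipe above is easy to state in code. A minimal sketch for the ViT variants of the OpenAI CLIP package follows, where `model.visual.proj` is the final visual projection matrix; the model choice and learning rate are illustrative assumptions, not the paper's exact configuration.

```python
import clip   # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch

model, preprocess = clip.load("ViT-B/16", device="cpu")

# Freeze the entire model, then unfreeze only the last visual projection matrix.
for p in model.parameters():
    p.requires_grad = False
model.visual.proj.requires_grad = True   # the ViT's final (width, embed_dim) projection

# A standard few-shot fine-tuning loop then updates this single matrix,
# e.g. with cross-entropy against the frozen text classifier.
optimizer = torch.optim.AdamW([model.visual.proj], lr=1e-4)
```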
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [78.62723847797382]
We propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter.
We conduct extensive few-shot classification experiments on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter.
arXiv Detail & Related papers (2021-11-06T18:09:22Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to better vision-language models than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch (see the adapter sketch after this entry).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
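As a companion to the Tip-Adapter sketch above, here is a minimal sketch of the residual feature adapter that CLIP-Adapter describes: a small bottleneck MLP on the frozen CLIP features, blended back with a residual ratio. The bottleneck width and `ratio` value here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ClipAdapter(nn.Module):
    """Residual feature adapter in the style of CLIP-Adapter (sketch)."""

    def __init__(self, dim: int = 512, bottleneck: int = 128, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        # Small bottleneck MLP applied to frozen CLIP features.
        self.fc = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps most of the original CLIP feature.
        return self.ratio * self.fc(feat) + (1.0 - self.ratio) * feat
```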