Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language
Modeling
- URL: http://arxiv.org/abs/2111.03930v1
- Date: Sat, 6 Nov 2021 18:09:22 GMT
- Title: Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language
Modeling
- Authors: Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng
Dai, Yu Qiao, Hongsheng Li
- Abstract summary: We propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably to or even better than CLIP-Adapter.
We conduct extensive experiments on few-shot classification on ImageNet and 10 other datasets to demonstrate the superiority of the proposed Tip-Adapter.
- Score: 78.62723847797382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations via contrastive training on
large-scale image-text pairs. It shows impressive performance on zero-shot
knowledge
transfer to downstream tasks. To further enhance CLIP's few-shot capability,
CLIP-Adapter proposed fine-tuning a lightweight residual feature adapter, which
significantly improves few-shot classification performance. However,
such a process still needs extra training and computational resources. In this
paper, we propose \textbf{T}raining-Free CL\textbf{IP}-\textbf{Adapter}
(\textbf{Tip-Adapter}), which not only inherits CLIP's training-free advantage
but also performs comparably to or even better than CLIP-Adapter. Tip-Adapter
does not require any backpropagation to train the adapter; instead, it creates
the adapter weights from a key-value cache model constructed from the few-shot
training set. In this non-parametric manner, Tip-Adapter acquires
well-performing adapter weights without any training, which is both efficient
and effective. Moreover, the performance of Tip-Adapter can be further boosted
by fine-tuning the properly initialized adapter for only a few epochs, with
very fast convergence. We conduct extensive experiments on few-shot
classification on ImageNet and 10 other datasets to demonstrate the
superiority of the proposed Tip-Adapter.
The code will be released at \url{https://github.com/gaopengcuhk/Tip-Adapter}.
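For intuition, the following is a minimal NumPy sketch of the key-value cache model described above, assuming the paper's formulation: the keys are L2-normalized CLIP image features of the few-shot training images, the values are their one-hot labels, and the resulting cache logits are blended with CLIP's zero-shot logits. The feature dimension, the random stand-in features, and the hyperparameters alpha and beta below are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the key-value cache adapter described in the abstract.
# Random vectors stand in for real CLIP image features and text-classifier
# weights; alpha, beta, and all sizes are illustrative assumptions.
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def build_cache(train_features, train_labels, num_classes):
    """Keys: normalized few-shot features (N*K, C); values: one-hot labels."""
    keys = l2_normalize(train_features)
    values = np.eye(num_classes)[train_labels]
    return keys, values


def tip_adapter_logits(test_features, keys, values, clip_text_weights,
                       alpha=1.0, beta=5.5):
    """Blend non-parametric cache logits with CLIP's zero-shot logits."""
    feats = l2_normalize(test_features)                # (B, C)
    affinity = np.exp(-beta * (1.0 - feats @ keys.T))  # (B, N*K) similarities
    cache_logits = affinity @ values                   # (B, num_classes)
    clip_logits = feats @ clip_text_weights.T          # zero-shot prediction
    return alpha * cache_logits + clip_logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_classes, shots = 512, 10, 16
    train_feats = rng.normal(size=(num_classes * shots, dim))
    train_labels = np.repeat(np.arange(num_classes), shots)
    text_weights = l2_normalize(rng.normal(size=(num_classes, dim)))

    keys, values = build_cache(train_feats, train_labels, num_classes)
    logits = tip_adapter_logits(rng.normal(size=(4, dim)), keys, values,
                                text_weights)
    print(logits.argmax(axis=1))  # predicted classes for 4 dummy test images
```

In the fine-tuned variant mentioned above, the "properly initialized adapter" corresponds to treating the cache keys as learnable parameters initialized from these few-shot features, which is why only a few epochs are needed.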
Related papers
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- MerA: Merging Pretrained Adapters For Few-Shot Learning [71.44422347502409]
We propose Merging Pretrained Adapters (MerA), which efficiently incorporates pretrained adapters into a single model through model fusion.
Experiments on two PLMs demonstrate that MerA achieves substantial improvements compared to both single adapters and AdapterFusion.
arXiv Detail & Related papers (2023-08-30T12:10:17Z)
- SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters [96.52807311742198]
We re-examine the parameter-efficiency of Adapters through the lens of network pruning.
We find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80%.
arXiv Detail & Related papers (2022-10-09T15:28:48Z)
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image-text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods proposed to fine-tune additional learnable modules.
We propose a training-free adaptation method for CLIP to conduct few-shot classification, termed Tip-Adapter.
arXiv Detail & Related papers (2022-07-19T19:12:11Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch (a minimal sketch of such a feature adapter follows this list).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
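For reference, here is a minimal sketch of the lightweight residual-style feature adapter that CLIP-Adapter fine-tunes and that Meta-Adapter refines online, as mentioned in the list above. The bottleneck width, residual ratio, and random initialization are illustrative assumptions rather than the authors' implementation; in practice the adapter is applied to features produced by a frozen CLIP encoder.

```python
# Minimal sketch of a residual feature adapter: a small bottleneck MLP whose
# output is blended with the original (frozen) CLIP feature. The bottleneck
# width and residual ratio below are illustrative assumptions.
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class ResidualFeatureAdapter:
    """Two-layer bottleneck MLP blended with the input feature."""

    def __init__(self, dim, bottleneck=64, ratio=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=dim ** -0.5, size=(dim, bottleneck))
        self.w2 = rng.normal(scale=bottleneck ** -0.5, size=(bottleneck, dim))
        self.ratio = ratio

    def __call__(self, features):
        adapted = relu(features @ self.w1) @ self.w2
        # The residual blend keeps most of the pretrained feature intact,
        # which is what makes the adapter lightweight to fine-tune.
        return self.ratio * adapted + (1.0 - self.ratio) * features


adapter = ResidualFeatureAdapter(dim=512)
features = np.random.default_rng(1).normal(size=(4, 512))
print(adapter(features).shape)  # (4, 512), same shape as the input features
```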