CLIP-Adapter: Better Vision-Language Models with Feature Adapters
- URL: http://arxiv.org/abs/2110.04544v1
- Date: Sat, 9 Oct 2021 11:39:30 GMT
- Title: CLIP-Adapter: Better Vision-Language Models with Feature Adapters
- Authors: Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng
Zhang, Hongsheng Li, Yu Qiao
- Abstract summary: We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
- Score: 79.52844563138493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale contrastive vision-language pre-training has shown significant
progress in visual representation learning. Unlike traditional visual systems
trained on a fixed set of discrete labels, a new paradigm was introduced by CLIP
(Radford et al., 2021) to directly learn to align images with raw texts in an
open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is
employed to make zero-shot predictions. To avoid non-trivial prompt engineering,
context optimization (CoOp, Zhou et al., 2021) has been proposed to learn
continuous vectors as task-specific prompts with few-shot training examples. In
this paper, we show that there is an alternative path to achieve better
vision-language models other than prompt tuning. While prompt tuning targets the
textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature
adapters on either the visual or the language branch. Specifically, CLIP-Adapter
adopts an additional bottleneck layer to learn new features and performs
residual-style feature blending with the original pre-trained features. As a
consequence, CLIP-Adapter outperforms context optimization while maintaining a
simple design. Experiments and extensive ablation studies on various visual
classification tasks demonstrate the effectiveness of our approach.
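
The adapter design described in the abstract (a bottleneck layer plus residual-style blending with the frozen pre-trained features) is compact enough to sketch. Below is a minimal, illustrative PyTorch sketch under those assumptions; the hyperparameter names and default values (reduction factor, mixing coefficient alpha, temperature) are ours for illustration and are not taken from the paper, and the CLIP backbone is assumed to stay frozen.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter with residual-style feature blending.

    A sketch of the idea in the abstract: learn new features with a small
    bottleneck MLP, then mix them back with the original frozen features.
    Hyperparameter names (reduction, alpha) are illustrative.
    """

    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        # Bottleneck: project down, apply a nonlinearity, project back up.
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Residual-style blending of the newly learned features with the
        # original pre-trained features.
        adapted = self.fc(features)
        return self.alpha * adapted + (1.0 - self.alpha) * features


def classify(image_features, text_features, adapter, temperature=100.0):
    """Usage sketch: adapt frozen CLIP image features, then classify by
    cosine similarity against the (frozen) text embeddings. The feature
    tensors are assumed to come from a frozen CLIP backbone."""
    img = adapter(image_features)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = text_features / text_features.norm(dim=-1, keepdim=True)
    return (temperature * img @ txt.t()).softmax(dim=-1)
```

Only the adapter parameters are trained in this sketch, which keeps the few-shot fine-tuning lightweight relative to updating the full vision-language model.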
Related papers
- In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model [13.983810804606264]
We propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks.
InCPL associates a new test sample with very few labeled examples as context information.
We introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples.
arXiv Detail & Related papers (2024-03-10T08:15:51Z) - Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z) - VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z)