Unsupervised Prototype Adapter for Vision-Language Models
- URL: http://arxiv.org/abs/2308.11507v2
- Date: Fri, 25 Aug 2023 00:07:50 GMT
- Title: Unsupervised Prototype Adapter for Vision-Language Models
- Authors: Yi Zhang, Ce Zhang, Xueting Hu, Zhihai He
- Abstract summary: We design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter).
Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class.
After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks.
- Score: 29.516767588241724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale pre-trained vision-language models (e.g. CLIP and
ALIGN) have demonstrated remarkable effectiveness in acquiring transferable
visual representations. To leverage the valuable knowledge encoded within these
models for downstream tasks, several fine-tuning approaches, including prompt
tuning methods and adapter-based methods, have been developed to adapt
vision-language models effectively with supervision. However, these methods
rely on the availability of annotated samples, which can be labor-intensive and
time-consuming to acquire, thus limiting scalability. To address this issue, in
this work, we design an unsupervised fine-tuning approach for vision-language
models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for
the unannotated target datasets, we leverage the text-image aligning capability
of CLIP to automatically select the most confident samples for each class.
Utilizing these selected samples, we generate class prototypes, which serve as
the initialization for the learnable prototype model. After fine-tuning, the
prototype model prediction is combined with the original CLIP's prediction by a
residual connection to perform downstream recognition tasks. Our extensive
experimental results on image recognition and domain generalization show that
the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and
the state-of-the-art UPL method by large margins.
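The pipeline described in the abstract (confident-sample selection with zero-shot CLIP, prototype initialization of the adapter, and a residual combination with CLIP's prediction) can be sketched in a few lines of Python. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the placeholder encoders `encode_images` and `encode_prompts`, the number of selected samples per class `top_k`, and the residual weight `alpha` are illustrative choices.

```python
# Minimal sketch of the UP-Adapter idea from the abstract (not the authors' code).
# Assumes CLIP-style encoders that map images / class-name prompts to
# L2-normalized d-dimensional features; the encoders below are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d, num_classes, top_k = 512, 10, 16      # feature dim, classes, confident samples kept per class

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# --- placeholders for CLIP encoders (replace with a real CLIP model in practice) ---
def encode_images(images):               # images -> (n, d) normalized features
    return l2norm(rng.normal(size=(len(images), d)))

def encode_prompts(class_names):         # one prompt per class -> (C, d) normalized features
    return l2norm(rng.normal(size=(len(class_names), d)))

# 1) Pseudo-label unannotated images with zero-shot CLIP and keep the most
#    confident samples per class (text-image alignment used as confidence).
unlabeled_images = list(range(1000))     # stand-in for an unannotated target dataset
img_feats = encode_images(unlabeled_images)                              # (N, d)
txt_feats = encode_prompts([f"class_{c}" for c in range(num_classes)])   # (C, d)
zero_shot_logits = img_feats @ txt_feats.T                               # cosine similarities
pseudo_labels = zero_shot_logits.argmax(1)
confidence = zero_shot_logits.max(1)

# 2) Build class prototypes from the selected samples; they initialize the
#    learnable prototype model (here reduced to a linear classifier over features).
prototypes = np.zeros((num_classes, d))
for c in range(num_classes):
    idx = np.where(pseudo_labels == c)[0]
    idx = idx[np.argsort(-confidence[idx])][:top_k]   # top-K most confident per class
    prototypes[c] = l2norm(img_feats[idx].mean(0))
W = prototypes.copy()                    # prototype-model weights (fine-tuning step omitted here)

# 3) At inference, combine the prototype-model prediction with the original
#    CLIP prediction through a residual connection.
alpha = 1.0                              # residual blending weight (illustrative value)
test_feats = encode_images(list(range(8)))
logits = test_feats @ txt_feats.T + alpha * (test_feats @ W.T)
print(logits.argmax(1))
```

In the paper the prototype model is fine-tuned on the pseudo-labeled samples before the residual fusion; the sketch leaves `W` at its prototype initialization for brevity.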
Related papers
- Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning [16.998833621046117]
We propose the Test-Time Distribution LearNing Adapter (TT-DNA), which operates directly at test time.
Specifically, we estimate Gaussian distributions to model visual features of the few-shot support images to capture the knowledge from the support set.
Our extensive experimental results on visual reasoning for human object interaction demonstrate that our proposed TT-DNA outperforms existing state-of-the-art methods by large margins.
arXiv Detail & Related papers (2024-03-10T01:34:45Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- Knowledge Distillation to Ensemble Global and Interpretable Prototype-Based Mammogram Classification Models [20.16068689434846]
We propose BRAIxProtoPNet++, which adds interpretability to a global model by ensembling it with a prototype-based model.
We show that BRAIxProtoPNet++ has higher classification accuracy than SOTA global and prototype-based models.
arXiv Detail & Related papers (2022-09-26T05:04:15Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)