CLIP model is an Efficient Continual Learner
- URL: http://arxiv.org/abs/2210.03114v1
- Date: Thu, 6 Oct 2022 17:59:15 GMT
- Title: CLIP model is an Efficient Continual Learner
- Authors: Vishal Thengane, Salman Khan, Munawar Hayat, Fahad Khan
- Abstract summary: We show that a frozen CLIP model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation).
We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks.
- Score: 26.835116431183625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The continual learning setting aims to learn new tasks over time without
forgetting the previous ones. The literature reports several significant
efforts to tackle this problem with limited or no access to previous task data.
Among such efforts, typical solutions offer sophisticated techniques involving
memory replay, knowledge distillation, model regularization, and dynamic
network expansion. The resulting methods have a retraining cost at each
learning task, dedicated memory requirements, and setting-specific design
choices. In this work, we show that a frozen CLIP (Contrastive Language-Image
Pretraining) model offers astounding continual learning performance without any
fine-tuning (zero-shot evaluation). We evaluate CLIP under a variety of
settings including class-incremental, domain-incremental and task-agnostic
incremental learning on five popular benchmarks (ImageNet-100 & 1K, CORe50,
CIFAR-100, and TinyImageNet). Without any bells and whistles, the CLIP model
outperforms the state-of-the-art continual learning approaches in the majority
of the settings. We show the effect of varying the text inputs with simple
prompt templates on the CLIP model's performance. To the best of our knowledge,
this is the first work to report CLIP's zero-shot performance in a continual
setting. We advocate the use of this strong yet embarrassingly simple baseline
for future comparisons in continual learning tasks.
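As a rough illustration of the zero-shot protocol described in the abstract, the sketch below evaluates a frozen CLIP model over a growing, class-incremental label set using a single prompt template. The Hugging Face checkpoint name, the toy two-task class splits, and the "a photo of a {}." template are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch: zero-shot, class-incremental evaluation with a frozen CLIP model.
# Assumptions (illustrative, not from the paper): the Hugging Face
# "openai/clip-vit-base-patch16" checkpoint, toy task splits, and a single
# prompt template "a photo of a {}."; the paper compares several templates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")


@torch.no_grad()
def zero_shot_predict(image, class_names):
    """Classify `image` over all classes seen so far using prompt-templated text inputs."""
    prompts = [f"a photo of a {name}." for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, num_classes): image-text similarity
    return class_names[logits.argmax(dim=-1).item()]


# Class-incremental setting: new classes arrive with each task, the label space
# grows, and the frozen model is never fine-tuned, replayed, or expanded.
tasks = [["goldfish", "tabby cat"], ["airliner", "pizza"]]  # hypothetical task splits
seen_classes = []
for task_id, new_classes in enumerate(tasks):
    seen_classes += new_classes
    image = Image.new("RGB", (224, 224))  # placeholder; use real evaluation images
    pred = zero_shot_predict(image, seen_classes)
    print(f"task {task_id}: predicted '{pred}' over {len(seen_classes)} seen classes")
```

Swapping the prompt string (e.g., the bare class name versus "a photo of a {}.") is the only change needed to mimic the template-variation comparison mentioned in the abstract; per-task accuracy over the union of classes seen so far can then be averaged in the usual class-incremental fashion.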
Related papers
- CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [56.199779065855004]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations.
Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Incremental Object Detection with CLIP [36.478530086163744]
We propose using a vision-language model such as CLIP to generate text feature embeddings for different class sets.
We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario.
We incorporate the accurately recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules upon CLIP and fine-tune them by few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Don't Stop Learning: Towards Continual Learning for the CLIP Model [21.212839450030838]
The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model.
This work conducts a systemic study on the continual learning issue of the CLIP model.
We propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which is shown to be effective in alleviating the forgetting issue of the CLIP model.
arXiv Detail & Related papers (2022-07-19T13:03:14Z)
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113]
We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
arXiv Detail & Related papers (2022-03-14T15:29:27Z)