Don't Stop Learning: Towards Continual Learning for the CLIP Model
- URL: http://arxiv.org/abs/2207.09248v2
- Date: Wed, 20 Jul 2022 02:21:33 GMT
- Title: Don't Stop Learning: Towards Continual Learning for the CLIP Model
- Authors: Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, Haoxuan Ding
- Abstract summary: The Contrastive Language-Image Pre-training (CLIP) Model is a recently proposed large-scale pre-train model.
This work conducts a systemic study on the continual learning issue of the CLIP model.
We propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which shows exact effectiveness for alleviating the forgetting issue of the CLIP model.
- Score: 21.212839450030838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Contrastive Language-Image Pre-training (CLIP) Model is a recently
proposed large-scale pre-train model which attracts increasing attention in the
computer vision community. Benefiting from its gigantic image-text training
set, the CLIP model has learned outstanding capabilities in zero-shot learning
and image-text matching. To boost the recognition performance of CLIP on some
target visual concepts, it is often desirable to further update the CLIP model
by fine-tuning some classes-of-interest on extra training data. This operation,
however, raises an important concern: will the update hurt the zero-shot
learning or image-text matching capability of the CLIP, i.e., the catastrophic
forgetting issue? If yes, could existing continual learning algorithms be
adapted to alleviate the risk of catastrophic forgetting? To answer these
questions, this work conducts a systemic study on the continual learning issue
of the CLIP model. We construct evaluation protocols to measure the impact of
fine-tuning updates and explore different ways to upgrade existing continual
learning methods to mitigate the forgetting issue of the CLIP model. Our study
reveals the particular challenges of CLIP continual learning problem and lays a
foundation for further researches. Moreover, we propose a new algorithm, dubbed
Learning without Forgetting via Replayed Vocabulary (VR-LwF), which shows exact
effectiveness for alleviating the forgetting issue of the CLIP model.
Related papers
- CLIP model is an Efficient Online Lifelong Learner [5.170794699087535]
Vision-language models, such as Contrastive Language-Image Pretraining (CLIP), are more suitable candidates for online lifelong learning.
We introduce the Symmetric Image-Text (SIT) tuning strategy to maintain symmetry between image and text.
arXiv Detail & Related papers (2024-05-24T02:21:49Z) - ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation is a cutting-edge computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
arXiv Detail & Related papers (2024-03-29T11:31:12Z) - Prototypical Contrastive Learning-based CLIP Fine-tuning for Object
Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object reidentification (Re-ID)
We first analyze the role prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - Incremental Object Detection with CLIP [36.478530086163744]
We propose a visual-language model such as CLIP to generate text feature embeddings for different class sets.
We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario.
We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferrable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Multimodal Parameter-Efficient Few-Shot Class Incremental Learning [1.9220716793379256]
Few-Shot Class Incremental Learning (FSCIL) is a challenging continual learning task, where limited training examples are available during several learning sessions.
To succeed in this task, it is necessary to avoid over-fitting new classes caused by biased distributions in the few-shot training sets.
CPE-CLIP significantly improves FSCIL performance compared to state-of-the-art proposals while also drastically reducing the number of learnable parameters and training costs.
arXiv Detail & Related papers (2023-03-08T17:34:15Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - CLIP model is an Efficient Continual Learner [26.835116431183625]
We show that a frozen CLIP model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation)
We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks.
arXiv Detail & Related papers (2022-10-06T17:59:15Z) - Online Continual Learning with Contrastive Vision Transformer [67.72251876181497]
This paper proposes a framework Contrastive Vision Transformer (CVT) to achieve a better stability-plasticity trade-off for online CL.
Specifically, we design a new external attention mechanism for online CL that implicitly captures previous tasks' information.
Based on the learnable focuses, we design a focal contrastive loss to rebalance contrastive learning between new and past classes and consolidate previously learned representations.
arXiv Detail & Related papers (2022-07-24T08:51:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.