Don't Stop Learning: Towards Continual Learning for the CLIP Model
- URL: http://arxiv.org/abs/2207.09248v2
- Date: Wed, 20 Jul 2022 02:21:33 GMT
- Title: Don't Stop Learning: Towards Continual Learning for the CLIP Model
- Authors: Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, Haoxuan Ding
- Abstract summary: The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model.
This work conducts a systematic study of the continual learning issue of the CLIP model.
We propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves effective at alleviating the forgetting issue of the CLIP model.
- Score: 21.212839450030838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Contrastive Language-Image Pre-training (CLIP) model is a recently
proposed large-scale pre-trained model that has attracted increasing attention in
the computer vision community. Benefiting from its gigantic image-text training
set, the CLIP model has learned outstanding capabilities in zero-shot learning
and image-text matching. To boost the recognition performance of CLIP on some
target visual concepts, it is often desirable to further update the CLIP model
by fine-tuning it on extra training data for some classes of interest. This operation,
however, raises an important concern: will the update hurt the zero-shot
learning or image-text matching capability of the CLIP, i.e., the catastrophic
forgetting issue? If yes, could existing continual learning algorithms be
adapted to alleviate the risk of catastrophic forgetting? To answer these
questions, this work conducts a systematic study of the continual learning issue
of the CLIP model. We construct evaluation protocols to measure the impact of
fine-tuning updates and explore different ways to upgrade existing continual
learning methods to mitigate the forgetting issue of the CLIP model. Our study
reveals the particular challenges of the CLIP continual learning problem and lays a
foundation for further research. Moreover, we propose a new algorithm, dubbed
Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves
effective at alleviating the forgetting issue of the CLIP model.
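The abstract names VR-LwF but does not spell out its implementation here. The snippet below is only a minimal sketch of one plausible reading, assuming a standard Learning-without-Forgetting-style distillation term computed over a replayed vocabulary of text embeddings; the names `vr_lwf_distill_loss`, `clip_old`, `clip_new`, and `replayed_vocab_emb` are hypothetical and not taken from the paper.

```python
# Sketch (assumption, not the paper's code): distill the fine-tuned model's
# image-to-vocabulary similarity distribution toward the frozen original model's.
import torch
import torch.nn.functional as F

def vr_lwf_distill_loss(clip_old, clip_new, images, replayed_vocab_emb, tau=2.0):
    """LwF-style distillation over a replayed vocabulary of text embeddings."""
    with torch.no_grad():
        img_old = F.normalize(clip_old.encode_image(images), dim=-1)
    img_new = F.normalize(clip_new.encode_image(images), dim=-1)
    vocab = F.normalize(replayed_vocab_emb, dim=-1)

    # Image-to-vocabulary similarity distributions for the old and new models.
    logits_old = img_old @ vocab.t() / tau
    logits_new = img_new @ vocab.t() / tau

    # KL distillation keeps the new model's predictions over the replayed
    # vocabulary close to the frozen model's, which is what alleviates forgetting.
    return F.kl_div(F.log_softmax(logits_new, dim=-1),
                    F.softmax(logits_old, dim=-1),
                    reduction="batchmean")
```

In practice such a distillation term would be added to the ordinary fine-tuning loss on the target classes, weighted by a trade-off coefficient.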
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks.
In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge.
We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
arXiv Detail & Related papers (2024-10-10T10:58:41Z) - Toward a Holistic Evaluation of Robustness in CLIP Models [11.148206692373144]
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential in zero-shot classification.
This work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives.
In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.
arXiv Detail & Related papers (2024-10-02T13:26:17Z) - CLIP model is an Efficient Online Lifelong Learner [5.170794699087535]
Vision-language models, such as Contrastive Language-Image Pretraining (CLIP), are more suitable candidates for online lifelong learning.
We introduce the Symmetric Image-Text (SIT) tuning strategy to maintain symmetry between image and text.
arXiv Detail & Related papers (2024-05-24T02:21:49Z) - ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation is a cutting-edge computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
arXiv Detail & Related papers (2024-03-29T11:31:12Z) - Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z) - Incremental Object Detection with CLIP [36.478530086163744]
We propose using a vision-language model such as CLIP to generate text feature embeddings for different class sets.
We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario.
We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - CLIP model is an Efficient Continual Learner [26.835116431183625]
We show that a frozen CLIP model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation); a minimal sketch of this zero-shot protocol appears after this list.
We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks.
arXiv Detail & Related papers (2022-10-06T17:59:15Z) - Online Continual Learning with Contrastive Vision Transformer [67.72251876181497]
This paper proposes a framework, Contrastive Vision Transformer (CVT), to achieve a better stability-plasticity trade-off for online CL.
Specifically, we design a new external attention mechanism for online CL that implicitly captures previous tasks' information.
Based on the learnable focuses, we design a focal contrastive loss to rebalance contrastive learning between new and past classes and consolidate previously learned representations.
arXiv Detail & Related papers (2022-07-24T08:51:02Z)
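For reference, the zero-shot evaluation mentioned in "CLIP model is an Efficient Continual Learner" above amounts to classifying images by image-text similarity against prompts built from the class names seen so far. The following is a minimal sketch using the Hugging Face `transformers` CLIP interface, not that paper's own evaluation code; the checkpoint name, image path, and class list are illustrative only.

```python
# Sketch: zero-shot classification with a frozen CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # hypothetical image path
class_names = ["cat", "dog", "car"]        # hypothetical classes seen so far
prompts = [f"a photo of a {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity over the class prompts acts as the classifier.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

Because the class prompts can simply be extended as new tasks arrive, no parameters are updated and nothing is forgotten, which is the point made in that paper's abstract.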