Related papers: Don't Stop Learning: Towards Continual Learning for the CLIP Model

URL: http://arxiv.org/abs/2207.09248v2
Date: Wed, 20 Jul 2022 02:21:33 GMT
Title: Don't Stop Learning: Towards Continual Learning for the CLIP Model
Authors: Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, Haoxuan Ding
Abstract summary: The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model. This work conducts a systematic study of continual learning for the CLIP model. We propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves effective at alleviating the forgetting issue of the CLIP model.
Score: 21.212839450030838
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model that has attracted increasing attention in the computer vision community. Benefiting from its gigantic image-text training set, the CLIP model has learned outstanding capabilities in zero-shot learning and image-text matching. To boost the recognition performance of CLIP on some target visual concepts, it is often desirable to further update the CLIP model by fine-tuning on extra training data for the classes of interest. This operation, however, raises an important concern: will the update hurt the zero-shot learning or image-text matching capability of CLIP, i.e., cause catastrophic forgetting? If so, could existing continual learning algorithms be adapted to alleviate the risk of catastrophic forgetting? To answer these questions, this work conducts a systematic study of continual learning for the CLIP model. We construct evaluation protocols to measure the impact of fine-tuning updates and explore different ways to upgrade existing continual learning methods to mitigate the forgetting issue of the CLIP model. Our study reveals the particular challenges of the CLIP continual learning problem and lays a foundation for further research. Moreover, we propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves effective at alleviating the forgetting issue of the CLIP model.
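
Illustrative sketch (not from the paper): the abstract names Learning without Forgetting (LwF) as the basis of VR-LwF but gives no algorithmic detail. The snippet below shows only the generic LwF-style idea, as an assumption about how such a regularizer is commonly written: the image-text similarity distribution of the fine-tuned model is penalized for drifting from that of the original model. The function names, the toy embeddings, and the temperature value of 0.07 are all hypothetical choices for illustration. How VR-LwF builds its replayed vocabulary is not described here and is not modeled.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(image_vec, text_vecs, temperature=0.07):
    # Cosine similarity between one image embedding and each text embedding,
    # divided by a temperature (0.07 is an illustrative assumption).
    img = normalize(image_vec)
    return [sum(a * b for a, b in zip(img, normalize(t))) / temperature
            for t in text_vecs]

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distillation_kl(old_logits, new_logits):
    # KL(p_old || p_new): zero when the updated model reproduces the original
    # model's image-text matching distribution, positive when it drifts.
    log_p_old = log_softmax(old_logits)
    log_p_new = log_softmax(new_logits)
    return sum(math.exp(lo) * (lo - ln)
               for lo, ln in zip(log_p_old, log_p_new))

# Toy embeddings (hypothetical): three text prompts and one image, before and
# after a fine-tuning update that shifted the image embedding.
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
old_image = [0.9, 0.1, 0.0]
new_image = [0.5, 0.5, 0.0]

old_logits = similarity_logits(old_image, text_vecs)
new_logits = similarity_logits(new_image, text_vecs)
print("drift penalty:", distillation_kl(old_logits, new_logits))
```

In a training loop, a term like this would typically be added to the fine-tuning loss with a weighting coefficient, trading accuracy on the new classes against retention of the original zero-shot behavior.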

Related papers

Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning [11.50324946279326]
The Contrastive Language-Image Pre-trained model (CLIP) exhibits strong capabilities across various downstream tasks. We analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. We propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning.
arXiv Detail & Related papers (2025-07-12T02:28:42Z)
Language Guided Concept Bottleneck Models for Interpretable Continual Learning [62.09201360376577]
Continual learning aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. We introduce a novel framework that integrates language-guided Concept Bottleneck Models to address both challenges.
arXiv Detail & Related papers (2025-03-30T02:41:55Z)
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution. We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
Temporal-Difference Variational Continual Learning [89.32940051152782]
A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. In Continual Learning settings, models often struggle to balance learning new tasks with retaining previous knowledge. We propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations.
arXiv Detail & Related papers (2024-10-10T10:58:41Z)
Toward a Holistic Evaluation of Robustness in CLIP Models [11.148206692373144]
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential in zero-shot classification. This work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.
arXiv Detail & Related papers (2024-10-02T13:26:17Z)
CLIP model is an Efficient Online Lifelong Learner [5.170794699087535]
Vision-language models, such as Contrastive Language-Image Pretraining (CLIP), are more suitable candidates for online lifelong learning. We introduce the Symmetric Image-Text (SIT) tuning strategy to maintain symmetry between image and text.
arXiv Detail & Related papers (2024-05-24T02:21:49Z)
ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation is a cutting-edge computer vision task. We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE. Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
arXiv Detail & Related papers (2024-03-29T11:31:12Z)
Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID). We first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
Incremental Object Detection with CLIP [36.478530086163744]
We propose using a visual-language model such as CLIP to generate text feature embeddings for different class sets. We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario. We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z)
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned. Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP). We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
CLIP model is an Efficient Continual Learner [26.835116431183625]
We show that a frozen CLIP model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation). We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks.
arXiv Detail & Related papers (2022-10-06T17:59:15Z)
Online Continual Learning with Contrastive Vision Transformer [67.72251876181497]
This paper proposes a framework Contrastive Vision Transformer (CVT) to achieve a better stability-plasticity trade-off for online CL. Specifically, we design a new external attention mechanism for online CL that implicitly captures previous tasks' information. Based on the learnable focuses, we design a focal contrastive loss to rebalance contrastive learning between new and past classes and consolidate previously learned representations.
arXiv Detail & Related papers (2022-07-24T08:51:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.