Class Incremental Learning with Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2310.20348v1
- Date: Tue, 31 Oct 2023 10:45:03 GMT
- Title: Class Incremental Learning with Pre-trained Vision-Language Models
- Authors: Xialei Liu, Xusheng Cao, Haori Lu, Jia-wen Xiao, Andrew D. Bagdanov,
Ming-Ming Cheng
- Abstract summary: We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
- Score: 59.15538370859431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the advent of large-scale pre-trained models, interest in adapting and
exploiting them for continual learning scenarios has grown.
In this paper, we propose an approach to exploiting pre-trained
vision-language models (e.g. CLIP) that enables further adaptation instead of
only using zero-shot learning of new tasks. We augment a pre-trained CLIP model
with additional layers after the Image Encoder or before the Text Encoder. We
investigate three different strategies: a Linear Adapter, a Self-attention
Adapter, each operating on the image embedding, and Prompt Tuning which instead
modifies prompts input to the CLIP text encoder. We also propose a method for
parameter retention in the adapter layers that uses a measure of parameter
importance to better maintain stability and plasticity during incremental
learning. Our experiments demonstrate that the simplest solution -- a single
Linear Adapter layer with parameter retention -- produces the best results.
Experiments on several conventional benchmarks consistently show a significant
margin of improvement over the current state-of-the-art.
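The Linear Adapter and parameter retention ideas in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation. The function names, the `keep_fraction` parameter, and the toy 2-D embeddings are hypothetical, and because the abstract does not define its importance measure, the sketch assumes importance is the magnitude of each weight's change during the new task.

```python
import math

def linear_adapter(weights, embedding):
    """Apply a single linear layer (no bias) to an image embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(weights, image_embedding, text_embeddings):
    """Return the index of the class whose text embedding is most similar
    to the adapted image embedding (CLIP-style matching)."""
    adapted = linear_adapter(weights, image_embedding)
    scores = [cosine(adapted, t) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

def retain_parameters(old, new, keep_fraction):
    """Merge adapter weights after training on a new task.

    ASSUMPTION: importance of an entry is |new - old|. Only the top
    `keep_fraction` of entries keep their new value; the rest revert
    to the old value to limit forgetting.
    """
    ranked = sorted(
        ((abs(n - o), i, j)
         for i, (row_old, row_new) in enumerate(zip(old, new))
         for j, (o, n) in enumerate(zip(row_old, row_new))),
        reverse=True,
    )
    k = int(round(keep_fraction * len(ranked)))
    keep = {(i, j) for _, i, j in ranked[:k]}
    return [
        [new[i][j] if (i, j) in keep else old[i][j] for j in range(len(old[i]))]
        for i in range(len(old))
    ]

# Toy usage: identity adapter before the task, slightly changed after it.
old_w = [[1.0, 0.0], [0.0, 1.0]]
new_w = [[1.5, 0.01], [0.02, 0.4]]
merged = retain_parameters(old_w, new_w, keep_fraction=0.5)

# Hypothetical stand-ins for CLIP text embeddings of two classes.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
predicted = classify(merged, [1.0, 0.1], text_embeddings)
```

With `keep_fraction=0.5`, only the two diagonal entries, which changed the most, keep their new values, and the small off-diagonal updates are reverted.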
Related papers
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
(2024-07-22T16:51:28Z)
- Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion [10.322832012497722]
Class-incremental learning is a challenging problem, where the goal is to train a model that can classify data from an increasing number of classes over time.
Vision-language pre-trained models such as CLIP demonstrate good generalization ability.
However, further adaptation to downstream tasks by simply fine-tuning the model leads to severe forgetting.
Most existing works with pre-trained models assume that the forgetting of old classes is uniform when the model acquires new knowledge.
(2024-07-19T09:20:33Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
(2024-07-09T15:45:04Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
(2024-05-06T02:02:57Z)
- Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer [44.10678347943115]
Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting.
In this paper, we revisit different parameter-efficient tuning (PET) methods within the context of continual learning.
We observe that adapter tuning demonstrates superiority over prompt-based methods, even without parameter expansion in each learning session.
(2024-03-29T05:23:12Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
(2023-12-04T01:42:09Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
(2023-11-07T07:27:16Z)
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the encoder with scene-irrelevant generality across diverse scenes.
(2023-06-02T08:16:21Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
(2021-10-09T11:39:30Z)