Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors
- URL: http://arxiv.org/abs/2505.20680v1
- Date: Tue, 27 May 2025 03:51:37 GMT
- Title: Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors
- Authors: Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong,
- Abstract summary: Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. We propose a concise CL approach for CLIP based on incremental prompt tuning. We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting.
- Score: 50.7383184560431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, providing rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incremental procedures, that introduce additional, and possibly unnecessary, complexity and underutilize CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors that guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
Related papers
- Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning [19.210280671911278]
Continual learning aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. We propose a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer.
arXiv Detail & Related papers (2025-08-03T04:09:00Z) - Integrated Structural Prompt Learning for Vision-Language Models [15.002501540565781]
In this paper, we propose an Integrated Structural Prompt (ISP) for Vision-Language Models (VLMs). ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens. ISP achieves competitive performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-07-08T04:59:58Z) - ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP [12.031278034659872]
Continual learning empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.
arXiv Detail & Related papers (2025-06-24T13:22:06Z) - Adapter-Enhanced Semantic Prompting for Continual Learning [91.63494614012362]
Continual learning (CL) enables models to adapt to evolving data streams. Traditional methods usually retain past data for replay or add additional branches to the model to learn new knowledge. We propose a novel lightweight CL framework that integrates prompt tuning and adapter techniques.
arXiv Detail & Related papers (2024-12-15T06:14:55Z) - Continual Task Learning through Adaptive Policy Self-Composition [54.95680427960524]
CompoFormer is a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network.
Our experiments reveal that CompoFormer outperforms conventional continual learning (CL) methods, particularly in longer task sequences.
arXiv Detail & Related papers (2024-11-18T08:20:21Z) - Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning [11.033050922826934]
We introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate with frozen CLIP backbones.
SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP's visual and textual encoders.
We propose two innovative strategies for further refining the embedding space.
arXiv Detail & Related papers (2024-07-05T01:30:42Z) - CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models [23.398619576886375]
Continual learning (CL) aims to help deep neural networks learn new knowledge while retaining what has been learned.
Our work proposes Continual LeArning with Probabilistic finetuning (CLAP) - a probabilistic modeling framework over visual-guided text features per task.
arXiv Detail & Related papers (2024-03-28T04:15:58Z) - Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization [101.08992036691673]
This paper explores a realistic unsupervised fine-tuning scenario, considering the presence of out-of-distribution samples from unknown classes.
In particular, we focus on simultaneously enhancing out-of-distribution detection and the recognition of instances associated with known classes.
We present a simple, efficient, and effective approach called Universal Entropy Optimization (UEO)
arXiv Detail & Related papers (2023-08-24T16:47:17Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - POP: Prompt Of Prompts for Continual Learning [59.15888651733645]
Continual learning (CL) aims to mimic the human ability to learn new concepts without catastrophic forgetting.
We show that a foundation model equipped with POP learning is able to outperform classic CL methods by a significant margin.
arXiv Detail & Related papers (2023-06-14T02:09:26Z) - StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z) - CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.