GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning
- URL: http://arxiv.org/abs/2507.19839v1
- Date: Sat, 26 Jul 2025 07:22:12 GMT
- Title: GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning
- Authors: Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian
- Abstract summary: Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. When continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment. We propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge.
- Score: 27.9960664846484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. However, when continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment, undermining its zero-shot capabilities. In this work, we propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge. This orthogonal projection mathematically prevents interference with previous tasks without relying on rehearsal or architectural modification. Furthermore, to preserve the inherent generalization property of CLIP, we introduce knowledge distillation and combine it with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space during fine-tuning. On the MTIL benchmark consisting of 11 tasks, our method achieved SOTA performance on both the Average and Last key metrics. More importantly, experiments show that our method successfully maintains the original modality gap and cross-modal retrieval performance of CLIP, confirming its effectiveness in maintaining a robust visual-language space throughout the continual learning process.
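The core idea of the abstract, projecting each new task's gradients onto the null space of previously learned knowledge, can be illustrated with a short PyTorch-style sketch. This is a minimal illustration, not the authors' implementation: the class name, the covariance-based null-space estimate, and the threshold `eps` are assumptions, and the knowledge distillation and modality alignment preservation losses mentioned in the abstract are omitted.

```python
import torch

class NullSpaceProjector:
    """Illustrative sketch of gradient null-space projection for one linear layer.

    Assumption: "previously learned knowledge" is summarized by the uncentered
    covariance of the inputs that the layer saw on earlier tasks; the null space
    of that covariance is then used to filter new-task gradients.
    """

    def __init__(self, eps: float = 0.05):
        self.eps = eps      # singular-value threshold defining the approximate null space
        self.cov = None     # running uncentered covariance of layer inputs (in x in)
        self.proj = None    # projection matrix onto the approximate null space

    @torch.no_grad()
    def update_covariance(self, activations: torch.Tensor) -> None:
        """Accumulate X^T X over the inputs seen on the task just finished."""
        x = activations.reshape(-1, activations.shape[-1])
        cov = x.t() @ x / x.shape[0]
        self.cov = cov if self.cov is None else self.cov + cov

    @torch.no_grad()
    def refresh_projection(self) -> None:
        """Build P = U0 U0^T, where U0 spans directions with near-zero singular values."""
        if self.cov is None:
            return
        u, s, _ = torch.linalg.svd(self.cov)
        u0 = u[:, s < self.eps * s.max()]
        self.proj = u0 @ u0.t()

    @torch.no_grad()
    def project_(self, weight: torch.nn.Parameter) -> None:
        """Project the gradient of a (out, in) weight onto the null space of old inputs."""
        if self.proj is not None and weight.grad is not None:
            weight.grad.copy_(weight.grad @ self.proj)
```

In such a setup, one would call `update_covariance` on layer inputs at the end of each task, `refresh_projection` once per task boundary, and `project_` on each weight after `loss.backward()` and before `optimizer.step()`; the threshold `eps` would control the stability-plasticity trade-off.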
Related papers
- Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning [11.50324946279326]
The Contrastive Language-Image Pre-trained model (CLIP) exhibits strong capabilities across various downstream tasks. We analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. We propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning.
arXiv Detail & Related papers (2025-07-12T02:28:42Z)
- SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting [68.00007494819798]
Continual learning requires a model to learn multiple tasks in sequence while maintaining both stability (preserving knowledge from previously learned tasks) and plasticity (effectively learning new tasks). Gradient projection has emerged as an effective and popular paradigm in CL, in which the gradient space of previously learned tasks is partitioned into two subspaces. New tasks are learned within the minor subspace, thereby reducing interference with previously acquired knowledge. Existing gradient projection methods struggle to achieve an optimal balance between plasticity and stability, as it is hard to partition the gradient space appropriately.
arXiv Detail & Related papers (2025-05-28T13:57:56Z)
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models applied at a phase between pre-training and fine-tuning. It aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance.
arXiv Detail & Related papers (2025-04-17T07:46:19Z)
- CODE-CL: Conceptor-Based Gradient Projection for Deep Continual Learning [6.738409533239947]
Deep neural networks struggle with catastrophic forgetting when learning tasks sequentially. Recent approaches constrain updates to subspaces using gradient projection. We propose Conceptor-based gradient projection for Deep Continual Learning (CODE-CL).
arXiv Detail & Related papers (2024-11-21T22:31:06Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of VLMs' zero-shot generalization; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation [38.16802763051431]
We propose CLIPtrase, a training-free semantic segmentation strategy.
It enhances local feature awareness through recalibrated self-correlation among patches.
Experiments show that CLIPtrase outperforms CLIP by 22.3% on average across 9 segmentation benchmarks.
arXiv Detail & Related papers (2024-07-11T08:12:16Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while requiring no exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.