Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
- URL: http://arxiv.org/abs/2508.04227v1
- Date: Wed, 06 Aug 2025 09:03:10 GMT
- Title: Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
- Authors: Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian
- Abstract summary: Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
- Score: 70.83781268763215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
Related papers
- Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment.
arXiv Detail & Related papers (2026-02-03T06:06:35Z)
- Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning [41.523848964102]
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL). RL provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties. We propose DoGe, a dual-decoupling framework that guides models to first learn from context rather than problem solving.
arXiv Detail & Related papers (2025-12-07T13:17:31Z)
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z)
- Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models [19.71113926850385]
The AFA method significantly outperforms existing state-of-the-art approaches. It surpasses the inherent zero-shot performance of CLIP in terms of transferability.
arXiv Detail & Related papers (2025-05-12T15:56:23Z)
- Enhanced Continual Learning of Vision-Language Models with Model Fusion [16.764069327701186]
Vision-Language Models (VLMs) represent a breakthrough in artificial intelligence. VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. We propose Continual Decoupling-Unifying (ConDU), a novel approach that introduces model fusion into continual learning.
arXiv Detail & Related papers (2025-03-12T15:48:13Z)
- Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. Tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z)
- Online Continual Learning: A Systematic Literature Review of Approaches, Challenges, and Benchmarks [1.3631535881390204]
Online Continual Learning (OCL) is a critical area in machine learning. This study conducts the first comprehensive Systematic Literature Review on OCL.
arXiv Detail & Related papers (2025-01-09T01:03:14Z)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)