Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
- URL: http://arxiv.org/abs/2511.15164v1
- Date: Wed, 19 Nov 2025 06:29:15 GMT
- Title: Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
- Authors: Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
- Abstract summary: Multimodal continual instruction tuning enables large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. We introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning.
- Score: 41.58239719458457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
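The update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the sign convention, the scaling factor, the way replay gradients are added, and all names (`guided_step`, `p_guide`, `guide_scale`) are assumptions made for illustration. The sketch treats the vector from the previously optimal parameters to the current parameters as a stand-in for the missing old-task gradient, optionally adds a real gradient from a replay buffer, and uses a Bernoulli draw to decide whether the guidance is applied at a given step.

```python
import random


def guided_step(theta, theta_old, new_grad, lr=0.1, p_guide=0.5,
                guide_scale=1.0, replay_grad=None, rng=random):
    """One SGD-style step with a hypothetical gradient-guidance term.

    theta       -- current parameters (list of floats)
    theta_old   -- parameters that were optimal for earlier tasks
    new_grad    -- gradient of the new-task loss at theta
    p_guide     -- assumed Bernoulli probability of applying the guidance
    replay_grad -- optional real gradient computed on a small replay buffer
    """
    # Assumed approximation of the missing old-task gradient: descending
    # along (theta - theta_old) pulls theta back toward theta_old.
    guide = [guide_scale * (t - t_old) for t, t_old in zip(theta, theta_old)]

    # Optionally combine the approximation with a real replay gradient.
    if replay_grad is not None:
        guide = [g + r for g, r in zip(guide, replay_grad)]

    # Bernoulli gate: apply the guidance on some steps (stability) and
    # skip it on others (plasticity).
    use_guide = rng.random() < p_guide

    total = [g_new + (g_old if use_guide else 0.0)
             for g_new, g_old in zip(new_grad, guide)]
    return [t - lr * g for t, g in zip(theta, total)]
```

With `p_guide=1.0` every step is pulled toward `theta_old`, and with `p_guide=0.0` the update reduces to plain gradient descent on the new task, so in this sketch the probability acts as the stability-plasticity knob.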
Related papers
- Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgettings [13.325021114990241]
Catastrophic forgetting remains a critical challenge in continual learning for large language models. We propose Dynamic Orthogonal Continual (DOC) fine-tuning, a novel approach that tracks the drift of functional directions and dynamically updates them during the fine-tuning process.
arXiv Detail & Related papers (2025-09-28T13:55:05Z)
- Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [57.514786046966265]
We propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z)
- Continuous Subspace Optimization for Continual Learning [24.597922531045846]
Continual learning aims to learn multiple tasks sequentially while preserving prior knowledge. We propose Continuous Subspace Optimization for Continual Learning (CoSO) to fine-tune the model in a series of subspaces rather than a single one. CoSO significantly outperforms state-of-the-art methods, especially in challenging scenarios with long task sequences.
arXiv Detail & Related papers (2025-05-17T03:53:21Z)
- Gradient Projection For Continual Parameter-Efficient Tuning [42.800411328615894]
We reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection.
We show that the condition for the gradient can effectively resist forgetting even for large-scale models.
We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets.
arXiv Detail & Related papers (2024-05-22T06:33:48Z)
- Continual Learning with Scaled Gradient Projection [8.847574864259391]
In neural networks, continual learning results in gradient interference among sequential tasks, leading to forgetting of old tasks while learning new ones.
We propose a Scaled Gradient Projection (SGP) method to improve new learning while minimizing forgetting.
We conduct experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.
arXiv Detail & Related papers (2023-02-02T19:46:39Z)
- FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not enough training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.