Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
- URL: http://arxiv.org/abs/2602.18055v1
- Date: Fri, 20 Feb 2026 08:15:28 GMT
- Title: Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
- Authors: Jingyang Qiao, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yanyun Qu, Yuan Xie
- Abstract summary: Multimodal Large Language Models (MLLMs) can enable unified multimodal comprehension and generation through text and image modalities. Despite strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs remain deficient in lifelong evolution. No standardized continual learning framework for Dual-to-Dual MLLMs has been established yet.
- Score: 48.74174551777241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dual-to-Dual MLLMs refer to Multimodal Large Language Models that enable unified multimodal comprehension and generation across text and image modalities. Although they exhibit strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs remain deficient in lifelong evolution, which significantly limits continual adaptation to dynamic real-world scenarios. One challenge is that learning new tasks inevitably destroys previously learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face further challenges, including hallucination, failure to follow instructions, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has been established yet, leaving these challenges unexplored. Thus, in this paper, we establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately architected evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose an efficient MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) method to further facilitate knowledge transfer across modalities and mitigate forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.
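As a rough illustration of the mixture-and-aggregation idea named in the abstract, the sketch below attaches one shared "general" LoRA and a gated set of "expert" LoRAs to a frozen linear layer. The module names, the softmax gate, and all hyperparameters are assumptions made for illustration; this is not the paper's MAGE implementation.

```python
# Hypothetical sketch: one shared "general" LoRA plus a gated set of "expert"
# LoRAs on top of a frozen linear layer. Names, the softmax gate, and the
# hyperparameters are illustrative assumptions, not the MAGE implementation.
import torch
import torch.nn as nn


class LoRA(nn.Module):
    """Low-rank adapter producing (alpha / r) * x @ A^T @ B^T."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A.T @ self.B.T) * self.scale


class GeneralExpertLoRALinear(nn.Module):
    """Frozen base projection + general LoRA + token-routed expert LoRAs."""

    def __init__(self, d_in: int, d_out: int, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():          # backbone weights stay frozen
            p.requires_grad_(False)
        self.general = LoRA(d_in, d_out)          # shared across every task
        self.experts = nn.ModuleList([LoRA(d_in, d_out) for _ in range(num_experts)])
        self.gate = nn.Linear(d_in, num_experts)  # routes each token to the experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                   # (..., E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_out, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)
        return self.base(x) + self.general(x) + mixed


layer = GeneralExpertLoRALinear(d_in=768, d_out=768)
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```

In this reading, the general LoRA carries cross-task knowledge while the gated experts absorb task-specific updates, which is one common way such mixtures are used to limit forgetting.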
Related papers
- MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering [36.80441487363007]
MLLMEraser is an input-aware, training-free framework for test-time unlearning. We construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines.
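A minimal sketch of the activation-steering idea described above: a steering direction is estimated from the difference between hidden states of knowledge-recall inputs and their perturbed counterparts, then subtracted from activations at inference. The function names, the mean-difference estimator, and the steering strength are assumptions, not MLLMEraser's actual procedure.

```python
# Hypothetical sketch of test-time unlearning via activation steering.
import torch


def erasure_direction(recall_acts: torch.Tensor, perturbed_acts: torch.Tensor) -> torch.Tensor:
    """Unit mean-difference between knowledge-recall and perturbed activations."""
    direction = (recall_acts - perturbed_acts).mean(dim=0)
    return direction / direction.norm()


def steer(hidden: torch.Tensor, direction: torch.Tensor, strength: float = 4.0) -> torch.Tensor:
    """Remove the erasure direction from every token's hidden state."""
    projection = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - strength * projection


# Toy usage with activations assumed to come from one transformer layer.
d_model = 64
recall = torch.randn(32, d_model) + 1.0     # prompts that recall the target knowledge
perturbed = torch.randn(32, d_model)        # adversarially perturbed variants
direction = erasure_direction(recall, perturbed)

hidden_states = torch.randn(2, 10, d_model)  # (batch, seq, d_model) at test time
print(steer(hidden_states, direction).shape)  # torch.Size([2, 10, 64])
```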
arXiv Detail & Related papers (2025-10-05T14:20:17Z) - MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as text and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z) - Uncovering inequalities in new knowledge learning by large language models across different languages [66.687369838071]
We show that low-resource languages consistently face disadvantages across all four dimensions. We aim to raise awareness of linguistic inequalities in LLMs' new knowledge learning, fostering the development of more inclusive and equitable future LLMs.
arXiv Detail & Related papers (2025-03-06T03:41:47Z) - Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models [52.569132872560814]
Multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision. However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning. We analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs.
arXiv Detail & Related papers (2025-03-03T09:01:51Z) - CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering [27.812611421754482]
We propose an MLLM-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA). We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. Our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach.
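One plausible reading of the "dual momentum" idea is an exponential-moving-average merge of expert adapters across tasks, so parameters learned on earlier VQA tasks drift only slowly. The sketch below illustrates such an EMA merge; the update rule, names, and the single momentum coefficient are assumptions rather than the CL-MoE method itself.

```python
# Rough sketch of a momentum (EMA) merge over expert adapters across tasks.
import torch
import torch.nn as nn


@torch.no_grad()
def momentum_merge(running_experts: nn.ModuleList, current_experts: nn.ModuleList, m: float = 0.9) -> None:
    """Blend newly fine-tuned expert weights into the running experts kept across tasks."""
    for running, current in zip(running_experts, current_experts):
        for p_run, p_cur in zip(running.parameters(), current.parameters()):
            p_run.mul_(m).add_(p_cur, alpha=1.0 - m)


num_experts, d = 4, 32
running = nn.ModuleList([nn.Linear(d, d) for _ in range(num_experts)])   # kept across tasks
current = nn.ModuleList([nn.Linear(d, d) for _ in range(num_experts)])   # trained on the new task

momentum_merge(running, current, m=0.9)  # old-task knowledge decays slowly toward the new task
```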
arXiv Detail & Related papers (2025-03-01T09:25:23Z) - SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning [16.873306091966693]
Visual instruction tuning enables multimodal large language models (MLLMs) to handle a wide range of vision tasks by framing them as language-based instructions. We identify a dual form of catastrophic forgetting in continual visual instruction tuning (CVIT), where MLLMs forget previously learned visual understanding and also experience a decline in instruction-following abilities. We introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules, one for visual understanding and another for instruction following.
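A small sketch of what "separable routing" could look like: two independent LoRA pools, one routed by visual features and one by instruction features, each contributing its own adaptation to the token stream. Pool sizes, routing inputs, and names are illustrative assumptions, not the SMoLoRA implementation.

```python
# Illustrative sketch of two separably routed LoRA pools.
import torch
import torch.nn as nn


class RoutedLoRAPool(nn.Module):
    """A pool of low-rank adapters mixed by a softmax router."""

    def __init__(self, d: int, num_adapters: int = 4, r: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_adapters, d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_adapters, r, d))
        self.router = nn.Linear(d, num_adapters)

    def forward(self, x: torch.Tensor, routing_feat: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(routing_feat), dim=-1)            # (batch, num_adapters)
        delta = torch.einsum("btd,ndr,nre->btne", x, self.A, self.B)    # per-adapter outputs
        return torch.einsum("btne,bn->bte", delta, w)                   # routed mixture


d = 64
visual_pool = RoutedLoRAPool(d)   # intended for visual understanding
instr_pool = RoutedLoRAPool(d)    # intended for instruction following

tokens = torch.randn(2, 16, d)
visual_feat = torch.randn(2, d)   # pooled image features (assumed routing signal)
instr_feat = torch.randn(2, d)    # pooled instruction features (assumed routing signal)

adapted = tokens + visual_pool(tokens, visual_feat) + instr_pool(tokens, instr_feat)
print(adapted.shape)  # torch.Size([2, 16, 64])
```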
arXiv Detail & Related papers (2024-11-21T09:00:15Z) - Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt [51.71932333475573]
Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed instruction datasets. Existing MCIT methods do not fully exploit the unique attribute of LMMs. We propose a novel prompt learning framework for MCIT to effectively alleviate forgetting of previous knowledge.
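A speculative sketch of dual-modality guided prompt selection: task-specific prompts are kept in a pool and scored against both the image feature and the text feature, with the best-matching prompts prepended to the input. The pool layout and scoring rule are assumptions, not ModalPrompt's design.

```python
# Speculative sketch of a dual-modality guided prompt pool.
import torch
import torch.nn as nn


class DualModalityPromptPool(nn.Module):
    def __init__(self, num_tasks: int, prompt_len: int, d: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tasks, prompt_len, d) * 0.02)
        self.image_keys = nn.Parameter(torch.randn(num_tasks, d) * 0.02)
        self.text_keys = nn.Parameter(torch.randn(num_tasks, d) * 0.02)

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Score each task's prompts by agreement with both modalities, pick the best match.
        score = image_feat @ self.image_keys.T + text_feat @ self.text_keys.T  # (batch, num_tasks)
        idx = score.argmax(dim=-1)
        return self.prompts[idx]                                               # (batch, prompt_len, d)


pool = DualModalityPromptPool(num_tasks=6, prompt_len=8, d=64)
selected = pool(torch.randn(2, 64), torch.randn(2, 64))
print(selected.shape)  # torch.Size([2, 8, 64])
```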
arXiv Detail & Related papers (2024-10-08T09:35:37Z) - M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning [9.15567555909617]
M2Distill is a multi-modal distillation-based method for lifelong imitation learning. We regulate the shifts in latent representations across different modalities from previous to current steps. We ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills.
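An illustrative sketch of a multi-modal distillation loss in this spirit: the current policy's per-modality latent features are pulled toward those of a frozen copy saved before learning the new task. The modality names and the simple L2 form are assumptions; M2Distill's actual losses may differ.

```python
# Illustrative per-modality distillation loss between current and previous latents.
import torch
import torch.nn.functional as F


def multimodal_distill_loss(current: dict, previous: dict) -> torch.Tensor:
    """Sum of per-modality distances to the frozen previous-step representations."""
    return sum(F.mse_loss(current[m], previous[m].detach()) for m in previous)


# Toy latents for three modalities of a manipulation policy (assumed names).
prev = {"vision": torch.randn(8, 128), "proprio": torch.randn(8, 32), "language": torch.randn(8, 64)}
curr = {k: v + 0.1 * torch.randn_like(v) for k, v in prev.items()}

print(float(multimodal_distill_loss(curr, prev)))
```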
arXiv Detail & Related papers (2024-09-30T01:43:06Z) - CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm.
Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting.
We introduce MoELoRA to MLLMs, which is effective in retaining the previous instruction alignment.
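For context on how forgetting is typically quantified in such sequential benchmarks, the sketch below computes a standard average-forgetting score from a task-by-task accuracy matrix. The metric form is the common one from the continual-learning literature and the numbers are made up; CoIN's exact metrics may differ.

```python
# Standard average-forgetting computation over a sequence of tasks.
def average_forgetting(acc_matrix: list) -> float:
    """acc_matrix[i][j] = accuracy on task j after training on task i (0-indexed)."""
    T = len(acc_matrix)
    drops = []
    for j in range(T - 1):
        best_earlier = max(acc_matrix[i][j] for i in range(j, T - 1))
        drops.append(best_earlier - acc_matrix[T - 1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracies over a 3-task sequence (not results from the paper).
acc = [
    [72.0,  0.0,  0.0],
    [61.5, 68.0,  0.0],
    [55.0, 60.2, 70.4],
]
print(average_forgetting(acc))  # mean accuracy drop on tasks 1 and 2 after the final task
```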
arXiv Detail & Related papers (2024-03-13T08:54:31Z) - Continual Instruction Tuning for Large Multimodal Models [30.438442723421556]
Multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting.
We propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs.
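A hedged sketch of what task-similarity-informed regularization could look like: an L2 anchor to the previous-task weights whose strength shrinks when the new task's instruction embedding resembles an earlier task's. The similarity measure and penalty form are assumptions, not the paper's exact formulation.

```python
# Hypothetical similarity-scaled anchor penalty for continual instruction tuning.
import torch
import torch.nn.functional as F


def similarity_scaled_penalty(model_params, old_params, new_task_emb, old_task_embs,
                              base_lambda: float = 1.0) -> torch.Tensor:
    """L2 anchor to old weights, weaker when the new task resembles an old one."""
    sims = F.cosine_similarity(new_task_emb.unsqueeze(0), old_task_embs, dim=-1)
    scale = base_lambda * (1.0 - sims.max()).clamp(min=0.0)
    return scale * sum(((p - q.detach()) ** 2).sum() for p, q in zip(model_params, old_params))


# Toy usage with two parameter tensors and assumed task embeddings.
params = [torch.randn(4, 4, requires_grad=True), torch.randn(4, requires_grad=True)]
anchors = [p.detach().clone() + 0.05 for p in params]
new_emb = torch.randn(16)        # embedding of the incoming task's instructions
old_embs = torch.randn(3, 16)    # embeddings of previously learned tasks

print(float(similarity_scaled_penalty(params, anchors, new_emb, old_embs)))
```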
arXiv Detail & Related papers (2023-11-27T15:04:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.