Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
- URL: http://arxiv.org/abs/2512.03125v1
- Date: Tue, 02 Dec 2025 18:36:26 GMT
- Title: Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
- Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu
- Abstract summary: Modality-Decoupled Experts (MoDE) is a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict. MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings.
- Score: 25.457245885820484
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be publicly available at https://github.com/Christina200/MoDE-official.git
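The abstract attributes inter-modal forgetting to gradient conflict between modalities. The sketch below is only a generic illustration of that notion, not the paper's analysis or the MoDE method: it treats two gradients as conflicting when their cosine similarity is negative, and shows that, to first order, a step that follows one task's gradient then raises the other task's loss. The function names, the toy gradient values, and the learning rate are all hypothetical.

```python
import math


def dot(g_a, g_b):
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(g_a, g_b))


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two gradients (0.0 if either is zero)."""
    norm_a = math.sqrt(dot(g_a, g_a))
    norm_b = math.sqrt(dot(g_b, g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(g_a, g_b) / (norm_a * norm_b)


def gradients_conflict(g_a, g_b):
    """Assumed criterion: gradients conflict when they point in opposing directions."""
    return cosine_similarity(g_a, g_b) < 0.0


def first_order_loss_change(g_a, g_b, lr):
    """First-order change in task A's loss after an SGD step on task B.

    The step is delta = -lr * g_b, so the change is g_a . delta = -lr * (g_a . g_b).
    A positive value means task A's loss goes up, i.e. task A is being forgotten.
    """
    return -lr * dot(g_a, g_b)


# Hypothetical gradients on shared parameters for the two modalities' tasks.
g_understanding = [1.0, 0.0]
g_generation = [-1.0, 1.0]

conflict = gradients_conflict(g_understanding, g_generation)
change = first_order_loss_change(g_understanding, g_generation, lr=0.1)
print(conflict, change)
```

Under this toy setup, updating shared parameters for the generation task pushes the understanding loss upward. That interference on shared parameters is what a modality-decoupled design aims to avoid by isolating modality-specific updates.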
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- Boosting Multimodal Learning via Disentangled Gradient Learning [6.93254775445168]
Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. We reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. We propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module.
arXiv Detail & Related papers (2025-07-14T12:31:28Z)
- Continual Multimodal Contrastive Learning [99.53621521696051]
Multimodal Contrastive Learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge.
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- ReconBoost: Boosting Can Achieve Modality Reconcilement [89.4377895465204]
We study the modality-alternating learning paradigm to achieve reconcilement.
We propose a new method called ReconBoost to update a fixed modality each time.
We show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others.
arXiv Detail & Related papers (2024-05-15T13:22:39Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality [14.594707272134414]
There is a rapidly growing need for multimodal content moderation (CM) on social media.
Existing unimodal CM systems may fail to catch harmful content that crosses modalities.
We present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), to target multimodal and unimodal CM tasks.
arXiv Detail & Related papers (2023-05-17T20:06:29Z)
- Towards Good Practices for Missing Modality Robust Action Recognition [20.26021126604409]
This paper seeks a set of good practices for multi-modal action recognition.
First, we study how to effectively regularize the model during training.
Second, we investigate fusion methods for robustness to missing modalities.
Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding.
arXiv Detail & Related papers (2022-11-25T06:10:57Z)