Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation
- URL: http://arxiv.org/abs/2511.06723v1
- Date: Mon, 10 Nov 2025 05:33:41 GMT
- Title: Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation
- Authors: Evelyn Chee, Wynne Hsu, Mong Li Lee
- Abstract summary: Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. We propose a pre-trained model-based framework for multi-modal continual learning. Our approach consistently outperforms baselines in both class-incremental and domain-incremental learning.
- Score: 23.701550460744738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and regularize relationships between learned representations to preserve knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
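The abstract names three ingredients (a mixture-of-experts cross-modality adapter, a representation alignment loss, and a relational regularizer for knowledge preservation) but gives no implementation details here. The following PyTorch sketch shows one plausible shape for each; all module names, dimensions, expert counts, and loss forms are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoECrossModalityAdapter(nn.Module):
    """Mixture-of-experts adapter over two modality streams (assumed design)."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([za, zb], dim=-1)                          # (B, 2*dim)
        weights = F.softmax(self.gate(x), dim=-1)                # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (B, dim)

def alignment_loss(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    """Pull paired representations of the same sample together (cosine)."""
    return (1 - F.cosine_similarity(za, zb, dim=-1)).mean()

def relation_preservation_loss(z_new: torch.Tensor, z_old: torch.Tensor) -> torch.Tensor:
    """Keep the batch's pairwise-similarity structure close to that of a
    frozen copy of the previous-task model, to reduce forgetting."""
    s_new = F.normalize(z_new, dim=-1) @ F.normalize(z_new, dim=-1).T
    s_old = F.normalize(z_old, dim=-1) @ F.normalize(z_old, dim=-1).T
    return F.mse_loss(s_new, s_old.detach())
```

A per-task objective could then combine these as `ce_loss + lam_align * alignment_loss(za, zb) + lam_keep * relation_preservation_loss(z_new, z_old)`; the combination and both weights are guesses, not the paper's reported setup.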
Related papers
- Balanced Multimodal Learning via Mutual Information [1.9336815376402718]
We propose a novel unified framework designed to address modality imbalance by utilizing mutual information to quantify interactions between modalities. Our approach adopts a balanced multimodal learning strategy comprising two key stages: cross-modal knowledge distillation (KD) and a multitask-like training paradigm.
arXiv Detail & Related papers (2025-11-02T15:58:05Z)
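The summary names cross-modal knowledge distillation as one of the two stages; a generic temperature-scaled KD loss along those lines might look as follows. The teacher/student roles per modality and the temperature are assumptions, and the paper's mutual-information-based weighting is not reproduced.

```python
import torch.nn.functional as F

def cross_modal_kd(student_logits, teacher_logits, temperature: float = 2.0):
    """Generic cross-modal KD: the weaker modality's head matches the
    stronger modality's softened predictions (sketch only)."""
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```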
- Quantifying Cross-Modality Memorization in Vision-Language Models [86.82366725590508]
We study the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities.
arXiv Detail & Related papers (2025-06-05T16:10:47Z)
- Harmony: A Unified Framework for Modality Incremental Learning [81.13765007314781]
This paper investigates the feasibility of developing a unified model capable of incremental learning across continuously evolving modal sequences. We propose a novel framework named Harmony, designed to achieve modal alignment and knowledge retention. Our approach introduces adaptive compatible feature modulation and cumulative modal bridging.
arXiv Detail & Related papers (2025-04-17T06:35:01Z)
- Continual Learning for Multiple Modalities [6.23075162128532]
We propose a novel continual learning framework that accommodates multiple modalities. We train a model to align various modalities with text, leveraging its rich semantic information. To alleviate the overwriting of previously learned modality knowledge, we propose a framework that consolidates intra-modal knowledge.
arXiv Detail & Related papers (2025-03-11T05:50:13Z)
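One plausible reading of "aligning various modalities with text" is a learned projection of each non-text modality into the text embedding space, trained with a cosine objective. The sketch below assumes that reading; the `TextAnchoredAligner` name, dimensions, and loss form are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAnchoredAligner(nn.Module):
    """Project each non-text modality into the text embedding space and
    pull it toward the paired text embedding (assumed formulation)."""
    def __init__(self, modality_dims: dict, text_dim: int):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, text_dim) for name, dim in modality_dims.items()}
        )

    def forward(self, feats: dict, text_emb: torch.Tensor) -> torch.Tensor:
        text_emb = F.normalize(text_emb, dim=-1)
        losses = [
            (1 - (F.normalize(self.proj[m](f), dim=-1) * text_emb).sum(-1)).mean()
            for m, f in feats.items()  # cosine distance per modality
        ]
        return torch.stack(losses).mean()
```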
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs).
Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations.
We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z)
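"Relational structural similarities between the data points in each modality" suggests comparing batch-level pairwise similarity matrices across modalities. A minimal sketch of that idea, with the MSE loss form assumed:

```python
import torch
import torch.nn.functional as F

def relational_alignment(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    """Match the batch's pairwise cosine-similarity matrix across two
    modalities, aligning relational structure rather than raw features."""
    sim_a = F.normalize(za, dim=-1) @ F.normalize(za, dim=-1).T  # (B, B)
    sim_b = F.normalize(zb, dim=-1) @ F.normalize(zb, dim=-1).T
    return F.mse_loss(sim_a, sim_b)
```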
- Zero-Shot Relational Learning for Multimodal Knowledge Graphs [31.215889061734295]
We propose a novel end-to-end framework consisting of three components, i.e., a multimodal learner, a structure consolidator, and a relation embedding generator, to integrate diverse multimodal information and knowledge graph structures. Evaluation results on three multimodal knowledge graphs demonstrate the superior performance of our proposed method.
arXiv Detail & Related papers (2024-04-09T11:14:45Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
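The MLA summary is concrete enough for a small sketch: modality-specific encoders take alternating gradient steps while a single shared head is optimized on every step. Encoder architectures, dimensions, and the optimizer below are placeholders, not MLA's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder encoders and dimensions; only the alternating pattern matters.
encoders = nn.ModuleDict({
    "image": nn.Sequential(nn.Linear(512, 256), nn.ReLU()),
    "audio": nn.Sequential(nn.Linear(128, 256), nn.ReLU()),
})
shared_head = nn.Linear(256, 10)  # optimized on every modality's step
criterion = nn.CrossEntropyLoss()
params = list(encoders.parameters()) + list(shared_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def alternating_round(batches):
    """One round of alternating unimodal adaptation: each modality takes
    its own supervised step, but every step trains the shared head."""
    for name, (x, y) in batches.items():
        optimizer.zero_grad()
        loss = criterion(shared_head(encoders[name](x)), y)
        loss.backward()
        optimizer.step()

# Example with random data:
alternating_round({
    "image": (torch.randn(8, 512), torch.randint(0, 10, (8,))),
    "audio": (torch.randn(8, 128), torch.randint(0, 10, (8,))),
})
```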
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
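UmURL's "efficient early-fusion strategy" contrasts with running one encoder stream per modality. A minimal single-stream version might concatenate modality features once at the input; the layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-stream early fusion: concatenate modality features at the
    input and encode them jointly, instead of one stream per modality."""
    def __init__(self, modality_dims, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(modality_dims), hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, feats):
        # feats: list of (B, d_i) tensors, one per skeleton modality
        return self.net(torch.cat(feats, dim=-1))
```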
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
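For "Learning Unseen Modality Interaction" above, the described module projects each modality's features into a common space. The sketch below additionally mean-pools whatever subset of modalities is present, which is one way such a projection could support unseen combinations; the class name and fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class CommonSpaceProjector(nn.Module):
    """Project every modality into one shared space, then mean-pool the
    subset that is present, so unseen modality combinations can be fused."""
    def __init__(self, modality_dims: dict, common_dim: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, common_dim) for m, d in modality_dims.items()}
        )

    def forward(self, feats: dict) -> torch.Tensor:
        projected = [self.proj[m](f) for m, f in feats.items()]
        return torch.stack(projected, dim=0).mean(dim=0)
```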
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
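MMCL's instance-based contrastive task is plausibly a standard cross-modal InfoNCE objective; a generic version is sketched below. The sentiment-based task and the paper's exact formulation are not reproduced, and the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def instance_contrastive(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.07):
    """Symmetric cross-modal InfoNCE over a batch of paired representations;
    matched pairs sit on the diagonal of the similarity matrix."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.T / tau                              # (B, B)
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```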