ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models
- URL: http://arxiv.org/abs/2410.05849v1
- Date: Tue, 8 Oct 2024 09:35:37 GMT
- Title: ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models
- Authors: Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, Cheng-Lin Liu
- Abstract summary: Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed datasets jointly.
Existing methods leverage data replay or model expansion, neither of which is specially developed for LMMs.
We propose a novel dual-modality guided prompt learning framework (ModalPrompt) tailored for multimodal continual learning.
- Score: 40.7613157799378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed datasets jointly. However, novel tasks are encountered sequentially in a dynamic world, and continually fine-tuning LMMs often leads to performance degradation. To handle the challenge of catastrophic forgetting, existing methods leverage data replay or model expansion, neither of which is specially developed for LMMs, and both of which have inherent limitations. In this paper, we propose a novel dual-modality guided prompt learning framework (ModalPrompt) tailored for multimodal continual learning to effectively learn new tasks while alleviating forgetting of previous knowledge. Concretely, we learn prototype prompts for each task and exploit efficient prompt selection for task identifiers and prompt fusion for knowledge transfer based on image-text supervision. Extensive experiments demonstrate the superiority of our approach, e.g., ModalPrompt achieves a +20% performance gain on LMM continual learning benchmarks with $\times$1.42 inference speed, while keeping training cost from growing in proportion to the number of tasks. The code will be made publicly available.
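The abstract's prompt selection and prompt fusion steps can be pictured with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the softmax weighting, and the assumption that prototype features share a dimension with the image and text features are all illustrative choices made here, using only the Python standard library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_and_fuse(prototypes, image_feat, text_feat, top_k=2):
    """Hypothetical dual-modality selection and fusion.

    prototypes: dict mapping a task name to that task's prototype feature vector.
    Each task is scored by the sum of its similarity to the image feature and
    to the text feature (an assumed scoring rule, not taken from the paper).
    """
    scores = {
        task: cosine(proto, image_feat) + cosine(proto, text_feat)
        for task, proto in prototypes.items()
    }
    # Selection: keep the top_k best-matching tasks.
    selected = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # Fusion: softmax-weighted average of the selected prototypes.
    exps = [math.exp(scores[task]) for task in selected]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(image_feat)
    fused = [
        sum(w * prototypes[task][i] for w, task in zip(weights, selected))
        for i in range(dim)
    ]
    return selected, weights, fused


# Toy example with made-up 3-dimensional features.
prototypes = {
    "vqa": [1.0, 0.0, 0.0],
    "captioning": [0.0, 1.0, 0.0],
    "grounding": [0.0, 0.0, 1.0],
}
selected, weights, fused = select_and_fuse(
    prototypes, image_feat=[0.9, 0.1, 0.0], text_feat=[0.8, 0.2, 0.0]
)
```

In this toy setup the cost of selection depends on `top_k` rather than on the total number of stored tasks, which is one plausible reading of how such a scheme could avoid growing with the task count.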
Related papers
- Modality-Inconsistent Continual Learning of Multimodal Large Language Models [37.15220266767881]
We introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs).
Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting.
We propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities.
arXiv Detail & Related papers (2024-12-17T16:13:56Z)
- LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant [63.28378110792787]
We introduce LamRA, a versatile framework designed to empower Large Multimodal Models with sophisticated retrieval and reranking capabilities.
For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning.
For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance.
arXiv Detail & Related papers (2024-12-02T17:10:16Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Continual Instruction Tuning for Large Multimodal Models [30.438442723421556]
Multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting.
We propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs.
arXiv Detail & Related papers (2023-11-27T15:04:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.