Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning
- URL: http://arxiv.org/abs/2411.13949v1
- Date: Thu, 21 Nov 2024 09:00:15 GMT
- Title: Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning
- Authors: Ziqi Wang, Chang Che, Qi Wang, Yangyang Li, Zenglin Shi, Meng Wang,
- Abstract summary: Visual instruction tuning enables large language models (MLLMs) to handle a wide range of vision tasks by framing them as language-based instructions.
We identify a dual form of catastrophic forgetting in CVIT, where MLLMs forget previously learned visual understanding but also experience a decline in instruction following abilities.
We introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules.
This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance.
- Score: 16.873306091966693
- License:
- Abstract: Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules - one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a novel CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.
Related papers
- Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning [102.18178065928426]
We propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA)
VCE enhances the vision projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features.
Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks.
arXiv Detail & Related papers (2024-11-19T11:03:09Z) - Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models [93.5327725085853]
Continual LLaVA is a rehearsal-free method tailored for continual instruction tuning in LVLMs.
Experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
arXiv Detail & Related papers (2024-11-04T19:55:32Z) - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm.
Experiments on CoIN demonstrate that current powerful MLLMs still suffer catastrophic forgetting.
We introduce MoELoRA to MLLMs which is effective to retain the previous instruction alignment.
arXiv Detail & Related papers (2024-03-13T08:54:31Z) - Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning [68.94230363140771]
Mixture of Cluster-conditional LoRA Experts (MoCLE)
MoCLE is a novel Mixture of Experts architecture designed to activate the task-customized model parameters based on the instruction clusters.
Experiments on InstructBLIP and LLaVA demonstrate the effectiveness of MoCLE.
arXiv Detail & Related papers (2023-12-19T18:11:19Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Revisiting Unsupervised Meta-Learning: Amplifying or Compensating for
the Characteristics of Few-Shot Tasks [30.893785366366078]
We develop a practical approach towards few-shot image classification, where a visual recognition system is constructed with limited data.
We find that the base class set labels are not necessary, and discriminative embeddings could be meta-learned in an unsupervised manner.
Experiments on few-shot learning benchmarks verify our approaches outperform previous methods by a 4-10% performance gap.
arXiv Detail & Related papers (2020-11-30T10:08:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.