Multimodal Instruction Tuning with Conditional Mixture of LoRA
- URL: http://arxiv.org/abs/2402.15896v1
- Date: Sat, 24 Feb 2024 20:15:31 GMT
- Title: Multimodal Instruction Tuning with Conditional Mixture of LoRA
- Authors: Ying Shen, Zhiyang Xu, Qifan Wang, Yu Cheng, Wenpeng Yin, Lifu Huang
- Abstract summary: This paper introduces a novel approach that integrates multimodal instruction tuning with Conditional Mixture-of-LoRA (MixLoRA).
It innovates upon LoRA by dynamically constructing low-rank adaptation matrices tailored to the unique demands of each input instance.
Experimental results on various multimodal evaluation datasets indicate that MixLoRA outperforms conventional LoRA at the same or even higher ranks.
- Score: 54.65520214291653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable
proficiency in diverse tasks across different domains, with an increasing focus
on improving their zero-shot generalization capabilities for unseen multimodal
tasks. Multimodal instruction tuning has emerged as a successful strategy for
achieving zero-shot generalization by fine-tuning pre-trained models on diverse
multimodal tasks through instructions. As MLLMs grow in complexity and size,
the need for parameter-efficient fine-tuning methods like Low-Rank Adaptation
(LoRA), which fine-tunes with a minimal set of parameters, becomes essential.
However, applying LoRA in multimodal instruction tuning presents the challenge
of task interference, which leads to performance degradation, especially when
dealing with a broad array of multimodal tasks. To address this, this paper
introduces a novel approach that integrates multimodal instruction tuning with
Conditional Mixture-of-LoRA (MixLoRA). It innovates upon LoRA by dynamically
constructing low-rank adaptation matrices tailored to the unique demands of
each input instance, aiming to mitigate task interference. Experimental results
on various multimodal evaluation datasets indicate that MixLoRA not only
outperforms conventional LoRA at the same or even higher ranks but also
demonstrates its efficacy and adaptability across diverse multimodal tasks.
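The abstract describes the core mechanism, constructing the low-rank adaptation matrices conditioned on each input instance, without giving implementation details. The sketch below is only a rough illustration of that idea, not the authors' architecture: the factor pool, the router, and the rank are assumptions made here for demonstration purposes.

```python
# Minimal sketch of an instance-conditioned mixture of low-rank factors.
# Illustrative only: the pool of factor "experts", the softmax router, and the
# chosen rank are assumptions, not the MixLoRA implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalMixtureLoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, num_experts=4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)        # frozen pretrained projection
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Pools of low-rank factors; one (A, B) pair per expert.
        self.A = nn.Parameter(torch.randn(num_experts, rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, out_dim, rank))
        # Instance-conditioned router over the factor pool (assumed design).
        self.router = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        # x: (batch, in_dim); the adaptation matrices are assembled per instance.
        gates = F.softmax(self.router(x), dim=-1)          # (batch, num_experts)
        A = torch.einsum("be,erd->brd", gates, self.A)     # (batch, rank, in_dim)
        B = torch.einsum("be,eor->bor", gates, self.B)     # (batch, out_dim, rank)
        delta = torch.einsum("bor,brd,bd->bo", B, A, x)    # per-instance low-rank update
        return self.base(x) + delta

# Usage example: adapt a width-512 projection with rank-8 factors.
layer = ConditionalMixtureLoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```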
Related papers
- MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models [4.978361907192563]
We introduce MeteoRA, a scalable multi-knowledge LoRA fusion framework designed for large language models (LLMs).
MeteoRA integrates various LoRA adapters in a Mixture-of-Experts (MoE) style into the base LLM, enabling the model to automatically select the most pertinent adapter based on the task input (a minimal sketch of this MoE-over-LoRA pattern appears after this list).
Our evaluations, featuring the LLaMA2-13B and LLaMA3-8B base models equipped with 28 off-the-shelf LoRA adapters through MeteoRA, demonstrate performance equivalent to the individual adapters.
arXiv Detail & Related papers (2024-05-19T20:46:07Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models [7.966452497550907]
We propose the Mixture-of-LoRAs (MoA) architecture for multi-task learning with large language models (LLMs).
Multiple domain-specific LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE).
Each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation.
arXiv Detail & Related papers (2024-03-06T03:33:48Z) - LLMBind: A Unified Modality-Task Integration Framework [38.95771765322677]
We introduce LLMBind, a novel framework designed to unify a diverse array of multi-modal tasks.
By harnessing a Mixture-of-Experts (MoE) Large Language Model (LLM), LLMBind processes multi-modal inputs and generates task-specific tokens, enabling the invocation of corresponding models to accomplish tasks.
arXiv Detail & Related papers (2024-02-22T12:36:31Z) - MultiLoRA: Democratizing LoRA for Better Multi-Task Learning [20.750808913757396]
LoRA achieves remarkable resource efficiency and comparable performance when adapting LLMs for specific tasks.
The weight update in LoRA is dominated by a small number of top singular vectors, while full fine-tuning decomposes into a set of less important unitary transforms.
We propose MultiLoRA for better multi-task adaptation by reducing the dominance of top singular vectors observed in LoRA.
arXiv Detail & Related papers (2023-11-20T02:59:18Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications [57.342772288710044]
We propose a novel parameter-efficient fine-tuning framework for multi-task medical applications, dubbed MOELoRA.
To unify MoE and LoRA, we devise multiple experts as the trainable parameters, where each expert consists of a pair of low-rank matrices to keep the number of trainable parameters small.
We conduct experiments on a multi-task medical dataset, which indicate that MOELoRA outperforms existing parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2023-10-21T17:18:09Z) - MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
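Several of the related papers above (MeteoRA, Mixture-of-LoRAs, MOELoRA) share a common pattern: a lightweight gate routes each input among multiple LoRA adapters attached to a frozen base layer. The following is a minimal, illustrative sketch of that pattern; the soft gating, adapter count, and layer names are assumptions rather than any single paper's exact design.

```python
# Rough sketch of the MoE-over-LoRA-adapters pattern: a small gate weights the
# outputs of several LoRA adapters on top of a frozen base projection.
# Illustrative assumptions: soft (dense) gating, 4 adapters, rank 8.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAAdapter(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)   # standard LoRA init: adapter starts as a no-op

    def forward(self, x):
        return self.B(self.A(x))

class MoELoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_adapters=4, rank=8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)   # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.adapters = nn.ModuleList(
            LoRAAdapter(in_dim, out_dim, rank) for _ in range(num_adapters)
        )
        self.gate = nn.Linear(in_dim, num_adapters)

    def forward(self, x):
        # Soft gating over adapters; hard top-k selection is a common alternative.
        weights = F.softmax(self.gate(x), dim=-1)                     # (batch, n)
        deltas = torch.stack([a(x) for a in self.adapters], dim=-1)   # (batch, out, n)
        delta = (deltas * weights.unsqueeze(1)).sum(dim=-1)
        return self.base(x) + delta

# Usage example.
layer = MoELoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```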
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.