Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
- URL: http://arxiv.org/abs/2311.02684v2
- Date: Wed, 13 Mar 2024 12:24:06 GMT
- Title: Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
- Authors: Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu,
Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
- Abstract summary: Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning.
As more modalities and downstream tasks are introduced, conflicts and interference among them can degrade performance.
We propose a novel framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with MLLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have demonstrated that Large Language Models (LLMs)
can extend their zero-shot generalization capabilities to multimodal learning
through instruction tuning. As more modalities and downstream tasks are
introduced, however, task conflicts and interference can increasingly degrade
performance. This phenomenon has been largely overlooked in previous work, so
we propose a novel and extensible framework, called Octavius, for comprehensive
studies and experimentation on multimodal learning with Multimodal Large
Language Models (MLLMs). Specifically, we combine the well-known
Mixture-of-Experts (MoE) technique with LoRA, a representative
parameter-efficient fine-tuning (PEFT) method, to design a novel LLM-based
decoder, called LoRA-MoE, for multimodal learning. To the best of our
knowledge, this is one of the pioneering efforts to introduce MoE into MLLMs to
address this problem. Experimental results (an improvement of about 20%)
demonstrate the effectiveness and versatility of our design across various 2D
and 3D downstream tasks. Code and datasets are available at
https://openlamm.github.io/paper_list/Octavius.
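The abstract describes the LoRA-MoE decoder only at a high level. Below is a minimal sketch, assuming a frozen linear projection augmented with several LoRA experts and an instance-level gate that routes each sample to a sparse mix of experts; all module names, shapes, and hyperparameters here are illustrative assumptions, not the official Octavius implementation.

```python
# Illustrative sketch of a LoRA-MoE layer (not the official Octavius code).
# A frozen base projection is augmented with several LoRA experts; a learned
# gate routes each instance to its top-k experts, so different tasks and
# modalities can rely on different low-rank adapters and interfere less.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=4, rank=8, top_k=2, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)        # pretrained projection, kept frozen
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(num_experts, in_dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, out_dim))
        self.gate = nn.Linear(in_dim, num_experts)    # instance-level router
        self.top_k = top_k
        self.scaling = alpha / rank

    def forward(self, x):                             # x: (batch, seq, in_dim)
        y = self.base(x)
        # Route on a pooled representation so all tokens of one instance share experts.
        gate_weights = F.softmax(self.gate(x.mean(dim=1)), dim=-1)   # (batch, num_experts)
        topk_w, topk_idx = gate_weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                   # selected expert id per instance
            a = self.lora_a[idx]                      # (batch, in_dim, rank)
            b = self.lora_b[idx]                      # (batch, rank, out_dim)
            delta = torch.bmm(torch.bmm(x, a), b) * self.scaling
            y = y + topk_w[:, slot].view(-1, 1, 1) * delta
        return y
```

In a full model, a layer like this would replace selected projections in the LLM decoder, with only the LoRA experts and the gate trained during instruction tuning; the actual routing granularity, expert count, and placement used by Octavius may differ.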
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models (2024-10-21)
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM (a generic sketch of such a distillation loss is given after this list).
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
- MoExtend: Tuning New Experts for Modality and Task Extension (2024-08-07)
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune the pretrained models.
- MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models (2024-07-17)
We propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM.
MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE).
- Efficient Multimodal Large Language Models: A Survey (2024-05-17)
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning.
However, their extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
- The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models (2024-01-22)
Multi-modal large language models (MLLMs) integrate verbal and visual information.
Despite the revolutionary prospect of MLLMs, our understanding of their reasoning abilities is limited.
In this study, we assess the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs.
- A Survey on Multimodal Large Language Models (2023-06-23)
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new and rapidly rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
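The MDist objective mentioned in the LLaVA-KD entry above minimizes the divergence between teacher and student output distributions. For reference, a generic teacher-student distillation loss of this kind can be written as follows; this is a sketch only, not the exact LLaVA-KD formulation, and the temperature and reduction choices are assumptions.

```python
# Generic KL-divergence distillation loss over output token distributions
# (a sketch of divergence-based distillation, not LLaVA-KD's exact objective).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, rescaled by t^2 as is standard practice for distillation
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)
```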