Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
- URL: http://arxiv.org/abs/2311.02684v2
- Date: Wed, 13 Mar 2024 12:24:06 GMT
- Title: Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
- Authors: Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu,
Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
- Abstract summary: Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, conflicts and interference among them can degrade performance. We propose a novel framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with MLLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have demonstrated that Large Language Models (LLMs)
can extend their zero-shot generalization capabilities to multimodal learning
through instruction tuning. As more modalities and downstream tasks are
introduced, conflicts and interference among them can increasingly degrade
performance. While this phenomenon has been overlooked in previous work, we
propose a novel and extensible framework, called Octavius, for comprehensive
studies and experimentation on multimodal learning with Multimodal Large
Language Models (MLLMs). Specifically, we combine the well-known
Mixture-of-Experts (MoE) paradigm with one of the representative PEFT
techniques, i.e., LoRA, to design a novel LLM-based decoder, called LoRA-MoE,
for multimodal learning. To the best of our knowledge, this is one of the
pioneering efforts to introduce MoE into MLLMs to address this problem.
Experimental results (roughly a 20% improvement) demonstrate the effectiveness
and versatility of our design across various 2D and 3D downstream tasks. Code
and datasets are available at
https://openlamm.github.io/paper_list/Octavius.
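The core idea of a LoRA-MoE decoder layer — a shared, frozen base weight combined with several low-rank LoRA "experts" mixed by a learned gate — can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation; all dimensions, names, and the dense (non-sparse) routing are illustrative assumptions.

```python
import numpy as np

class LoRAMoELayer:
    """Sketch of a LoRA-MoE linear layer: one frozen base weight is shared,
    while several low-rank LoRA experts are mixed by a softmax gate.
    Hyperparameters and routing details are illustrative, not from the paper."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
        # Per-expert low-rank factors: delta_W_e = B_e @ A_e (a rank-r update)
        self.A = rng.standard_normal((num_experts, rank, d_in)) * 0.02
        self.B = np.zeros((num_experts, d_out, rank))        # zero-init, as in LoRA
        self.gate = rng.standard_normal((num_experts, d_in)) * 0.02

    def forward(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        logits = x @ self.gate.T                              # (batch, num_experts)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)            # softmax routing weights
        out = x @ self.W.T                                    # frozen base path
        # Add each expert's LoRA update, weighted by its gate probability
        for e in range(self.A.shape[0]):
            out = out + probs[:, e:e + 1] * (x @ self.A[e].T @ self.B[e].T)
        return out
```

Because the `B` factors are zero-initialized (standard LoRA practice), every expert's update is initially zero and the layer starts out exactly equal to the frozen base projection; training then specializes each expert toward a subset of modalities or tasks, which is the mechanism by which MoE routing mitigates task interference.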
Related papers
- MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models [57.091523832149655]
  We propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM.
  Our MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE).
  arXiv Detail & Related papers (2024-07-17T16:31:38Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
  We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
  Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
  We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
  arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
  Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning.
  Their extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
  This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
  arXiv Detail & Related papers (2024-05-17T12:37:10Z)
- Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [25.724995114710165]
  We investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha.
  Our Mipha-3B outperforms state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks.
  arXiv Detail & Related papers (2024-03-10T12:43:27Z)
- OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
  We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework.
  OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering, and reasoning.
  arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- A Survey on Multimodal Large Language Models [71.63375558033364]
  Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot.
  This paper aims to trace and summarize the recent progress of MLLMs.
  arXiv Detail & Related papers (2023-06-23T15:21:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.