Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning
- URL: http://arxiv.org/abs/2503.20633v1
- Date: Wed, 26 Mar 2025 15:26:18 GMT
- Title: Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning
- Authors: Sashuai Zhou, Hai Huang, Yan Xia,
- Abstract summary: Multi-modal models excel in cross-modal tasks but are computationally expensive due to their billions of parameters. Existing methods primarily focus on uni-modal processing, overlooking the critical modal fusion needed for multi-modal tasks. We propose heterogeneous mixture-of-experts adapters that extend the traditional PEFT framework to support multi-modal expert combinations and improve information interaction.
- Score: 3.8984478257737734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal models excel in cross-modal tasks but are computationally expensive due to their billions of parameters. Parameter-efficient fine-tuning (PEFT) offers a solution by adding small trainable components while freezing pre-trained parameters. However, existing methods primarily focus on uni-modal processing, overlooking the critical modal fusion needed for multi-modal tasks. To fill this gap, we propose heterogeneous mixture of experts adapters that extend the traditional PEFT framework to support multi-modal expert combinations and improve information interaction. Additionally, our approach modifies the affine linear expert design to enable efficient modal fusion in a low-rank space, achieving competitive performance with only 5-8% of the parameters fine-tuned. Experiments across eight downstream tasks, including visual-audio and text-visual, demonstrate the superior performance of the approach.
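The abstract does not come with code; the following is a minimal PyTorch sketch of the general idea it describes: a frozen backbone augmented with a small mixture of low-rank affine experts whose router mixes tokens from two modalities. All module names, dimensions, and the soft-routing design are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a heterogeneous MoE adapter that
# fuses two modalities in a low-rank space while the backbone stays frozen.
# All names, shapes, and the router design are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankExpert(nn.Module):
    """Affine low-rank expert: x -> x + B(A(x)), with rank r << d."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # A: d -> r
        self.up = nn.Linear(rank, d_model, bias=False)     # B: r -> d
        nn.init.zeros_(self.up.weight)                      # starts as identity map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


class HeterogeneousMoEAdapter(nn.Module):
    """Soft-routes each token over modality-specific and shared (fusion) experts."""

    def __init__(self, d_model: int = 768, rank: int = 8, n_shared: int = 2):
        super().__init__()
        self.expert_a = LowRankExpert(d_model, rank)        # e.g. visual expert
        self.expert_b = LowRankExpert(d_model, rank)        # e.g. audio/text expert
        self.shared = nn.ModuleList(
            [LowRankExpert(d_model, rank) for _ in range(n_shared)]
        )
        self.router = nn.Linear(d_model, 2 + n_shared)      # soft gating over experts

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        x = torch.cat([tokens_a, tokens_b], dim=1)          # (B, Na+Nb, d)
        gates = F.softmax(self.router(x), dim=-1)           # (B, N, E)
        outs = torch.stack(
            [self.expert_a(x), self.expert_b(x)] + [e(x) for e in self.shared],
            dim=-1,
        )                                                    # (B, N, d, E)
        fused = (outs * gates.unsqueeze(2)).sum(dim=-1)      # gate-weighted mixture
        na = tokens_a.shape[1]
        return fused[:, :na], fused[:, na:]                  # per-modality outputs
```

In a PEFT setting, such adapters would be inserted after attention or FFN blocks of the frozen pre-trained encoders, and only the expert and router parameters would be trained.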
Related papers
- CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation [6.013740443562439]
Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities.
MFMs' application in sequential recommendation remains largely unexplored.
It remains unclear whether we can efficiently adapt multiple (>2) MFMs for the sequential recommendation task.
We propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN)
arXiv Detail & Related papers (2025-04-14T15:14:59Z) - Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations [12.79899622986449]
PTMRec is a novel framework that bridges the domain gap between pre-trained models and recommendation systems.
This framework not only eliminates the need for costly additional pre-training but also flexibly accommodates various parameter-efficient tuning methods.
arXiv Detail & Related papers (2025-02-21T15:50:14Z) - M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and 2.15% overall accuracy improvement across 14 public datasets.
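As a rough illustration of the rank-1-expert idea referenced above (not the Intuition-MoR1E implementation), each expert can be a single outer-product update that a router mixes per token; the sparse top-k routing and all names below are assumptions.

```python
# Rough sketch of a mixture of rank-1 experts (illustrative only): each expert
# is a single outer product u_i v_i^T, and a sparse router mixes the experts'
# low-rank updates per token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfRank1Experts(nn.Module):
    def __init__(self, d_model: int = 768, n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.u = nn.Parameter(torch.zeros(n_experts, d_model))          # output directions (start at zero update)
        self.v = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)   # input directions
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:                 # x: (B, N, d)
        logits = self.router(x)                                          # (B, N, E)
        topv, topi = logits.topk(self.top_k, dim=-1)                     # keep top-k experts per token
        gates = torch.zeros_like(logits).scatter(-1, topi, F.softmax(topv, -1))
        proj = x @ self.v.t()                                            # (B, N, E): v_i . x
        delta = (gates * proj) @ self.u                                  # (B, N, d): sum_i g_i (v_i . x) u_i
        return x + delta                                                 # residual rank-1 updates
```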
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion [29.46189153751869]
Mixture of Prompt Experts (MoPE) is the first technique designed to overcome these limitations by decomposing standard prompts. Our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters.
arXiv Detail & Related papers (2024-03-14T17:47:10Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Parameter Efficient Multi-task Model Fusion with Partial Linearization [97.23530944186078]
We propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques.
Our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters.
We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model.
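As a simplified illustration of task arithmetic over adapter weights (the paper's partial linearization of the adapters is omitted here, and the function names are assumptions), per-task adapter deltas from a shared initialization can be merged by a weighted sum:

```python
# Simplified illustration of task arithmetic over adapter weights (the partial
# linearization step is omitted): task vectors are the adapter deltas from a
# shared initialization, merged by a weighted sum.
import torch


def task_vector(finetuned: dict, init: dict) -> dict:
    """Adapter delta for one task: theta_task - theta_init (per tensor)."""
    return {k: finetuned[k] - init[k] for k in init}


def merge_adapters(init: dict, task_vectors: list, coeffs: list) -> dict:
    """theta_merged = theta_init + sum_i lambda_i * tau_i."""
    merged = {k: v.clone() for k, v in init.items()}
    for tau, lam in zip(task_vectors, coeffs):
        for k in merged:
            merged[k] += lam * tau[k]
    return merged


if __name__ == "__main__":
    # Toy check with random "adapter" tensors standing in for two fine-tuned tasks.
    init = {"down.weight": torch.zeros(8, 16), "up.weight": torch.zeros(16, 8)}
    task_a = {k: v + torch.randn_like(v) for k, v in init.items()}
    task_b = {k: v + torch.randn_like(v) for k, v in init.items()}
    taus = [task_vector(task_a, init), task_vector(task_b, init)]
    fused = merge_adapters(init, taus, coeffs=[0.5, 0.5])
    print({k: tuple(t.shape) for k, t in fused.items()})
```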
arXiv Detail & Related papers (2023-10-07T08:55:54Z) - MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval [4.4173427917548524]
MultiWay-Adapter (MWA) is a novel framework featuring an 'Alignment Enhancer'
This enhancer deepens inter-modal alignment, enabling high transferability with minimal tuning effort.
Experiments show that, unlike prior efficient tuning approaches, MWA maintains model effectiveness while reducing training time by up to 57%.
arXiv Detail & Related papers (2023-09-04T10:48:29Z) - Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z) - Modular and Parameter-Efficient Multimodal Fusion with Prompting [4.2854066077037265]
Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings.
Our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
arXiv Detail & Related papers (2022-03-15T16:50:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.