MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion
- URL: http://arxiv.org/abs/2403.10568v3
- Date: Tue, 14 Jan 2025 08:01:17 GMT
- Title: MoPE: Mixture of Prompt Experts for Parameter-Efficient and Scalable Multimodal Fusion
- Authors: Ruixiang Jiang, Lingbo Liu, Changwen Chen
- Abstract summary: Mixture of Prompt Experts (MoPE) is the first technique designed to overcome the limitations of prompt-based fusion by decomposing standard prompts to capture instance-level features adaptively.
Our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters.
- Score: 29.46189153751869
- Abstract: Despite the demonstrated parameter efficiency of prompt-based multimodal fusion methods, their limited adaptivity and expressiveness often result in suboptimal performance compared to other tuning approaches. In this paper, we introduce the Mixture of Prompt Experts (MoPE), the first technique designed to overcome these limitations by decomposing standard prompts to capture instance-level features adaptively. Building on this decomposition, MoPE enhances prompt fusion's expressiveness by leveraging multimodal pairing priors to route the most effective prompt for each instance dynamically. Compared to vanilla prompting, our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters. We also investigate regularization terms for expert routing, which lead to emergent expert specialization with enhanced adaptiveness and interpretability. Extensive experiments across six multimodal datasets spanning four modalities demonstrate state-of-the-art performance for prompt fusion, matching or even surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Project homepage: https://github.com/songrise/MoPE
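The core mechanism described in the abstract, decomposing a prompt into a shared static part plus an instance-adaptive part that is routed over a pool of prompt experts using features from the paired modality, can be illustrated with a minimal PyTorch sketch. The class and parameter names (`MixtureOfPromptExperts`, `prompt_len`, `n_experts`) and the pooled-feature softmax router are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of instance-adaptive prompt routing in the spirit of MoPE.
# All names and the exact gating are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfPromptExperts(nn.Module):
    """Mixes learnable prompt experts per instance, conditioned on the
    pooled representation of the paired (complementary) modality."""

    def __init__(self, d_model: int, prompt_len: int = 4, n_experts: int = 4):
        super().__init__()
        # Static prompt shared by all instances (the non-adaptive part).
        self.static_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        # Pool of prompt experts; the router picks a per-instance mixture.
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, d_model) * 0.02)
        # Router conditioned on the paired modality's pooled features.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, paired_feat: torch.Tensor) -> torch.Tensor:
        # paired_feat: (batch, d_model) pooled features from the other modality.
        gate = F.softmax(self.router(paired_feat), dim=-1)         # (B, E)
        dynamic = torch.einsum("be,eld->bld", gate, self.experts)  # (B, L, D)
        return self.static_prompt.unsqueeze(0) + dynamic           # (B, L, D)


# Usage: prompts for a frozen image encoder, routed by pooled text features.
mope = MixtureOfPromptExperts(d_model=768, prompt_len=4, n_experts=4)
text_feat = torch.randn(8, 768)           # pooled text features (batch of 8)
image_tokens = torch.randn(8, 196, 768)   # patch tokens from a frozen ViT
tokens = torch.cat([mope(text_feat), image_tokens], dim=1)  # (8, 200, 768)
```

In this sketch only the prompts and the router are trainable, which is how a prompt-fusion method keeps the trainable-parameter count small relative to the frozen encoders.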
Related papers
- EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities [20.991711160707755]
Missing modalities are a common challenge in real-world multimodal learning scenarios, occurring during both training and testing.
Existing methods for managing missing modalities often require the design of separate prompts for each modality or missing case.
We propose Evidence-based Parameter-efficient Prompting (EPE-P), a novel and parameter-efficient method for pretrained multimodal networks.
arXiv Detail & Related papers (2024-12-23T16:01:12Z) - M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings [0.7349727826230863]
Soft prompt tuning techniques have gained traction as an effective strategy for the parameter-efficient tuning of pretrained language models.
We introduce SuperPos-Prompt, a new reparameterization technique employing the superposition of multiple pretrained vocabulary embeddings to improve the learning of soft prompts.
Our experiments consistently highlight SuperPos-Prompt's superiority over Residual Prompt tuning, exhibiting an average score increase of $+6.4$ in T5-Small and $+5.0$ in T5-Base.
Remarkably, SuperPos-Prompt occasionally outperforms even full fine-tuning methods.
arXiv Detail & Related papers (2024-06-07T22:18:49Z) - Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [33.834215393960605]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate (see the gating sketch after this list).
Our results demonstrate that our approach achieves competitive performance compared to GMoE for vision and language tasks, and to MoE-LLaVA for vision-language tasks.
arXiv Detail & Related papers (2024-05-23T08:18:30Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution with its sparse architecture for effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Conditional Prompt Tuning for Multimodal Fusion [33.11221356852871]
We show that the representation of one modality can effectively guide the prompting of another modality for parameter-efficient multimodal fusion.
This is achieved by disentangling the vanilla prompt vectors into three types of specialized prompts that adaptively capture global-level and instance-level features.
Our method can effectively transfer the pretrained knowledge in unimodal encoders for downstream multimodal tasks.
arXiv Detail & Related papers (2023-11-28T11:05:20Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
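For the DynMoE entry above, the following is a minimal sketch of per-token adaptive expert activation, assuming a threshold-style rule in which a token activates every expert whose gate score clears a learned per-expert threshold. The class name, the sigmoid scoring, and the fallback to the top-1 expert are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of per-token adaptive expert counts, in the spirit of DynMoE.
# Threshold-based gating and all names here are assumptions for illustration.
import torch
import torch.nn as nn


class AdaptiveTopAnyGate(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.scores = nn.Linear(d_model, n_experts)
        # One trainable threshold per expert; a token activates every expert
        # whose score exceeds its threshold, so the number of active experts
        # varies token by token instead of being a fixed top-k.
        self.thresholds = nn.Parameter(torch.zeros(n_experts))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, d_model)
        s = torch.sigmoid(self.scores(tokens))                       # (B, S, E)
        active = (s > torch.sigmoid(self.thresholds)).float()        # (B, S, E)
        # Fall back to the single best expert if nothing clears its threshold.
        top1 = torch.zeros_like(s).scatter_(-1, s.argmax(-1, keepdim=True), 1.0)
        active = torch.where(active.sum(-1, keepdim=True) > 0, active, top1)
        # Normalize the scores of the active experts into mixing weights.
        weights = active * s / (active * s).sum(-1, keepdim=True).clamp_min(1e-9)
        return weights  # per-token mixing weights over experts


gate = AdaptiveTopAnyGate(d_model=768, n_experts=8)
w = gate(torch.randn(2, 16, 768))  # (2, 16, 8); each token's weights sum to 1
```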