MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
- URL: http://arxiv.org/abs/2403.10568v2
- Date: Wed, 11 Sep 2024 09:19:43 GMT
- Title: MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
- Authors: Ruixiang Jiang, Lingbo Liu, Changwen Chen
- Abstract summary: We introduce the mixture of prompt experts (MoPE) technique to enhance the expressiveness of prompt tuning.
Our method achieves state-of-the-art results for prompt fusion, matching or even surpassing the performance of fine-tuning.
- Score: 29.46189153751869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the demonstrated parameter efficiency of prompt-based multimodal fusion methods, their limited adaptivity and expressiveness often result in suboptimal performance compared to other tuning approaches. In this paper, we address these limitations by decomposing the vanilla prompts to adaptively capture instance-level features. Building upon this decomposition, we introduce the mixture of prompt experts (MoPE) technique to enhance the expressiveness of prompt tuning. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based fusion method exhibits greater expressiveness, scaling more effectively with the training data and the overall number of trainable parameters. We also investigate regularization terms for expert routing, which lead to emergent expert specialization during training, paving the way for interpretable soft prompting. Extensive experiments across six multimodal datasets spanning four modalities demonstrate that our method achieves state-of-the-art results for prompt fusion, matching or even surpassing the performance of fine-tuning while requiring only 0.8% of the trainable parameters. Code will be released: https://github.com/songrise/MoPE.
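The abstract describes routing a bank of prompt experts on a per-instance basis, conditioned on the paired modality. Below is a minimal, hedged sketch of that idea; the class and parameter names (MoPERouter, num_experts, pair_dim) and the soft combination of experts are illustrative assumptions, not the authors' released implementation (see the linked repository for their code).

```python
# Minimal sketch of MoPE-style per-instance prompt routing (illustration only;
# names and the soft routing below are assumptions, not the authors' code --
# see https://github.com/songrise/MoPE for the release).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoPERouter(nn.Module):
    """Routes each instance to a mixture of learnable prompt experts,
    conditioned on features from the paired (complementary) modality."""

    def __init__(self, num_experts: int = 4, prompt_len: int = 6,
                 embed_dim: int = 768, pair_dim: int = 512):
        super().__init__()
        # Bank of learnable prompt experts: (K, L, D)
        self.experts = nn.Parameter(0.02 * torch.randn(num_experts, prompt_len, embed_dim))
        # A shared static prompt capturing global-level features (not routed)
        self.static_prompt = nn.Parameter(0.02 * torch.randn(prompt_len, embed_dim))
        # Router maps the paired modality's pooled feature to expert logits
        self.router = nn.Linear(pair_dim, num_experts)

    def forward(self, tokens: torch.Tensor, pair_feat: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N, D) token embeddings of the modality being prompted
        # pair_feat: (B, pair_dim) pooled feature of the other modality (the pairing prior)
        B = tokens.size(0)
        weights = F.softmax(self.router(pair_feat), dim=-1)           # (B, K) routing weights
        # A regularizer on `weights` (e.g. an importance/load-balancing term, as the
        # abstract suggests) could be added to the loss to encourage expert specialization.
        dynamic = torch.einsum("bk,kld->bld", weights, self.experts)  # (B, L, D) instance prompt
        static = self.static_prompt.unsqueeze(0).expand(B, -1, -1)    # (B, L, D)
        # Prepend [static | dynamic] prompts; feed the result to a frozen encoder.
        return torch.cat([static, dynamic, tokens], dim=1)

# Example: prompted = MoPERouter()(vision_tokens, text_feature)
```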
Related papers
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z) - SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings [0.7349727826230863]
Soft prompt tuning techniques have gained traction as an effective strategy for the parameter-efficient tuning of pretrained language models.
We introduce SuperPos-Prompt, a new reparameterization technique employing the superposition of multiple pretrained vocabulary embeddings to improve the learning of soft prompts.
Our experiments consistently highlight SuperPos-Prompt's superiority over Residual Prompt tuning, exhibiting an average score increase of $+6.4$ in T5-Small and $+5.0$ in T5-Base.
Remarkably, SuperPos-Prompt occasionally outperforms even full fine-tuning methods.
arXiv Detail & Related papers (2024-06-07T22:18:49Z) - On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning? [13.803180972839213]
We introduce a robust MeanShift for Test-time Augmentation (MTA).
MTA surpasses prompt-based methods without requiring their intensive training procedure.
We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency.
arXiv Detail & Related papers (2024-05-03T17:34:02Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - Conditional Prompt Tuning for Multimodal Fusion [33.11221356852871]
We show that the representation of one modality can effectively guide the prompting of another modality for parameter-efficient multimodal fusion.
This is achieved by disentangling the vanilla prompt vectors into three types of specialized prompts that adaptively capture global-level and instance-level features.
Our method can effectively transfer the pretrained knowledge in unimodal encoders for downstream multimodal tasks.
arXiv Detail & Related papers (2023-11-28T11:05:20Z) - Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning [64.638804236566]
We propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup.
Remarkably, on the GLUE benchmark, UniPELT consistently achieves 1-4% gains compared to the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups (a minimal gating sketch follows this entry).
arXiv Detail & Related papers (2021-10-14T17:40:08Z)
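The UniPELT summary above describes learning to activate different PELT submodules depending on the data or task. Below is a minimal sketch of one way such gating can be wired; the module names and the sigmoid-gated residual combination are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of UniPELT-style gating: learn data-dependent gates that scale the
# contribution of each PELT submodule (names and the gated-residual wiring here are
# illustrative assumptions, not the paper's implementation).
import torch
import torch.nn as nn

class GatedPELTBlock(nn.Module):
    def __init__(self, hidden_dim: int, submodules: dict[str, nn.Module]):
        super().__init__()
        # e.g. {"adapter": ..., "lora": ..., "prefix": ...}, each mapping (B, N, D) -> (B, N, D)
        self.submodules = nn.ModuleDict(submodules)
        # One gate per submodule, computed from the layer input
        self.gates = nn.ModuleDict({k: nn.Linear(hidden_dim, 1) for k in submodules})

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        out = hidden
        for name, module in self.submodules.items():
            gate = torch.sigmoid(self.gates[name](hidden))  # (B, N, 1) learned activation strength
            out = out + gate * module(hidden)               # gated residual contribution
        return out

# Example with stand-in submodules (real adapter/LoRA/prefix modules would go here):
# block = GatedPELTBlock(768, {"adapter": nn.Linear(768, 768), "lora": nn.Linear(768, 768)})
```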