Efficient Multimodal Fusion via Interactive Prompting
- URL: http://arxiv.org/abs/2304.06306v2
- Date: Mon, 15 May 2023 04:42:33 GMT
- Title: Efficient Multimodal Fusion via Interactive Prompting
- Authors: Yaowei Li, Ruijie Quan, Linchao Zhu, Yi Yang
- Abstract summary: Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
- Score: 62.08292938484994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-training has brought unimodal fields such as computer vision
and natural language processing to a new era. Following this trend, the size of
multi-modal learning models constantly increases, leading to an urgent need to
reduce the massive computational cost of finetuning these models for downstream
tasks. In this paper, we propose an efficient and flexible multimodal fusion
method, namely PMF, tailored for fusing unimodally pre-trained transformers.
Specifically, we first present a modular multimodal fusion framework that
exhibits high flexibility and facilitates mutual interactions among different
modalities. In addition, we disentangle vanilla prompts into three types in
order to learn different optimization objectives for multimodal learning. It is
also worth noting that we propose to add prompt vectors only on the deep layers
of the unimodal transformers, thus significantly reducing the training memory
usage. Experimental results show that our proposed method achieves comparable
performance to several other multimodal finetuning methods with less than 3%
trainable parameters and up to 66% saving of training memory usage.
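The parameter and memory savings described in the abstract can be made concrete with a back-of-the-envelope sketch. The Python below is an illustration only, not the authors' implementation: the layer count, fusion start layer, prompt length, hidden width, and backbone size are all assumed values, and only the prompt vectors are counted as trainable (any additional trainable modules, such as a task head, are ignored).

```python
from dataclasses import dataclass


@dataclass
class FusionConfig:
    """Hypothetical settings for two frozen unimodal transformers."""
    num_layers: int = 12                # depth of each encoder (assumed)
    fusion_start: int = 10              # first 0-indexed layer that gets prompts (assumed)
    prompt_len: int = 4                 # prompt vectors per prompted layer (assumed)
    hidden_dim: int = 768               # token embedding width (assumed)
    backbone_params: int = 86_000_000   # frozen parameters per encoder (rough, assumed)


def prompted_layers(cfg: FusionConfig) -> list[int]:
    """Indices of the deep layers that receive trainable prompts."""
    return list(range(cfg.fusion_start, cfg.num_layers))


def trainable_prompt_params(cfg: FusionConfig, num_modalities: int = 2) -> int:
    """Count prompt parameters only; the backbones stay frozen."""
    per_layer = cfg.prompt_len * cfg.hidden_dim
    return num_modalities * len(prompted_layers(cfg)) * per_layer


def trainable_fraction(cfg: FusionConfig, num_modalities: int = 2) -> float:
    """Share of all parameters that are updated during finetuning."""
    trainable = trainable_prompt_params(cfg, num_modalities)
    frozen = num_modalities * cfg.backbone_params
    return trainable / (trainable + frozen)


def layers_with_stored_activations(cfg: FusionConfig) -> int:
    """Layers whose activations must be kept for the backward pass.

    Gradients only need to reach the first prompted layer, so earlier
    layers can run without saving intermediate activations.
    """
    return cfg.num_layers - cfg.fusion_start


if __name__ == "__main__":
    cfg = FusionConfig()
    print("prompted layers:", prompted_layers(cfg))
    print("trainable prompt parameters:", trainable_prompt_params(cfg))
    print("trainable fraction:", trainable_fraction(cfg))
    print("layers storing activations:", layers_with_stored_activations(cfg))
```

Because the backbones are frozen and the earlier layers hold no trainable parameters, backpropagation can stop at the first prompted layer. Under these assumed settings, only the last two layers of each encoder need to retain activations, which is the intuition behind the memory saving the abstract reports.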
Related papers
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment [0.0]
"Harmonized Transfer Learning and Modality alignment (HarMA)" is a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment.
HarMA achieves state-of-the-art performance in two popular multimodal retrieval tasks in the field of remote sensing.
arXiv Detail & Related papers (2024-04-28T17:20:08Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z)
- Modular and Parameter-Efficient Multimodal Fusion with Prompting [4.2854066077037265]
Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings.
Our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
arXiv Detail & Related papers (2022-03-15T16:50:15Z)
- Sparse Fusion for Multimodal Transformers [7.98117428941095]
We present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers.
Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling.
State-of-the-art performance is obtained on multiple benchmarks under similar experiment conditions, while reporting up to six-fold reduction in computational cost and memory requirements.
arXiv Detail & Related papers (2021-11-23T16:43:49Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)