Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis
- URL: http://arxiv.org/abs/2212.00678v1
- Date: Thu, 1 Dec 2022 17:31:42 GMT
- Title: Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis
- Authors: Odysseas S. Chlapanis, Georgios Paraskevopoulos, Alexandros Potamianos
- Abstract summary: We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
- Score: 84.12658971655253
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal learning pipelines have benefited from the success of pretrained
language models. However, this comes at the cost of increased model parameters.
In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based
architecture for multimodal tasks that uses a combination of adapter modules
and intermediate fusion layers. The adapter adjusts the pretrained language
model for the task at hand, while the fusion layers perform task-specific,
layer-wise fusion of audio-visual information with textual BERT
representations. During the adaptation process the pre-trained language model
parameters remain frozen, allowing for fast, parameter-efficient training. In
our ablations we see that this approach leads to efficient models that can
outperform their fine-tuned counterparts and are robust to input noise. Our
experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the
current state-of-the-art across metrics, with 3.4% relative reduction in the
resulting error and 2.1% relative improvement in 7-class classification
accuracy.
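The abstract describes the architecture at a high level: the pretrained BERT layers stay frozen, small adapter modules adjust the textual representations for the task, and intermediate fusion layers inject audio-visual information at every layer. A minimal PyTorch sketch of that pattern follows; the bottleneck adapter, the gated fusion, and all dimensions are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter; these are among the few trainable text-side parameters."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


class FusionLayer(nn.Module):
    """Task-specific fusion of audio-visual features with the textual hidden states."""
    def __init__(self, text_dim: int, av_dim: int):
        super().__init__()
        self.proj = nn.Linear(av_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_states, av_feats):
        av = self.proj(av_feats)                                   # (B, T, text_dim)
        g = torch.sigmoid(self.gate(torch.cat([text_states, av], dim=-1)))
        return text_states + g * av                                # gated residual fusion


class AMBLayer(nn.Module):
    """One frozen BERT layer wrapped with a trainable adapter and fusion block."""
    def __init__(self, bert_layer: nn.Module, text_dim: int = 768, av_dim: int = 128):
        super().__init__()
        # bert_layer is expected to behave like a HuggingFace BertLayer (returns a tuple)
        self.bert_layer = bert_layer
        for p in self.bert_layer.parameters():                     # pretrained weights stay frozen
            p.requires_grad = False
        self.adapter = Adapter(text_dim)
        self.fusion = FusionLayer(text_dim, av_dim)

    def forward(self, hidden, av_feats, attention_mask=None):
        hidden = self.bert_layer(hidden, attention_mask=attention_mask)[0]
        hidden = self.adapter(hidden)                              # task adaptation
        return self.fusion(hidden, av_feats)                       # layer-wise multimodal fusion
```

Only the adapter and fusion parameters receive gradients, which is what makes the adaptation fast and parameter-efficient compared with full fine-tuning.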
Related papers
- Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models [22.977852629450346]
We propose a method that combines two popular research areas by injecting linguistic structures into pre-trained language models.
In our approach, parallel adapter modules encoding different linguistic structures are combined using a novel Mixture-of-Linguistic-Experts architecture.
Our experiment results show that our approach can outperform state-of-the-art PEFT methods with a comparable number of parameters.
arXiv Detail & Related papers (2023-10-24T23:29:06Z)
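The Mixture-of-Linguistic-Experts entry above combines parallel adapter modules with a learned mixture. A rough PyTorch sketch of that idea follows; the per-token softmax gate and the bottleneck adapter shape are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Parallel adapter "experts" (e.g., one per linguistic structure) mixed by a learned gate."""
    def __init__(self, dim: int, n_experts: int = 3, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                                           # x: (B, T, D)
        weights = torch.softmax(self.gate(x), dim=-1)               # (B, T, E) mixture weights
        expert_out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, D, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(-1)        # weighted combination
        return x + mixed                                            # residual, as in standard adapters
```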
- Mixture-of-Expert Conformer for Streaming Multilingual ASR [33.14594179710925]
We propose a streaming truly multilingual Conformer incorporating mixture-of-expert layers.
The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases.
We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline.
arXiv Detail & Related papers (2023-05-25T02:16:32Z)
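The Conformer entry above keeps inference cost fixed by routing each input to a constant number of experts regardless of how many experts exist. A hedged sketch of such top-k routing is below; the router form, feed-forward experts, and k=2 are illustrative choices rather than the paper's design.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts: only k experts run per token,
    so compute stays roughly constant as the expert count grows."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2, hidden: int = 1024):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                          # x: (tokens, dim)
        scores = self.router(x)                                    # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)               # renormalise over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```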
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
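The model-merging entry above fuses separately trained models into one and uses the distance between the weights to be merged as a predictor of the merging outcome. Below is a small sketch in that spirit: simple parameter interpolation plus a flattened L2 weight distance. The paper's own two metrics are not spelled out in this summary, so the distance here is only a stand-in.

```python
import torch

def interpolate_weights(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Merge two state dicts with identical keys and shapes by linear interpolation."""
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}

def weight_distance(state_a: dict, state_b: dict) -> float:
    """L2 distance between flattened parameters -- a stand-in indicator of merging difficulty."""
    diffs = [(state_a[k] - state_b[k]).flatten() for k in state_a]
    return torch.cat(diffs).norm().item()
```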
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptations of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
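The eP-ALM entry above is concrete about what is trained: a single linear projection from a frozen perceptual encoder and a single trainable token prepended to the language model's input, with everything else frozen. A minimal sketch of those trainable pieces follows; how the projected visual tokens are actually injected into the model is an assumption here, not taken from the paper.

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """The only trainable pieces in an eP-ALM-style setup: one linear projection
    and one learnable soft token; the backbone encoders stay frozen."""
    def __init__(self, vis_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                      # the single trained projection
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))   # the single trainable token

    def forward(self, vis_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N, vis_dim) from a frozen perceptual encoder
        # text_embeds: (B, T, lm_dim) from the frozen language model's embedding layer
        prefix = self.soft_token.expand(text_embeds.size(0), -1, -1)
        return torch.cat([prefix, self.proj(vis_feats), text_embeds], dim=1)
```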
- Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
arXiv Detail & Related papers (2020-10-13T02:53:52Z)
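The BERT-EMD entry above distills a compressed student whose layers can learn from several teacher layers at once, with the mapping derived from an Earth Mover's Distance formulation. The sketch below keeps the many-to-many shape but swaps the exact EMD solver for a simple softmax soft assignment over a pairwise cost matrix, so it is an illustrative stand-in rather than the paper's algorithm; it also assumes the student and teacher hidden sizes match.

```python
import torch
import torch.nn.functional as F

def many_to_many_distill_loss(student_hiddens, teacher_hiddens, temperature: float = 1.0):
    """student_hiddens / teacher_hiddens: lists of (B, T, D) hidden states, one per layer.
    Each student layer learns from every teacher layer, weighted by a soft assignment
    computed from the pairwise MSE costs (a simplified stand-in for the EMD mapping)."""
    costs = torch.stack([
        torch.stack([F.mse_loss(s, t) for t in teacher_hiddens])
        for s in student_hiddens
    ])                                                         # (n_student, n_teacher)
    weights = F.softmax(-costs.detach() / temperature, dim=1)  # mapping weights, no gradient through them
    return (weights * costs).sum() / len(student_hiddens)
```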
This list is automatically generated from the titles and abstracts of the papers on this site.