Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For
Multimodal Emotion Recognition
- URL: http://arxiv.org/abs/2302.13661v1
- Date: Mon, 27 Feb 2023 10:59:08 GMT
- Title: Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For
Multimodal Emotion Recognition
- Authors: Dekai Sun, Yancheng He, Jiqing Han
- Abstract summary: We propose to use pretrained models as the upstream networks, wav2vec 2.0 for the audio modality and BERT for the text modality.
To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as a downstream fusion module.
We achieve better performance, with 78.42% Weighted Accuracy (WA) and 79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset.
- Score: 24.115771176570824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack of data and the difficulty of multimodal fusion have always been
challenges for multimodal emotion recognition (MER). In this paper, we propose
to use pretrained models as the upstream networks, wav2vec 2.0 for the audio modality
and BERT for the text modality, and fine-tune them on the downstream MER task to cope
with the lack of data. To address the difficulty of multimodal fusion, we use a
K-layer multi-head attention mechanism as a downstream fusion module. Starting
from the MER task itself, we design two auxiliary tasks to alleviate the
insufficient fusion between modalities and to guide the network to capture and
align emotion-related features. Compared to previous state-of-the-art
models, we achieve better performance, with 78.42% Weighted Accuracy (WA) and
79.71% Unweighted Accuracy (UA) on the IEMOCAP dataset.
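The architecture sketched in the abstract (pretrained unimodal encoders, a K-layer attention-based fusion module, and auxiliary heads) can be outlined in a few lines of PyTorch. The sketch below is a minimal, hypothetical reading, assuming Hugging Face checkpoints, standard Transformer encoder layers standing in for the K-layer multi-head attention fusion module, and placeholder unimodal heads for the two auxiliary tasks (which the abstract does not detail); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): wav2vec 2.0 and BERT as upstream
# encoders, K stacked attention layers as the downstream fusion module, and
# a main emotion head plus two placeholder auxiliary heads.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel


class AttentionFusionMER(nn.Module):
    def __init__(self, k_layers: int = 4, d_model: int = 768, n_heads: int = 8,
                 n_emotions: int = 4):
        super().__init__()
        self.audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_enc = BertModel.from_pretrained("bert-base-uncased")
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=k_layers)
        self.cls = nn.Linear(d_model, n_emotions)        # main MER head
        self.aux_audio = nn.Linear(d_model, n_emotions)  # auxiliary head (assumed)
        self.aux_text = nn.Linear(d_model, n_emotions)   # auxiliary head (assumed)

    def forward(self, waveform, input_ids, attention_mask):
        a = self.audio_enc(waveform).last_hidden_state   # (B, Ta, 768)
        t = self.text_enc(input_ids, attention_mask=attention_mask).last_hidden_state
        fused = self.fusion(torch.cat([a, t], dim=1))    # joint audio-text sequence
        pooled = fused.mean(dim=1)
        return (self.cls(pooled),                        # main emotion logits
                self.aux_audio(a.mean(dim=1)),           # unimodal auxiliary logits
                self.aux_text(t.mean(dim=1)))
```

During training, the auxiliary logits would typically contribute weighted loss terms alongside the main cross-entropy; the paper's actual auxiliary objectives differ and are not reproduced here.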
Related papers
- GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis [0.0]
Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze human sentiment.
Existing MSA models generally employ cutting-edge multimodal fusion and representation-learning methods to improve MSA capability.
Our proposed GSIFN incorporates two main components to address these problems, including (i) a graph-structured and interlaced-masked multimodal Transformer.
It adopts the Interlaced Mask mechanism to construct robust multimodal graph embeddings, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computational overhead.
arXiv Detail & Related papers (2024-08-27T06:44:28Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process (a minimal fixed-point sketch follows this entry).
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
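The deep equilibrium view above amounts to iterating a weight-tied fusion cell until its output stops changing. The toy sketch below, under assumed shapes and module names, shows only that forward fixed-point iteration; the actual DEQ fusion method also backpropagates implicitly through the equilibrium, which is omitted here.

```python
# Toy sketch of deep-equilibrium-style fusion: iterate a weight-tied cell
# z <- f(z, x_audio, x_text) until it approximately converges. Forward pass
# only; implicit differentiation through the equilibrium is not shown.
import torch
import torch.nn as nn


class FixedPointFusion(nn.Module):
    def __init__(self, dim: int = 256, max_iters: int = 30, tol: float = 1e-4):
        super().__init__()
        self.cell = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh())
        self.max_iters, self.tol = max_iters, tol

    def forward(self, x_audio: torch.Tensor, x_text: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x_audio)                 # start the iteration at z0 = 0
        for _ in range(self.max_iters):
            z_next = self.cell(torch.cat([z, x_audio, x_text], dim=-1))
            converged = (z_next - z).norm() < self.tol * (z.norm() + 1e-8)
            z = z_next
            if converged:
                break
        return z                                      # approximate equilibrium z*


fused = FixedPointFusion()(torch.randn(8, 256), torch.randn(8, 256))
```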
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that can effectively align and attend to the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations (a rough sketch of this idea follows the entry).
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
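The adapter-plus-layer-wise-fusion idea above can be illustrated with a bottleneck adapter and a gated mixing step applied at each BERT layer. The sketch below is a rough approximation under assumed module names and dimensions, not the AMB authors' exact design, and it assumes the audio features have already been aligned to the token sequence.

```python
# Rough sketch: a residual bottleneck adapter refines a BERT layer's hidden
# states, then a learned gate mixes in (pre-aligned) audio features at the
# same layer. Names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # residual bottleneck


class LayerwiseFusion(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.adapter = Adapter(dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_text, h_audio):
        h = self.adapter(h_text)
        g = torch.sigmoid(self.gate(torch.cat([h, h_audio], dim=-1)))
        return h + g * h_audio                          # gated audio injection


# One fusion block per BERT layer; here on dummy (batch, tokens, dim) tensors.
out = LayerwiseFusion()(torch.randn(2, 20, 768), torch.randn(2, 20, 768))
```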
- Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition [15.133202035812017]
We propose to use transfer learning, leveraging state-of-the-art pre-trained models including wav2vec 2.0 and BERT, for this task.
A multi-granularity framework is also proposed that extracts not only frame-level speech embeddings but also segment-level embeddings, including phone-, syllable-, and word-level speech embeddings, to further boost performance (a pooling sketch follows this entry).
arXiv Detail & Related papers (2022-07-11T08:20:53Z)
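One simple way to obtain the segment-level embeddings mentioned above is to mean-pool frame-level features within segment boundaries from a forced alignment. The sketch below only illustrates that pooling step; the boundary source, granularity, and function names are assumptions, not the paper's pipeline.

```python
# Pool frame-level speech features into segment-level embeddings (e.g. word
# level) by averaging the frames inside each alignment span. Illustrative only.
import torch


def segment_embeddings(frames: torch.Tensor, spans: list) -> torch.Tensor:
    """frames: (T, D) frame-level features; spans: list of (start, end) frame indices."""
    return torch.stack([frames[s:e].mean(dim=0) for s, e in spans])


frames = torch.randn(300, 768)                     # e.g. wav2vec 2.0 frame outputs
word_spans = [(0, 40), (40, 95), (95, 180)]        # hypothetical word boundaries
word_embs = segment_embeddings(frames, word_spans) # (3, 768) word-level embeddings
```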
- Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis [16.32509144501822]
We propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (a generic InfoNCE-style sketch follows this entry).
The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task.
arXiv Detail & Related papers (2021-09-01T14:45:16Z)
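A common way to maximize a lower bound on mutual information between paired representations is an InfoNCE-style contrastive loss, sketched below on hypothetical text/audio batches. MMIM's actual hierarchical estimators differ in detail; this shows only the generic contrastive form.

```python
# Generic InfoNCE-style MI lower bound: matched text/audio pairs along the
# batch diagonal are positives, every other pairing is a negative. Minimizing
# the cross-entropy maximizes the bound on I(text; audio).
import torch
import torch.nn.functional as F


def infonce_mi_loss(text: torch.Tensor, audio: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    text = F.normalize(text, dim=-1)
    audio = F.normalize(audio, dim=-1)
    logits = text @ audio.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(text.size(0), device=text.device)
    return F.cross_entropy(logits, targets)


loss = infonce_mi_loss(torch.randn(16, 256), torch.randn(16, 256))
```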
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.