Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion
Recognition
- URL: http://arxiv.org/abs/2207.04697v2
- Date: Tue, 12 Jul 2022 04:21:25 GMT
- Title: Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion
Recognition
- Authors: Zihan Zhao, Yanfeng Wang, Yu Wang
- Abstract summary: We propose to use transfer learning which leverages state-of-the-art pre-trained models including wav2vec 2.0 and BERT for this task.
Also, a multi-granularity framework which extracts not only frame-level speech embeddings but also segment-level embeddings including phone, syllable and word-level speech embeddings is proposed to further boost the performance.
- Score: 15.133202035812017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The research and applications of multimodal emotion recognition have become
increasingly popular recently. However, multimodal emotion recognition faces
the challenge of a lack of data. To solve this problem, we propose to use
transfer learning which leverages state-of-the-art pre-trained models including
wav2vec 2.0 and BERT for this task. Multi-level fusion approaches including
coattention-based early fusion and late fusion with the models trained on both
embeddings are explored. Also, a multi-granularity framework which extracts not
only frame-level speech embeddings but also segment-level embeddings including
phone, syllable and word-level speech embeddings is proposed to further boost
the performance. By combining our coattention-based early fusion model and late
fusion model with the multi-granularity feature extraction framework, we obtain
results that outperform the best baseline approaches by 1.3% unweighted accuracy
(UA) on the IEMOCAP dataset.
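To make the fusion strategy concrete, below is a minimal PyTorch sketch of a co-attention-based early-fusion block over wav2vec 2.0 frame embeddings and BERT token embeddings, plus a simple late-fusion combination of classifier outputs. The module layout, dimensions, pooling, and the probability averaging used for late fusion are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only: a co-attention block that fuses wav2vec 2.0 frame
# embeddings with BERT token embeddings, plus a simple late-fusion average of
# separately trained classifiers. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, n_heads=8, n_classes=4):
        super().__init__()
        # Each modality attends over the other (cross/co-attention).
        self.audio_to_text = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(audio_dim + text_dim, n_classes)

    def forward(self, audio_emb, text_emb):
        # audio_emb: (B, T_a, 768) wav2vec 2.0 frame embeddings
        # text_emb:  (B, T_t, 768) BERT token embeddings
        a_att, _ = self.audio_to_text(audio_emb, text_emb, text_emb)   # audio queries text
        t_att, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)  # text queries audio
        pooled = torch.cat([a_att.mean(dim=1), t_att.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


def late_fusion(logits_early, logits_audio, logits_text):
    # Late fusion: average class probabilities from the separately trained
    # models (a common choice; the equal weighting here is an assumption).
    probs = [torch.softmax(l, dim=-1) for l in (logits_early, logits_audio, logits_text)]
    return torch.stack(probs).mean(dim=0)
```

In the multi-granularity framework described above, segment-level speech embeddings (phone, syllable, and word level) would be fed through the same kind of fusion block in addition to the frame-level ones.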
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - From Text to Pixels: A Context-Aware Semantic Synergy Solution for
Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, reaching state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z) - Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability.
Feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
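A rough sketch of the idea of injecting modality-derived prompt vectors into every attention layer, as the summary describes; the projection into prompt tokens and the prepend-to-keys/values strategy are assumptions, not the MPT paper's actual design.

```python
# Rough sketch: pooled features from a weak modality are projected into prompt
# tokens and prepended to the keys/values of each attention layer. Assumed
# design for illustration only.
import torch
import torch.nn as nn


class PromptedAttentionLayer(nn.Module):
    def __init__(self, dim=768, n_heads=8, n_prompts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.prompt_proj = nn.Linear(dim, n_prompts * dim)
        self.n_prompts, self.dim = n_prompts, dim

    def forward(self, x, weak_modality_summary):
        # x: (B, T, dim) strong-modality sequence
        # weak_modality_summary: (B, dim) pooled features from the weaker modality
        prompts = self.prompt_proj(weak_modality_summary).view(-1, self.n_prompts, self.dim)
        kv = torch.cat([prompts, x], dim=1)
        out, _ = self.attn(x, kv, kv)
        return x + out  # residual connection
```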
arXiv Detail & Related papers (2023-10-04T13:54:46Z) - Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For
Multimodal Emotion Recognition [24.115771176570824]
We propose to use pretrained models as upstream networks: wav2vec 2.0 for the audio modality and BERT for the text modality.
To address the difficulty of multimodal fusion, we use a K-layer multi-head attention mechanism as a downstream fusion module.
We achieve better performance, with 78.42% weighted accuracy (WA) and 79.71% unweighted accuracy (UA), on the IEMOCAP dataset.
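A minimal sketch of what a K-layer multi-head-attention fusion module placed downstream of the wav2vec 2.0 and BERT encoders might look like; the layer count, dimensions, and mean pooling are assumptions for illustration.

```python
# Minimal sketch of a K-layer self-attention fusion module over the
# concatenated upstream sequences. Hyperparameters are illustrative.
import torch
import torch.nn as nn


class KLayerAttentionFusion(nn.Module):
    def __init__(self, dim=768, n_heads=8, k_layers=4, n_classes=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=k_layers)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, audio_emb, text_emb):
        # Concatenate the two upstream sequences along time and let
        # self-attention mix them for K layers.
        fused = self.fusion(torch.cat([audio_emb, text_emb], dim=1))
        return self.classifier(fused.mean(dim=1))
```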
arXiv Detail & Related papers (2023-02-27T10:59:08Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
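A hedged sketch of the adapter-plus-layer-wise-fusion idea: a bottleneck adapter tuned for the task, and a fusion layer that mixes aligned audio-visual features into the textual hidden states of a frozen BERT layer. The module layout and sizes are assumptions, not the paper's exact architecture.

```python
# Sketch of a bottleneck adapter and a layer-wise audio-visual fusion module
# wrapped around one (frozen) BERT layer's output. Sizes are assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual bottleneck


class LayerwiseFusion(nn.Module):
    """Fuses aligned audio-visual features into the textual hidden states of one layer."""

    def __init__(self, dim=768, av_dim=128):
        super().__init__()
        self.gate = nn.Linear(dim + av_dim, dim)
        self.adapter = Adapter(dim)

    def forward(self, text_hidden, av_feat):
        # text_hidden: (B, T, dim) output of a frozen BERT layer
        # av_feat:     (B, T, av_dim) aligned audio-visual features
        fused = text_hidden + torch.tanh(self.gate(torch.cat([text_hidden, av_feat], dim=-1)))
        return self.adapter(fused)  # only adapter/fusion parameters are trained
```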
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Multilevel Transformer For Multimodal Emotion Recognition [6.0149102420697025]
We introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation.
Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition.
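One simple way to combine a fine-grained representation with a pre-trained utterance-level representation, as the summary suggests, is to pool the fine-grained sequence and concatenate the two; the sketch below is illustrative only and is not the paper's multilevel transformer.

```python
# Illustrative combiner: encode fine-grained features, mean-pool, and
# concatenate with a pre-trained utterance-level embedding. Assumed design.
import torch
import torch.nn as nn


class MultilevelCombiner(nn.Module):
    def __init__(self, dim=768, utt_dim=768, n_classes=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fine_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim + utt_dim, n_classes)

    def forward(self, fine_feats, utt_emb):
        # fine_feats: (B, T, dim) frame/token-level features; utt_emb: (B, utt_dim)
        pooled = self.fine_encoder(fine_feats).mean(dim=1)
        return self.classifier(torch.cat([pooled, utt_emb], dim=-1))
```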
arXiv Detail & Related papers (2022-10-26T10:31:24Z) - MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
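A toy sketch of top-down feedback in the forward pass: the high-level state from a first bottom-up pass produces masks that re-weight the low-level inputs before a second pass. This is a loose reading of the summary, not MMLatch's implementation.

```python
# Toy top-down feedback: a first bottom-up pass yields a high-level summary,
# which gates the raw inputs before a second bottom-up pass. Assumed design.
import torch
import torch.nn as nn


class TopDownFeedback(nn.Module):
    def __init__(self, low_dim=128, high_dim=256):
        super().__init__()
        self.bottom_up = nn.GRU(low_dim, high_dim, batch_first=True)
        self.mask_head = nn.Linear(high_dim, low_dim)

    def forward(self, x):
        # x: (B, T, low_dim) low-level modality features
        _, h = self.bottom_up(x)                                   # high-level summary
        mask = torch.sigmoid(self.mask_head(h[-1])).unsqueeze(1)   # (B, 1, low_dim)
        out, _ = self.bottom_up(x * mask)                          # second, gated pass
        return out
```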
arXiv Detail & Related papers (2022-01-24T17:48:04Z) - MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
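A hedged sketch of the prompt-based reformulation described above: the emotion label is predicted as a masked word appended to the input text and scored with a masked language model head. The prompt wording and the label-word verbalizer below are hypothetical.

```python
# Sketch of emotion classification as masked text prediction with a prompt.
# Prompt text and label-word mapping are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
label_words = ["happy", "sad", "angry", "neutral"]        # hypothetical verbalizer
label_ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words]

text = "I can't believe we actually won the game!"
prompt = f"{text} I feel [MASK]."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                        # (1, T, vocab)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
scores = logits[0, mask_pos, label_ids]                    # scores for the label words
print(label_words[scores.argmax().item()])
```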
arXiv Detail & Related papers (2021-10-27T09:57:00Z) - Multistage linguistic conditioning of convolutional layers for speech
emotion recognition [7.482371204083917]
We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
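One common way to integrate a text stream into several layers of an audio DNN is a FiLM-style scale-and-shift at each convolutional stage; the sketch below illustrates multistage conditioning in general and is not necessarily the authors' exact fusion scheme.

```python
# FiLM-style multistage conditioning of convolutional audio features on a
# text embedding. Illustrative only; not the paper's exact method.
import torch
import torch.nn as nn


class ConditionedConvStage(nn.Module):
    def __init__(self, channels, text_dim=768):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.scale = nn.Linear(text_dim, channels)
        self.shift = nn.Linear(text_dim, channels)

    def forward(self, audio_feat, text_emb):
        # audio_feat: (B, C, T) audio features; text_emb: (B, text_dim)
        h = torch.relu(self.conv(audio_feat))
        gamma = self.scale(text_emb).unsqueeze(-1)   # (B, C, 1)
        beta = self.shift(text_emb).unsqueeze(-1)
        return gamma * h + beta


class MultistageSER(nn.Module):
    def __init__(self, channels=64, n_stages=3, n_classes=4):
        super().__init__()
        self.stages = nn.ModuleList([ConditionedConvStage(channels) for _ in range(n_stages)])
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, audio_feat, text_emb):
        for stage in self.stages:            # text conditions every stage
            audio_feat = stage(audio_feat, text_emb)
        return self.classifier(audio_feat.mean(dim=-1))
```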
arXiv Detail & Related papers (2021-10-13T11:28:04Z) - Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
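A very rough sketch of a two-stage graph construction for conversational emotion recognition: stage one links each utterance's modality nodes to an utterance node, stage two links neighbouring utterance nodes, and a plain GCN layer propagates information. The node/edge scheme and the GCN layer are illustrative assumptions, not the HFGCN construction.

```python
# Rough sketch: two-stage adjacency (modality->utterance, utterance->neighbour)
# plus a simple GCN layer. Assumed construction for illustration only.
import torch
import torch.nn as nn


def build_two_stage_adjacency(n_utts, n_modalities=3):
    # Nodes: one per (utterance, modality) pair, plus one per utterance.
    n_nodes = n_utts * n_modalities + n_utts
    adj = torch.eye(n_nodes)
    for u in range(n_utts):
        utt_node = n_utts * n_modalities + u
        for m in range(n_modalities):            # stage 1: modality-utterance edges
            adj[utt_node, u * n_modalities + m] = 1.0
            adj[u * n_modalities + m, utt_node] = 1.0
        if u > 0:                                # stage 2: neighbouring utterances
            prev = n_utts * n_modalities + (u - 1)
            adj[utt_node, prev] = adj[prev, utt_node] = 1.0
    return adj


class SimpleGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # Row-normalise so aggregation is a weighted mean over neighbours.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj / deg) @ node_feats))
```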
arXiv Detail & Related papers (2021-09-15T08:21:01Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
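A minimal sketch of bi-bimodal fusion: two bimodal branches (text-audio and text-visual) each let text attend to the other modality, and their pooled outputs are concatenated for prediction. The branch design and pooling are assumptions based only on the summary above.

```python
# Sketch of fusing two bimodal pairs with separate cross-attention branches.
# Branch design and pooling are illustrative assumptions.
import torch
import torch.nn as nn


class BimodalBranch(nn.Module):
    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text, other):
        out, _ = self.cross(text, other, other)  # text attends to the other modality
        return out.mean(dim=1)


class BiBimodalFusion(nn.Module):
    def __init__(self, dim=768, n_classes=3):
        super().__init__()
        self.ta = BimodalBranch(dim)   # text-audio pair
        self.tv = BimodalBranch(dim)   # text-visual pair
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, text, audio, visual):
        pooled = torch.cat([self.ta(text, audio), self.tv(text, visual)], dim=-1)
        return self.classifier(pooled)
```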
arXiv Detail & Related papers (2021-07-28T23:33:42Z)