Multistage linguistic conditioning of convolutional layers for speech
emotion recognition
- URL: http://arxiv.org/abs/2110.06650v1
- Date: Wed, 13 Oct 2021 11:28:04 GMT
- Title: Multistage linguistic conditioning of convolutional layers for speech
emotion recognition
- Authors: Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, Stephan Huber,
Florian Eyben, Björn W. Schuller
- Abstract summary: We investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER).
We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN).
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline.
- Score: 7.482371204083917
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this contribution, we investigate the effectiveness of deep fusion of text
and audio features for categorical and dimensional speech emotion recognition
(SER). We propose a novel, multistage fusion method where the two information
streams are integrated in several layers of a deep neural network (DNN), and
contrast it with a single-stage one where the streams are merged in a single
point. Both methods depend on extracting summary linguistic embeddings from a
pre-trained BERT model, and conditioning one or more intermediate
representations of a convolutional model operating on log-Mel spectrograms.
Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate
that the two fusion methods clearly outperform a shallow (late) fusion baseline
and their unimodal constituents, both in terms of quantitative performance and
qualitative behaviour. Our accompanying analysis further reveals a hitherto
unexplored role of the underlying dialogue acts on unimodal and bimodal SER,
with different models showing a biased behaviour across different acts.
Overall, our multistage fusion shows better quantitative performance,
surpassing all alternatives on most of our evaluations. This illustrates the
potential of multistage fusion in better assimilating text and audio
information.
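The difference between single-stage and multistage fusion described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the conditioning is modelled as an additive per-channel shift obtained by linearly projecting the text embedding, the convolutional blocks are replaced by 1x1 convolutions with ReLU, and random numbers stand in for the log-Mel spectrogram and the BERT summary embedding. The names `pointwise_conv`, `project`, `add_shift`, and `forward` are hypothetical.

```python
import random


def pointwise_conv(x, weights):
    """1x1 convolution followed by ReLU (a stand-in for a real conv block).

    x has shape [C_in][H][W]; weights has shape [C_out][C_in].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [max(0.0, sum(wt * x[c][i][j] for c, wt in enumerate(row)))
             for j in range(width)]
            for i in range(height)
        ]
        for row in weights
    ]


def project(embedding, weights):
    """Linear projection of the text embedding to one value per channel."""
    return [sum(wt * e for wt, e in zip(row, embedding)) for row in weights]


def add_shift(x, shift):
    """Assumed conditioning: add one scalar per channel at every position."""
    return [[[v + s for v in row] for row in ch] for ch, s in zip(x, shift)]


def forward(spec, text_emb, conv_weights, proj_weights, fuse_at):
    """Run the audio stream and condition it on text at the stages in fuse_at."""
    x = spec
    for stage, w in enumerate(conv_weights):
        x = pointwise_conv(x, w)
        if stage in fuse_at:
            x = add_shift(x, project(text_emb, proj_weights[stage]))
    # Global average pooling: one value per output channel.
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
channels = [1, 4, 8, 8]   # input channel plus three stages (hypothetical sizes)
emb_dim = 6               # toy size; a real BERT embedding is much larger
conv_weights = [rand_matrix(channels[i + 1], channels[i], rng) for i in range(3)]
proj_weights = [rand_matrix(channels[i + 1], emb_dim, rng) for i in range(3)]

spec = [[[rng.random() for _ in range(5)] for _ in range(4)]]  # 1 x 4 x 5 "spectrogram"
text_emb = [rng.uniform(-1.0, 1.0) for _ in range(emb_dim)]    # stand-in text embedding

audio_only = forward(spec, text_emb, conv_weights, proj_weights, fuse_at=set())
single_stage = forward(spec, text_emb, conv_weights, proj_weights, fuse_at={2})
multistage = forward(spec, text_emb, conv_weights, proj_weights, fuse_at={0, 1, 2})
```

In this toy setup, single-stage fusion at the last stage only offsets the pooled audio features by the projected text embedding, whereas multistage fusion also changes what the later blocks compute, because the text shifts their inputs before the next nonlinearity.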
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - Multimodal Prompt Transformer with Hybrid Contrastive Learning for
Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cues extraction was performed on modalities with strong representation ability.
Feature filters were designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion
Recognition [15.133202035812017]
We propose to use transfer learning which leverages state-of-the-art pre-trained models including wav2vec 2.0 and BERT for this task.
Also, a multi-granularity framework which extracts not only frame-level speech embeddings but also segment-level embeddings including phone, syllable and word-level speech embeddings is proposed to further boost the performance.
arXiv Detail & Related papers (2022-07-11T08:20:53Z) - DeepSafety: Multi-level Audio-Text Feature Extraction and Fusion Approach
for Violence Detection in Conversations [2.8038382295783943]
The choice of words and vocal cues in conversations presents an underexplored rich source of natural language data for personal safety and crime prevention.
We introduce a new information fusion approach that extracts and fuses multi-level features including verbal, vocal, and text as heterogeneous sources of information to detect the extent of violent behaviours in conversations.
arXiv Detail & Related papers (2022-06-23T16:45:50Z) - MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z) - Group Gated Fusion on Attention-based Bidirectional Alignment for
Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model, the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z) - Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - TransModality: An End2End Fusion Method with Transformer for Multimodal
Sentiment Analysis [42.6733747726081]
We propose a new fusion method, TransModality, to address the task of multimodal sentiment analysis.
We validate our model on multiple multimodal datasets: CMU-MOSI, MELD, IEMOCAP.
arXiv Detail & Related papers (2020-09-07T06:11:56Z)