Hybrid Multimodal Fusion for Dimensional Emotion Recognition
- URL: http://arxiv.org/abs/2110.08495v1
- Date: Sat, 16 Oct 2021 06:57:18 GMT
- Title: Hybrid Multimodal Fusion for Dimensional Emotion Recognition
- Authors: Ziyu Ma, Fuyan Ma, Bin Sun, Shutao Li
- Abstract summary: We extensively present our solutions for the MuSe-Stress and MuSe-Physio sub-challenges of the Multimodal Sentiment Challenge (MuSe) 2021.
For the MuSe-Stress sub-challenge, we highlight three aspects of our solution: audio-visual and bio-signal features for emotional state recognition, an LSTM with self-attention for temporal modeling, and a late fusion strategy.
For the MuSe-Physio sub-challenge, we first extract the audio-visual features and the bio-signal features from multiple modalities.
- Score: 20.512310175499664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we extensively present our solutions for the MuSe-Stress
sub-challenge and the MuSe-Physio sub-challenge of the Multimodal Sentiment
Challenge (MuSe) 2021. The goal of the MuSe-Stress sub-challenge is to predict
the level of emotional arousal and valence in a time-continuous manner from
audio-visual recordings, while the goal of the MuSe-Physio sub-challenge is to
predict the level of psycho-physiological arousal from a) human annotations
fused with b) galvanic skin response (also known as Electrodermal Activity (EDA))
signals from stressed participants. Both sub-challenges use the Ulm-TSST dataset,
a novel subset of the audio-visual-textual Ulm-Trier Social Stress dataset that
features German speakers in a Trier Social Stress Test (TSST) induced stress
situation. For the MuSe-Stress sub-challenge, we highlight three aspects of our
solution: 1) audio-visual and bio-signal features are used for emotional state
recognition; 2) a Long Short-Term Memory (LSTM) network with a self-attention
mechanism captures complex temporal dependencies within the feature sequences;
3) a late fusion strategy further boosts recognition performance by exploiting
complementary information scattered across the multimodal sequences. Our
proposed model achieves Concordance Correlation Coefficient (CCC) scores of
0.6159 for valence and 0.4609 for arousal on the test set, both ranking in the
top 3. For the MuSe-Physio sub-challenge, we first extract audio-visual and
bio-signal features from multiple modalities. Then, an LSTM module with the
self-attention mechanism, as well as Gated Convolutional Neural Networks (GCNN)
combined with an LSTM network, are used to model the complex temporal
dependencies in the sequences. Finally, the late fusion strategy is applied.
Our proposed method achieves a CCC of 0.5412 on the test set, also ranking in
the top 3.
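To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' released code) of a single-modality LSTM branch with self-attention, a simple late-fusion step over per-modality predictions, and the Concordance Correlation Coefficient (CCC) used for evaluation. All module choices, dimensions, and the equal-weight averaging are assumptions for illustration.

```python
# Illustrative sketch of an LSTM + self-attention branch, late fusion, and CCC,
# loosely following the pipeline described in the abstract.
# All names, dimensions, and fusion weights are assumptions, not the authors' code.
import torch
import torch.nn as nn


class LSTMSelfAttentionBranch(nn.Module):
    """One modality branch: LSTM over the feature sequence, self-attention over
    the LSTM states, then a frame-wise regression head for a continuous target."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> per-frame prediction (batch, time)
        h, _ = self.lstm(x)
        a, _ = self.attn(h, h, h)  # self-attention over the LSTM output sequence
        return self.head(a).squeeze(-1)


def late_fusion(preds: list) -> torch.Tensor:
    """Late fusion: average the per-modality predictions (equal weights assumed)."""
    return torch.stack(preds, dim=0).mean(dim=0)


def ccc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient, the evaluation measure used in MuSe 2021."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pv + tv + (pm - tm) ** 2)


if __name__ == "__main__":
    audio = torch.randn(2, 100, 88)   # e.g. acoustic features (dimensions assumed)
    video = torch.randn(2, 100, 512)  # e.g. deep visual features (dimensions assumed)
    branches = [LSTMSelfAttentionBranch(88), LSTMSelfAttentionBranch(512)]
    fused = late_fusion([b(x) for b, x in zip(branches, (audio, video))])
    print(ccc(fused.flatten(), torch.randn_like(fused).flatten()))
```

A common choice in this kind of time-continuous emotion regression (though not confirmed for this particular paper) is to also use 1 - CCC directly as the training loss.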
Related papers
- RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning [75.61681328968714]
We propose recurrent independent Grid LSTM (RigLSTM) to exploit the underlying modular structure of the target task.
Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve a better generalization ability.
arXiv Detail & Related papers (2023-11-03T07:40:06Z)
- The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation [69.13075715686622]
MuSe 2023 is a set of shared tasks addressing three different contemporary multimodal affect and sentiment analysis problems.
MuSe 2023 seeks to bring together a broad audience from different research communities.
arXiv Detail & Related papers (2023-05-05T08:53:57Z)
- A Multimodal Approach for Dementia Detection from Spontaneous Speech with Tensor Fusion Layer [0.0]
Alzheimer's disease (AD) is a progressive neurological disorder, which affects memory, thinking skills, and mental abilities.
We propose deep neural networks that can be trained end-to-end and capture the inter- and intra-modal interactions.
arXiv Detail & Related papers (2022-11-08T16:43:58Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Hybrid Multimodal Feature Extraction, Mining and Fusion for Sentiment Analysis [31.097398034974436]
We present our solutions for the Multimodal Sentiment Analysis Challenge (MuSe) 2022, which includes MuSe-Humor, MuSe-Reaction and MuSe-Stress Sub-challenges.
MuSe 2022 focuses on humor detection, emotional reactions, and multimodal emotional stress, utilising different modalities and data sets.
arXiv Detail & Related papers (2022-08-05T09:07:58Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
- Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
- $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild [86.40973759048957]
This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge.
In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal.
We evaluated the $M^3$T framework on the validation set provided by ABAW and it significantly outperforms the baseline method.
arXiv Detail & Related papers (2020-02-07T18:53:13Z)
- The Deterministic plus Stochastic Model of the Residual Signal and its Applications [13.563526970105988]
This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal.
The applicability of the DSM in two fields of speech processing is then studied.
arXiv Detail & Related papers (2019-12-29T07:52:37Z)