Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion
Recognition
- URL: http://arxiv.org/abs/2103.09154v1
- Date: Tue, 16 Mar 2021 15:49:15 GMT
- Title: Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion
Recognition
- Authors: Liam Schoneveld and Alice Othmani and Hazem Abdelkawy
- Abstract summary: Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
- Score: 2.1485350418225244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotional expressions are the behaviors that communicate our emotional state
or attitude to others. They are expressed through verbal and non-verbal
communication. Complex human behavior can be understood by studying physical
features from multiple modalities, mainly facial, vocal, and physical gestures.
Recently, spontaneous multi-modal emotion recognition has been extensively
studied for human behavior analysis. In this paper, we propose a new deep
learning-based approach for audio-visual emotion recognition. Our approach
leverages recent advances in deep learning like knowledge distillation and
high-performing deep architectures. The deep feature representations of the
audio and visual modalities are fused based on a model-level fusion strategy. A
recurrent neural network is then used to capture the temporal dynamics. Our
proposed approach substantially outperforms state-of-the-art approaches in
predicting valence on the RECOLA dataset. Moreover, our proposed visual facial
expression feature extraction network outperforms state-of-the-art results on
the AffectNet and Google Facial Expression Comparison datasets.
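As a rough illustration of the pipeline the abstract describes, the sketch below fuses per-frame audio and visual embeddings at the model level and runs a recurrent network over the fused sequence to regress valence. All dimensions and module names here are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AVFusionRegressor(nn.Module):
    """Hypothetical sketch: model-level fusion of per-frame audio and
    visual embeddings, followed by a GRU over time and a valence head."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project each modality's deep features to a shared size (assumed dims).
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Recurrent network captures the temporal dynamics of the fused stream.
        self.rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.valence_head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
        fused = torch.cat(
            [self.audio_proj(audio_feats), self.visual_proj(visual_feats)], dim=-1
        )  # model-level fusion: concatenate per-frame representations
        out, _ = self.rnn(fused)
        return self.valence_head(out).squeeze(-1)  # (batch, time) valence

model = AVFusionRegressor()
valence = model(torch.randn(2, 50, 128), torch.randn(2, 50, 512))
```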
Related papers
- Emotion Recognition from the perspective of Activity Recognition [0.0]
Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions.
For emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the wild.
We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism.
arXiv Detail & Related papers (2024-03-24T18:53:57Z)
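A minimal sketch of the three-stream idea above, assuming made-up stream dimensions and a simple softmax attention over stream embeddings (the authors' actual streams and attention design may differ):

```python
import torch
import torch.nn as nn

class ThreeStreamAttentionRegressor(nn.Module):
    """Illustrative sketch only: three modality streams are scored by a
    softmax attention, then a weighted sum feeds a regressor for continuous
    emotion dimensions. Stream contents and sizes are assumptions."""

    def __init__(self, dims=(128, 256, 64), hidden=128):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.attn_score = nn.Linear(hidden, 1)   # one scalar score per stream
        self.regressor = nn.Linear(hidden, 2)    # e.g., valence and arousal

    def forward(self, streams):
        # streams: list of three tensors, each (batch, dim_i)
        encoded = torch.stack([enc(s) for enc, s in zip(self.encoders, streams)], dim=1)
        weights = torch.softmax(self.attn_score(encoded), dim=1)  # (batch, 3, 1)
        fused = (weights * encoded).sum(dim=1)                    # attention-weighted sum
        return self.regressor(fused)

model = ThreeStreamAttentionRegressor()
out = model([torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 64)])
```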
- Enhancing the Prediction of Emotional Experience in Movies using Deep Neural Networks: The Significance of Audio and Language [0.0]
Our paper focuses on making use of deep neural network models to accurately predict the range of human emotions experienced during watching movies.
In this setup, three distinct input modalities considerably influence the experienced emotions: visual cues derived from RGB video frames, auditory components encompassing sounds, speech, and music, and linguistic elements drawn from the actors' dialogues.
arXiv Detail & Related papers (2023-06-17T17:40:27Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
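The late-fusion idea above can be sketched as follows; `speech_encoder` and `text_encoder` are toy stand-ins for the transfer-learned speaker-recognition and BERT-based models, and averaging the two logit vectors is just one plausible fusion rule:

```python
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    """Sketch under assumptions: each modality gets its own classification
    head, and the class scores are averaged (late fusion). Encoder choices
    and dimensions are hypothetical."""

    def __init__(self, speech_encoder, text_encoder, speech_dim, text_dim, n_classes=4):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.text_encoder = text_encoder
        self.speech_head = nn.Linear(speech_dim, n_classes)
        self.text_head = nn.Linear(text_dim, n_classes)

    def forward(self, speech_input, text_input):
        speech_logits = self.speech_head(self.speech_encoder(speech_input))
        text_logits = self.text_head(self.text_encoder(text_input))
        # Late fusion: combine modality-level decisions, not raw features.
        return (speech_logits + text_logits) / 2

# Toy stand-ins for the transfer-learned encoders.
model = LateFusionEmotionClassifier(
    speech_encoder=nn.Linear(40, 192), text_encoder=nn.Linear(768, 256),
    speech_dim=192, text_dim=256,
)
logits = model(torch.randn(2, 40), torch.randn(2, 768))
```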
- Multi-Cue Adaptive Emotion Recognition Network [4.570705738465714]
We propose a new deep learning approach for emotion recognition based on adaptive multi-cues.
We compare the proposed approach with state-of-the-art approaches on the CAER-S dataset.
arXiv Detail & Related papers (2021-11-03T15:08:55Z)
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
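A loose sketch of a scene-guided attention over object features, in the spirit of the Scene-Object Fusion Module described above; the dimensions and dot-product scoring are assumptions rather than SOLVER's exact design:

```python
import torch
import torch.nn as nn

class SceneObjectFusion(nn.Module):
    """Hypothetical sketch: the scene feature acts as a query that weights
    detected-object features, so the scene guides the fusion of objects."""

    def __init__(self, dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, scene_feat, object_feats):
        # scene_feat: (batch, dim); object_feats: (batch, n_objects, dim)
        q = self.query(scene_feat).unsqueeze(1)              # (batch, 1, dim)
        k = self.key(object_feats)                           # (batch, n, dim)
        attn = torch.softmax((q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5, dim=1)
        pooled = (attn * object_feats).sum(dim=1)            # scene-guided pooling
        return self.out(torch.cat([scene_feat, pooled], dim=-1))

fusion = SceneObjectFusion()
fused = fusion(torch.randn(2, 256), torch.randn(2, 5, 256))
```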
- Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction, and emotion prediction.
To the best of our knowledge, this is the first work to introduce a stimuli selection process into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
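The reinforcement-learning angle above might look roughly like the following REINFORCE-style step, where a frozen emotion recognizer provides the reward. Everything here is a toy stand-in, not the paper's actual i-ETTS recipe:

```python
import torch
import torch.nn as nn

# Illustrative policy-gradient step: a frozen emotion recognizer scores the
# synthesized output, and its confidence in the target emotion is the reward.
synthesizer = nn.Linear(16, 8)        # toy stand-in for a TTS acoustic model
recognizer = nn.Linear(8, 4)          # toy stand-in for an emotion classifier
for p in recognizer.parameters():
    p.requires_grad_(False)           # reward model stays fixed this iteration

optimizer = torch.optim.Adam(synthesizer.parameters(), lr=1e-3)

text, target_emotion = torch.randn(2, 16), torch.tensor([1, 3])
dist = torch.distributions.Categorical(logits=synthesizer(text))
sample = dist.sample()                # stochastic "acoustic" decision
# Reward: recognizer confidence in the target emotion for the sampled output.
onehot = nn.functional.one_hot(sample, 8).float()
reward = torch.softmax(recognizer(onehot), dim=-1).gather(1, target_emotion[:, None]).squeeze(1)
loss = -(reward.detach() * dist.log_prob(sample)).mean()   # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```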
- Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory units, and inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z)
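The weight-inflation trick mentioned above follows the well-known I3D recipe: a pre-trained 2D kernel is replicated along a new temporal axis and rescaled so activations keep the same magnitude. A sketch, with padding and stride handling simplified:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Sketch of I3D-style inflation: replicate the 2D kernel T times along
    a new temporal axis and divide by T to preserve activation scale."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), rescaled by 1/T.
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
        conv3d.weight.copy_(weight3d / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=3, padding=1))
out = conv3d(torch.randn(1, 3, 8, 32, 32))  # (batch, channels, time, H, W)
```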
- Temporal aggregation of audio-visual modalities for emotion recognition [0.5352699766206808]
We propose a multimodal fusion technique for emotion recognition that combines audio-visual modalities from a temporal window, with different temporal offsets for each modality.
Our proposed method outperforms other methods from the literature as well as human accuracy ratings.
arXiv Detail & Related papers (2020-07-08T18:44:15Z)
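A toy sketch of per-modality temporal offsets as described above; the window size and offset values are invented for illustration:

```python
import torch

def offset_windows(audio, video, t, window=8, audio_offset=0, video_offset=-2):
    """Toy sketch (offsets and window size are made-up values): extract a
    window per modality around time step t, shifted by a per-modality offset,
    so the model can exploit that cues peak at different times."""
    def grab(seq, center):
        start = max(0, min(center, seq.size(1) - window))
        return seq[:, start:start + window]
    return grab(audio, t + audio_offset), grab(video, t + video_offset)

audio = torch.randn(2, 100, 40)   # (batch, time, audio feature dim)
video = torch.randn(2, 100, 512)  # (batch, time, visual feature dim)
a_win, v_win = offset_windows(audio, video, t=50)
fused = torch.cat([a_win.mean(1), v_win.mean(1)], dim=-1)  # simple fusion
```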
- Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks [6.676572642463495]
We propose a system that recognizes emotions with high accuracy and in real time.
To increase the accuracy of the recognition system, we also analyze the speech data and fuse the information coming from both sources.
arXiv Detail & Related papers (2020-02-29T22:09:46Z)
- Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor [70.2226417364135]
It is crucial that the machine be able to recognize the user's emotional state with high accuracy.
Deep neural networks have been used with great success in recognizing emotions.
We present a new model for continuous emotion recognition based on facial expression recognition.
arXiv Detail & Related papers (2020-01-31T17:47:16Z)
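A rough sketch of the two-stage idea above: a convolutional autoencoder's encoder yields compact face representations, and a support vector regressor maps them to a continuous emotion value. The architecture, input size, and label setup here are assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVR

# Stage 1 (hypothetical encoder): in the full method, a matching decoder
# would be trained jointly with this encoder via a reconstruction loss.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
)

faces = torch.randn(32, 1, 48, 48)            # toy grayscale face crops
valence = np.random.uniform(-1, 1, size=32)   # toy continuous labels
with torch.no_grad():
    feats = encoder(faces).numpy()            # frozen deep features

# Stage 2: support vector regression on the learned representations.
svr = SVR(kernel="rbf", C=1.0).fit(feats, valence)
pred = svr.predict(feats[:4])
```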