Emotion Recognition in Audio and Video Using Deep Neural Networks
- URL: http://arxiv.org/abs/2006.08129v1
- Date: Mon, 15 Jun 2020 04:50:18 GMT
- Title: Emotion Recognition in Audio and Video Using Deep Neural Networks
- Authors: Mandeep Singh and Yuan Fang
- Abstract summary: With the advancement of deep learning technology, speech recognition has improved significantly.
Recognizing emotion from speech is an important aspect, and deep learning has improved emotion recognition in both accuracy and latency.
In this work, we explore different neural networks to improve the accuracy of emotion recognition.
- Score: 9.694548197876868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are able to comprehend information from multiple domains, e.g.
speech, text, and vision. With the advancement of deep learning technology,
speech recognition has improved significantly. Recognizing emotion from speech
is an important aspect, and with deep learning, emotion recognition has
improved in both accuracy and latency. Many challenges remain in improving
accuracy further. In this work, we explore different neural networks to improve
the accuracy of emotion recognition. Among the architectures explored, we find
that a (CNN+RNN) + 3DCNN multi-modal architecture, which processes audio
spectrograms and the corresponding video frames, gives an emotion prediction
accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the
IEMOCAP [2] dataset.
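As a rough illustration of the architecture the abstract describes, the PyTorch sketch below wires a CNN+RNN audio branch over spectrogram chunks and a 3D-CNN video branch into a late-fusion classifier. All layer sizes, the choice of a GRU for the unspecified RNN, and the concatenation head are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class AudioCNNRNN(nn.Module):
    """CNN over spectrogram chunks, then an RNN over the chunk sequence."""
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.rnn = nn.GRU(32 * 4 * 4, hidden, batch_first=True)

    def forward(self, spec_chunks):            # (B, T, 1, freq, width)
        b, t = spec_chunks.shape[:2]
        feats = self.cnn(spec_chunks.flatten(0, 1)).flatten(1)  # (B*T, 512)
        _, h = self.rnn(feats.view(b, t, -1))
        return h.squeeze(0)                    # (B, hidden)

class Video3DCNN(nn.Module):
    """3D CNN over a stack of RGB video frames."""
    def __init__(self, out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, out)

    def forward(self, clips):                  # (B, 3, T, H, W)
        return self.fc(self.net(clips).flatten(1))

class MultiModalEmotionNet(nn.Module):
    """Late fusion of the two branches into emotion logits."""
    def __init__(self, n_classes=4):           # e.g. 4 IEMOCAP emotions
        super().__init__()
        self.audio, self.video = AudioCNNRNN(), Video3DCNN()
        self.head = nn.Linear(128 + 128, n_classes)

    def forward(self, spec_chunks, clips):
        fused = torch.cat([self.audio(spec_chunks), self.video(clips)], dim=1)
        return self.head(fused)

model = MultiModalEmotionNet()
logits = model(torch.randn(2, 8, 1, 64, 64),   # 8 spectrogram chunks
               torch.randn(2, 3, 16, 64, 64))  # 16 RGB frames
```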
Related papers
- Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare [0.0]
The process of identifying human emotions and affective states from speech is known as speech emotion recognition (SER).
My research uses a Convolutional Neural Network (CNN) to distinguish emotions in audio recordings and label them across a range of different emotions.
I have developed a machine learning model that identifies emotions from supplied audio files.
arXiv Detail & Related papers (2024-06-15T21:33:03Z)
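The entry above does not spell out its pipeline, so the following is a minimal sketch of the recipe it alludes to: log-mel spectrograms fed to a small 2D CNN classifier. The librosa preprocessing, mel resolution, label count, and file path are assumptions for illustration.

```python
import librosa
import torch
import torch.nn as nn

def wav_to_logmel(path, sr=16000, n_mels=64):
    """Load an audio file and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)             # (n_mels, frames)

class SpectrogramCNN(nn.Module):
    """Small 2D CNN mapping a log-mel spectrogram to emotion logits."""
    def __init__(self, n_emotions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_emotions)

    def forward(self, x):                       # (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

logmel = wav_to_logmel("clip.wav")              # hypothetical input file
x = torch.tensor(logmel).float()[None, None]    # (1, 1, n_mels, frames)
logits = SpectrogramCNN()(x)
```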
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multi-modal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
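The combined score in the entry above is easier to interpret with the usual definitions: unweighted accuracy (UA) is the mean of per-class recalls, weighted accuracy (WA) is plain overall accuracy, so UA + WA can reach 200 at most. A small worked example, assuming those standard definitions:

```python
import numpy as np

def ua_wa(y_true, y_pred):
    """UA = mean per-class recall, WA = overall accuracy (both in %)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = 100.0 * np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = 100.0 * np.mean(recalls)
    return ua, wa

# Toy run with 3 imbalanced classes:
ua, wa = ua_wa([0, 0, 0, 0, 1, 1, 2], [0, 0, 0, 1, 1, 0, 2])
print(f"UA={ua:.2f}  WA={wa:.2f}  UA+WA={ua + wa:.2f}")
# UA=75.00  WA=71.43  UA+WA=146.43
```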
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Recognition In Persian Speech Using Deep Neural Networks [0.0]
Speech Emotion Recognition (SER) is of great importance in Human-Computer Interaction (HCI).
In this article, we examine various deep learning techniques on the ShEMO dataset.
arXiv Detail & Related papers (2022-04-28T16:02:05Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
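Late fusion, as used in the entry above, means each modality is encoded on its own and only the fixed-size embeddings are combined at the classification head. A minimal sketch follows; the dummy linear encoders stand in for the paper's pretrained speaker-recognition and BERT-based models, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LateFusionSER(nn.Module):
    """Concatenate per-modality embeddings; classify only after fusion."""
    def __init__(self, speech_enc, text_enc, speech_dim, text_dim, n_classes=4):
        super().__init__()
        self.speech_enc, self.text_enc = speech_enc, text_enc
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, speech_x, text_x):
        s = self.speech_enc(speech_x)           # (B, speech_dim)
        t = self.text_enc(text_x)               # (B, text_dim)
        return self.head(torch.cat([s, t], dim=1))

# Dummy encoders in place of the fine-tuned pretrained models:
model = LateFusionSER(nn.Linear(40, 192), nn.Linear(768, 768), 192, 768)
logits = model(torch.randn(2, 40), torch.randn(2, 768))
```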
- End-to-End Speech Emotion Recognition: Challenges of Real-Life Emergency Call Centers Data Recordings [0.0]
End-to-end deep learning systems for speech emotion recognition now achieve equivalent or even better results than conventional machine learning approaches.
We first trained and tested it on IEMOCAP, a widely used corpus accessible to the community.
We then applied the same architecture to the real-life corpus CEMO, composed of 440 dialogs (2h16m) from 485 speakers.
arXiv Detail & Related papers (2021-10-28T08:56:57Z)
- Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction.
To the best of our knowledge, this is the first work to introduce a stimuli selection process into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
- Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory (LSTM) units, as well as inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z)
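The weight-inflation step mentioned in the entry above (building a 3D-CNN from a pre-trained 2D-CNN) follows the well-known I3D recipe: repeat each 2D kernel along a new temporal axis and rescale, so the inflated filter initially responds to a static clip exactly as the 2D filter did to a single frame. A sketch of that trick, not necessarily the paper's exact procedure:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pre-trained 2D convolution into a 3D convolution."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel over time; divide so responses are preserved.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# A layer from a pre-trained image CNN becomes a video layer:
conv3d = inflate_conv2d(nn.Conv2d(3, 16, 3, padding=1))
out = conv3d(torch.randn(1, 3, 8, 32, 32))      # (batch, ch, time, H, W)
```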
- Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks [6.676572642463495]
We propose a system that is able to recognize emotions with a high accuracy rate and in real time.
To increase the accuracy of the recognition system, we also analyze the speech data and fuse the information coming from both sources.
arXiv Detail & Related papers (2020-02-29T22:09:46Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attention into a visual 3D CNN, and temporal attention into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
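Temporal attention of the kind the VAANet summary mentions can be realized as a score-and-pool module: score each time step's feature vector, softmax the scores, and return the weighted average. This is one plausible reading of the summary above, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attention-weighted pooling over the time axis."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                   # feats: (B, T, dim)
        weights = torch.softmax(self.score(feats), dim=1)  # (B, T, 1)
        return (weights * feats).sum(dim=1)                # (B, dim)

# Pool per-segment CNN features into one clip-level descriptor:
clip_feat = TemporalAttention(dim=512)(torch.randn(4, 10, 512))  # (4, 512)
```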
This list is automatically generated from the titles and abstracts of the papers on this site.