End-to-End Speech Emotion Recognition: Challenges of Real-Life Emergency
Call Centers Data Recordings
- URL: http://arxiv.org/abs/2110.14957v1
- Date: Thu, 28 Oct 2021 08:56:57 GMT
- Title: End-to-End Speech Emotion Recognition: Challenges of Real-Life Emergency
Call Centers Data Recordings
- Authors: Théo Deschamps-Berger (LISN, CNRS), Lori Lamel (LISN, CNRS),
Laurence Devillers (LISN, CNRS, SU)
- Abstract summary: End-to-end deep learning systems for speech emotion recognition now achieve results equivalent to, or even better than, conventional machine learning approaches.
We first trained and tested our architecture on IEMOCAP, a widely used corpus accessible to the community.
We then applied the same architecture to the real-life corpus CEMO, composed of 440 dialogs (2h16m) from 485 speakers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing a speaker's emotion from their speech can be a key element in
emergency call centers. End-to-end deep learning systems for speech emotion
recognition now achieve results equivalent to or even better than conventional
machine learning approaches. In this paper, in order to validate the
performance of our neural network architecture for emotion recognition from
speech, we first trained and tested it on the widely used corpus accessible to
the community, IEMOCAP. We then applied the same architecture to the real-life
corpus, CEMO, composed of 440 dialogs (2h16m) from 485 speakers. The most
frequent emotions expressed by callers in these real-life emergency dialogues
are fear, anger and positive emotions such as relief. In the IEMOCAP general
topic conversations, the most frequent emotions are sadness, anger and
happiness. Using the same end-to-end deep learning architecture, an Unweighted
Accuracy (UA) of 63% is obtained on IEMOCAP and a UA of 45.6% on CEMO,
each with 4 classes. Using only 2 classes (Anger, Neutral), the results for
CEMO are 76.9% UA compared to 81.1% UA for IEMOCAP. We expect that these
encouraging results with CEMO can be improved by combining the audio channel
with the linguistic channel. Real-life emotions are clearly more complex than
acted ones, mainly due to the large diversity of emotional expressions of
speakers. Index Terms: emotion detection, end-to-end deep learning architecture,
call center, real-life database, complex emotions.
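For readers unfamiliar with the two scores quoted above, the short sketch below shows how they are typically computed in speech emotion recognition, assuming the usual definitions: UA as the macro-average of per-class recalls (every class counts equally) and WA as plain overall accuracy. The class labels and toy predictions are illustrative only, not data from the paper.
```python
# Minimal sketch of the two scores reported in the abstract, under the usual
# SER definitions: UA = macro-average of per-class recalls, WA = overall accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical reference and predicted labels for a 4-class task
# (e.g. fear, anger, positive, neutral in a CEMO-style annotation).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3, 0, 3])

# UA: average the recall of each class, so rare classes weigh as much as frequent ones.
ua = recall_score(y_true, y_pred, average="macro")

# WA: plain accuracy, dominated by the majority class.
wa = accuracy_score(y_true, y_pred)

print(f"UA = {ua:.3f}, WA = {wa:.3f}")  # UA = 0.688, WA = 0.700 for this toy batch
```
With these definitions, UA is the more informative figure for imbalanced real-life data such as emergency calls, since frequent classes cannot mask poor recall on rare ones.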
Related papers
- Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare [0.0]
The process of identifying human emotions and affective states from speech is known as speech emotion recognition (SER).
My research uses a Convolutional Neural Network (CNN) to distinguish emotions in audio recordings and label them according to a range of different emotions.
I developed a machine learning model that identifies emotions from supplied audio files.
arXiv Detail & Related papers (2024-06-15T21:33:03Z)
- Think out Loud: Emotion Deducing Explanation in Dialogues [57.90554323226896]
We propose a new task "Emotion Deducing Explanation in Dialogues" (EDEN)
EDEN recognizes emotion and causes in an explicitly thinking way.
It can help Large Language Models (LLMs) achieve better recognition of emotions and causes.
arXiv Detail & Related papers (2024-06-07T08:58:29Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multi-modal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- The Role of Phonetic Units in Speech Emotion Recognition [22.64187265473794]
We propose a method for emotion recognition through emotion-dependent speech recognition using Wav2vec 2.0.
Models of phonemes, broad phonetic classes, and syllables all significantly outperform the utterance model.
Wav2vec 2.0 can be fine-tuned to recognize coarser-grained or larger phonetic units than phonemes (a minimal fine-tuning sketch is given after this list).
arXiv Detail & Related papers (2021-08-02T19:19:47Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Emotion Recognition in Audio and Video Using Deep Neural Networks [9.694548197876868]
With the advancement of deep learning technology, there has been significant improvement in speech recognition.
Recognizing emotion from speech is an important aspect, and with deep learning technology, emotion recognition has improved in accuracy and latency.
In this work, we explore different neural networks to improve the accuracy of emotion recognition.
arXiv Detail & Related papers (2020-06-15T04:50:18Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance, was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
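As referenced in the Wav2vec 2.0 entry above, the following is a minimal, hedged sketch of the general idea of fine-tuning a Wav2vec 2.0 encoder for a downstream classification task. It is not the cited paper's emotion-dependent speech recognition method; it assumes the Hugging Face transformers and PyTorch packages, the public facebook/wav2vec2-base checkpoint, and a hypothetical 4-class emotion label set.
```python
# Illustrative sketch only: fine-tuning a Wav2vec 2.0 encoder as a 4-class
# emotion classifier with Hugging Face transformers. This is NOT the cited
# paper's emotion-dependent ASR setup, just the generic "fine-tune Wav2vec 2.0
# on a downstream task" recipe.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4  # e.g. fear, anger, positive, neutral
)

# Toy batch: two one-second 16 kHz clips (random noise stands in for real speech).
waveforms = [torch.randn(16000).numpy(), torch.randn(16000).numpy()]
labels = torch.tensor([0, 2])

inputs = extractor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
optimizer.zero_grad()
outputs = model(input_values=inputs["input_values"], labels=labels)
outputs.loss.backward()  # one gradient step on the classification loss
optimizer.step()

print(outputs.logits.shape)  # torch.Size([2, 4]): one score per clip and emotion class
```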
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.