FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2109.07916v1
- Date: Wed, 15 Sep 2021 05:03:24 GMT
- Title: FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition
- Authors: Bonaventure F. P. Dossou and Yeno K. S. Gbenou
- Abstract summary: We introduce FSER, a speech emotion recognition model trained on four valid speech databases.
On each benchmark dataset, FSER outperforms the best models introduced so far, achieving state-of-the-art performance.
FSER could potentially be used to improve mental and emotional health care.
- Score: 0.015863809575305417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using mel-spectrograms over conventional MFCC features, we assess the
abilities of convolutional neural networks to accurately recognize and classify
emotions from speech data. We introduce FSER, a speech emotion recognition
model trained on four valid speech databases, achieving a high classification
accuracy of 95.05% over 8 different emotion classes: anger, anxiety, calm,
disgust, happiness, neutral, sadness, and surprise. On each benchmark dataset, FSER
outperforms the best models introduced so far, achieving state-of-the-art
performance. We show that FSER remains reliable independently of language,
sex identity, and other external factors. Additionally, we describe how FSER
could potentially be used to improve mental and emotional health care and how
our analysis and findings serve as guidelines and benchmarks for further works
in the same direction.
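The paper does not include code, but the pipeline the abstract describes (mel-spectrograms fed to a convolutional classifier over 8 emotion classes) can be sketched as below. This is a minimal illustration under assumed layer sizes and hyperparameters, not the FSER architecture itself.

```python
# Minimal sketch of a mel-spectrogram -> CNN emotion classifier.
# Layer sizes and hyperparameters are illustrative assumptions,
# not the FSER architecture from the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn

EMOTIONS = ["anger", "anxiety", "calm", "disgust",
            "happiness", "neutral", "sadness", "surprise"]

def melspectrogram(path, sr=16000, n_mels=128):
    """Load audio and convert it to a log-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed size regardless of clip length
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_classes)

    def forward(self, x):                   # x: (batch, 1, n_mels, frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Usage: class probabilities for one (hypothetical) clip.
# spec = melspectrogram("clip.wav")
# x = torch.tensor(spec).unsqueeze(0).unsqueeze(0).float()
# probs = EmotionCNN()(x).softmax(dim=-1)
```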
Related papers
- Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT [0.0]
We study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice.
The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB.
arXiv Detail & Related papers (2024-11-05T10:06:40Z)
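As a rough illustration of the feature-extraction step this entry describes, the sketch below mean-pools Wav2Vec2 hidden states into an utterance-level embedding via the Hugging Face transformers API. The checkpoint name and the pooling choice are assumptions; a HuBERT model would be used the same way.

```python
# Sketch: utterance-level features from a pretrained Wav2Vec2 model.
# The checkpoint and mean-pooling choice are assumptions, not
# necessarily what the paper uses; a HuBERT checkpoint would be
# swapped in the same way (HubertModel).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base"                      # hypothetical choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

def utterance_embedding(waveform, sr=16000):
    """Mean-pool the last hidden states into one fixed-size vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)             # (768,)

# A downstream emotion classifier (e.g. an MLP over this vector)
# would then be trained on RAVDESS, SHEMO, SAVEE, AESDD, or Emo-DB.
```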
- Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs [2.8728982844941178]
Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language.
We propose a novel approach that first refines all available transcriptions to ensure data reliability.
We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue.
arXiv Detail & Related papers (2024-10-27T04:23:34Z)
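The entry gives no implementation details, but its core idea, classifying a target utterance with the surrounding dialogue as context, amounts to prompt construction. A minimal sketch, with the prompt wording, label set, and context window all assumed:

```python
# Sketch: build an LLM prompt that classifies a target utterance
# using its surrounding dialogue as context. Prompt wording, label
# set, and window size are assumptions, not the paper's design.
LABELS = ["angry", "happy", "neutral", "sad"]        # hypothetical label set

def emotion_prompt(dialogue, target_idx, window=3):
    """dialogue: list of (speaker, refined_transcript) tuples."""
    start = max(0, target_idx - window)
    context = dialogue[start:target_idx + 1]
    lines = [f"{spk}: {text}" for spk, text in context]
    speaker, target = dialogue[target_idx]
    return (
        "Dialogue:\n" + "\n".join(lines) + "\n\n"
        f"Classify the emotion of the last utterance by {speaker} "
        f"as one of {', '.join(LABELS)}. Answer with one word:\n"
        f'"{target}"'
    )

# The returned string would be sent to any chat-completion API; the
# refined transcripts come from the paper's transcription cleanup step.
```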
- Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare [0.0]
The process of identifying human emotion and affective states from speech is known as speech emotion recognition (SER).
My research uses a convolutional neural network (CNN) to distinguish emotions from audio recordings and label them according to the range of different emotions.
I developed a machine learning model that identifies emotions from supplied audio files.
arXiv Detail & Related papers (2024-06-15T21:33:03Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
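The reported result is a high correlation between predicted and ground-truth strength scores; a minimal version of that evaluation, with placeholder arrays standing in for real model outputs, could look like this:

```python
# Sketch: correlation between predicted and ground-truth emotion
# strength, computed separately for seen and unseen speech. The
# arrays here are placeholders, not the paper's data.
import numpy as np

def strength_correlation(predicted, ground_truth):
    """Pearson correlation coefficient between two score arrays."""
    return np.corrcoef(predicted, ground_truth)[0, 1]

for split, (pred, gold) in {
    "seen":   (np.array([0.2, 0.5, 0.9]), np.array([0.25, 0.55, 0.8])),
    "unseen": (np.array([0.1, 0.6, 0.7]), np.array([0.2, 0.5, 0.75])),
}.items():
    print(split, round(strength_correlation(pred, gold), 3))
```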
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
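One hedged reading of that design: separate content and style encoders, with a scalar intensity scaling the emotion (style) embedding before decoding. All module shapes below are assumptions, and the paper's actual intensity modeling is more elaborate.

```python
# Sketch of the disentanglement idea: separate content and style
# encoders, with a scalar intensity scaling the emotion embedding.
# All module shapes are assumptions, not the paper's networks.
import torch
import torch.nn as nn

class EmotionIntensityEVC(nn.Module):
    def __init__(self, feat_dim=80, content_dim=128, style_dim=64):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, content_dim, batch_first=True)
        self.style_enc = nn.Sequential(nn.Linear(feat_dim, style_dim), nn.Tanh())
        self.decoder = nn.Linear(content_dim + style_dim, feat_dim)

    def forward(self, src, ref, intensity):
        """src: source frames (B, T, feat); ref: reference emotion frames;
        intensity: scalar in [0, 1] scaling the emotion embedding."""
        content, _ = self.content_enc(src)          # linguistic content
        style = self.style_enc(ref).mean(dim=1)     # utterance-level style
        style = intensity * style                   # explicit intensity control
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, style], dim=-1))

# converted = EmotionIntensityEVC()(src_mel, ref_mel, intensity=0.5)
```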
- Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings [56.93025161787725]
Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing local data.
We propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters.
We show that the attribute inference attack is achievable for SER systems trained using FL.
arXiv Detail & Related papers (2021-12-26T16:50:42Z)
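A sketch of the attack setting described here: the adversary observes a client's shared update and trains a small classifier to predict a sensitive attribute from the flattened vector. The attribute (gender), dimensions, and attacker architecture are assumed for illustration.

```python
# Sketch of a gradient-based attribute inference attack: an MLP is
# trained to predict a sensitive attribute (gender, as an assumed
# example) from a client's flattened model update. Dimensions and
# attacker architecture are assumptions, not the paper's setup.
import torch
import torch.nn as nn

def flatten_update(shared_update):
    """Concatenate a client's shared gradients/parameters into one vector."""
    return torch.cat([p.detach().flatten() for p in shared_update])

class AttributeAttacker(nn.Module):
    def __init__(self, update_dim, n_attr=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(update_dim, 256), nn.ReLU(),
            nn.Linear(256, n_attr),
        )

    def forward(self, flat_update):
        return self.net(flat_update)

# Attacker-side training: pairs of (observed update, known attribute)
# from shadow clients, optimized with cross-entropy.
# attacker = AttributeAttacker(update_dim=flat.numel())
# loss = nn.CrossEntropyLoss()(attacker(flat.unsqueeze(0)), label)
```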
- StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis [82.39099867188547]
We propose a deep learning based emotion strength assessment network for strength prediction that is referred to as StrengthNet.
Our model conforms to a multi-task learning framework with a structure that includes an acoustic encoder, a strength predictor and an auxiliary emotion predictor.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for seen and unseen speech.
arXiv Detail & Related papers (2021-10-07T03:16:15Z)
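The entry names the three components directly (acoustic encoder, strength predictor, auxiliary emotion predictor); a minimal multi-task skeleton under assumed dimensions and loss weighting:

```python
# Minimal multi-task skeleton matching the components the entry
# names: a shared acoustic encoder, a strength regressor, and an
# auxiliary emotion classifier. Dimensions and loss weighting are
# assumptions, not the published StrengthNet.
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, n_emotions=5):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.strength_head = nn.Linear(hidden, 1)           # scalar strength
        self.emotion_head = nn.Linear(hidden, n_emotions)   # auxiliary task

    def forward(self, mel):                 # mel: (B, T, feat_dim)
        _, h = self.encoder(mel)            # h: (1, B, hidden)
        h = h.squeeze(0)
        return self.strength_head(h).squeeze(-1), self.emotion_head(h)

# Joint loss: MSE on strength plus an (assumed) weighted cross-entropy
# on the auxiliary emotion labels.
# strength, emo_logits = StrengthNetSketch()(mel)
# loss = nn.MSELoss()(strength, gold) + 0.5 * nn.CrossEntropyLoss()(emo_logits, emo)
```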
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
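The described front-end, predicting an emotion label from the input text alone and conditioning synthesis on the corresponding emotion embedding, can be sketched roughly; the bag-of-characters text encoder and all dimensions below are placeholders.

```python
# Sketch of the described front-end: predict an emotion label from
# the input text, then condition synthesis on that emotion's
# embedding. The bag-of-characters text encoder and all dimensions
# are placeholders, not the paper's model.
import torch
import torch.nn as nn

class TextEmotionFrontEnd(nn.Module):
    def __init__(self, vocab=256, text_dim=64, n_emotions=5, emo_dim=32):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab, text_dim)   # crude text encoder
        self.emotion_clf = nn.Linear(text_dim, n_emotions)
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)

    def forward(self, char_ids):                 # char_ids: (B, L) int tensor
        t = self.text_emb(char_ids)
        label = self.emotion_clf(t).argmax(dim=-1)   # predicted emotion label
        return t, self.emotion_emb(label)        # text features + emotion embedding

# The TTS decoder (omitted) would consume both tensors; training uses
# the dataset's 9,724 human-labeled samples for the classifier.
```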
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance, was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
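As a hedged illustration of mapping primitives to categorical emotions, the sketch below fits a classifier over (arousal, valence, dominance) triples; the data is synthetic and the labeling rule is a stand-in for real annotations.

```python
# Sketch: classify categorical emotions from emotion primitives
# (arousal, valence, dominance). The toy data is synthetic; the
# paper estimates the primitives from speech first.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Columns: arousal, valence, dominance (assumed 0-1 scale).
X = rng.random((200, 3))
# Toy rule standing in for real labels: high arousal -> "anger".
y = np.where(X[:, 0] > 0.6, "anger", "neutral")

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.9, 0.2, 0.7]]))   # e.g. ['anger']

# Per-primitive detectors (fit on one column at a time) would let one
# compare arousal vs. dominance as the entry's result does.
```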
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.