EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks
- URL: http://arxiv.org/abs/2310.12851v1
- Date: Thu, 19 Oct 2023 16:02:53 GMT
- Title: EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks
- Authors: Hanan Hamza, Fiza Gafoor, Fathima Sithara, Gayathri Anil, V. S. Anoop
- Abstract summary: This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline with an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating its effectiveness in identifying emotional states within speech signals.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the era of advanced artificial intelligence and human-computer
interaction, identifying emotions in spoken language is paramount. This
research explores the integration of deep learning techniques in speech emotion
recognition, offering a comprehensive solution to the challenges associated
with speaker diarization and emotion identification. It introduces a framework
that combines a pre-existing speaker diarization pipeline and an emotion
identification model built on a Convolutional Neural Network (CNN) to achieve
higher precision. The proposed model was trained on data from five speech
emotion datasets, namely RAVDESS, CREMA-D, SAVEE, TESS, and Movie Clips, the
last of which is a speech emotion dataset created specifically for this
research. The features extracted from each sample are Mel Frequency Cepstral
Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square (RMS)
energy, and the training data is expanded with augmentation techniques such as
pitch shifting, noise injection, time stretching, and time shifting. This
feature extraction approach aims to enhance prediction accuracy while reducing
computational complexity. The proposed model yields an unweighted accuracy of
63%, demonstrating its effectiveness in identifying emotional states within
speech signals.
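As an illustration of the pipeline the abstract describes, the sketch below
computes the named features (MFCC, ZCR, RMS) and the four augmentations
(pitch, noise, stretch, shift) with librosa. This is an assumed
reconstruction, not the authors' code; the file name, sampling rate, and all
parameter values are placeholders.
```python
# Illustrative feature extraction and augmentation with librosa (assumed
# reconstruction; "clip.wav" and all parameter values are placeholders).
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Concatenate clip-level mean MFCC, ZCR, and RMS features."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean(axis=1)
    rms = librosa.feature.rms(y=y).mean(axis=1)
    return np.concatenate([mfcc, zcr, rms])

def augment(y: np.ndarray, sr: int) -> list:
    """Pitch-shifted, noisy, time-stretched, and time-shifted variants."""
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    noisy = y + 0.005 * np.random.randn(len(y))
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    shifted = np.roll(y, sr // 10)  # shift by 100 ms
    return [pitched, noisy, stretched, shifted]

y, sr = librosa.load("clip.wav", sr=16000)
features = [extract_features(v, sr) for v in [y, *augment(y, sr)]]
```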
Related papers
- SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals.
This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER.
It exploits the capabilities of wavelets for effective localization in both time and frequency domains.
arXiv Detail & Related papers (2025-02-01T04:18:06Z)
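The SigWavNet entry above hinges on the time-frequency localization of
wavelets. As background only (SigWavNet learns its decomposition end-to-end,
which this does not reproduce), a plain multilevel discrete wavelet transform
with PyWavelets looks like this:
```python
# Generic multiresolution wavelet decomposition with PyWavelets, shown only
# to illustrate time-frequency localization; not SigWavNet's learned network.
import numpy as np
import pywt

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)

# A 4-level discrete wavelet transform: one coarse approximation band plus
# detail bands at successively finer time resolutions.
coeffs = pywt.wavedec(signal, wavelet="db4", level=4)
for name, c in zip(["A4", "D4", "D3", "D2", "D1"], coeffs):
    print(f"{name}: {len(c)} coefficients")
```
Each detail band isolates a frequency range while keeping its coefficients
ordered in time, which is the localization property the entry refers to.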
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
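A minimal PyTorch sketch of cross-attention fusion in the spirit of the
HuMP-CAT entry above; the dimensions, the mean pooling, and the choice of
which stream queries which are illustrative assumptions, not the paper's
configuration:
```python
# Cross-attention fusion of two feature streams (illustrative sketch).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, n_classes: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, mfcc_seq, ssl_seq):
        # MFCC frames query the self-supervised (HuBERT-like) frames.
        fused, _ = self.attn(query=mfcc_seq, key=ssl_seq, value=ssl_seq)
        return self.head(fused.mean(dim=1))  # pool over time, then classify

model = CrossAttentionFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 50, 256))
```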
- Improving speaker verification robustness with synthetic emotional utterances [14.63248006004598]
A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker.
Previous models exhibit high error rates when dealing with emotional utterances compared to neutral ones.
This issue primarily stems from the limited availability of labeled emotional speech data.
We propose a novel approach employing the CycleGAN framework as a data augmentation method.
arXiv Detail & Related papers (2024-11-30T02:18:26Z)
- Learning Speech Emotion Representations in the Quaternion Domain [16.596137913051212]
RH-emo is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms.
It is a hybrid real/quaternion autoencoder network consisting of a real-valued encoder, in parallel with a real-valued emotion classifier, and a quaternion-valued decoder.
We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EMO-DB, and TESS.
arXiv Detail & Related papers (2022-04-05T17:45:09Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
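Late fusion as in the entry above can be sketched as a small head over pooled
modality embeddings; the encoder output sizes and class count below are
placeholders, not the paper's settings:
```python
# Schematic late-fusion head: embeddings from pretrained speech and text
# encoders are concatenated and classified only at the final stage.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, speech_dim=512, text_dim=768, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, speech_emb, text_emb):
        # Late fusion: modalities stay separate until this concatenation.
        return self.head(torch.cat([speech_emb, text_emb], dim=-1))

clf = LateFusionClassifier()
logits = clf(torch.randn(8, 512), torch.randn(8, 768))
```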
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
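Spectrogram augmentation often takes the SpecAugment-style form of random
time and frequency masking; the sketch below shows that common variant in
NumPy, though the paper's exact scheme may differ:
```python
# SpecAugment-style masking: zero a random frequency band and time span.
import numpy as np

def mask_spectrogram(spec, max_f=8, max_t=20, rng=None):
    """Return a copy with one frequency mask and one time mask applied."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f0 = rng.integers(0, n_freq - max_f)  # start of frequency mask
    t0 = rng.integers(0, n_time - max_t)  # start of time mask
    spec[f0:f0 + rng.integers(1, max_f + 1), :] = 0.0
    spec[:, t0:t0 + rng.integers(1, max_t + 1)] = 0.0
    return spec

augmented = mask_spectrogram(np.random.rand(128, 400))  # (mel bins, frames)
```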
- Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
arXiv Detail & Related papers (2021-06-08T06:45:42Z)
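A generic reading of "multi-scale convolutional layers" is a bank of parallel
convolutions with different kernel sizes over the same sequence; the PyTorch
sketch below illustrates that idea with assumed sizes, not the paper's exact
MSCNN:
```python
# Parallel 1D convolutions at several kernel sizes, concatenated channel-wise.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch=40, out_ch=64, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in kernels
        )

    def forward(self, x):  # x: (batch, channels, time)
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

feats = MultiScaleConv()(torch.randn(2, 40, 300))  # -> (2, 192, 300)
```
The concatenated branches let a single layer see short and long temporal
contexts at once.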
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory (LSTM) units, as well as inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z)
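The CNN+LSTM combination described above can be sketched as a per-frame
2D-CNN encoder followed by a recurrent layer over time; the layer sizes and
the two-output head (e.g., valence and arousal) are assumptions for
illustration, not the paper's architecture:
```python
# Toy convolutional-recurrent model: 2D-CNN per frame, LSTM over time.
import torch
import torch.nn as nn

class ConvLSTMEmotion(nn.Module):
    def __init__(self, n_outputs=2):  # e.g. valence and arousal
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, n_outputs)

    def forward(self, clips):  # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frames = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(frames)
        return self.head(out[:, -1])  # predict from the last time step

preds = ConvLSTMEmotion()(torch.randn(2, 16, 3, 64, 64))
```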
- Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor [70.2226417364135]
It is crucial that the machine be able to recognize the emotional state of the user with high accuracy.
Deep neural networks have been used with great success in recognizing emotions.
We present a new model for continuous emotion recognition based on facial expression recognition.
arXiv Detail & Related papers (2020-01-31T17:47:16Z)