A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering
- URL: http://arxiv.org/abs/2507.07046v1
- Date: Wed, 09 Jul 2025 17:07:45 GMT
- Title: A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering
- Authors: Shahana Yasmin Chowdhury, Bithi Banik, Md Tamjidul Hoque, Shreya Banerjee,
- Abstract summary: Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise.<n>The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS.<n>For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, which are trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and Crema-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% for CREMA-D, and a perfect 100% on both TESS and EMO-DB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
Related papers
- One Model for All: Universal Pre-training for EEG based Emotion Recognition across Heterogeneous Datasets and Paradigms [9.873322204941394]
'One Model for All' is a universal pre-training framework for EEG analysis across disparate datasets.<n>Our framework achieves new SOTA performance on all within-subject benchmarks: SEED (99.27%), DEAP (93.69%), and DREAMER (93.93%)<n>This work paves the way for more universal, scalable, and effective pre-trained models for diverse EEG analysis tasks.
arXiv Detail & Related papers (2025-11-11T16:46:45Z) - Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices.<n>We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD)<n>The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller)
arXiv Detail & Related papers (2025-07-14T16:19:00Z) - Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention [0.5371337604556311]
Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification.<n>We use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception.<n>We propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques.
arXiv Detail & Related papers (2025-07-04T01:55:49Z) - Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes.<n>Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records.<n>The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z) - Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.<n>We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.<n>We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z) - Merging synthetic and real embryo data for advanced AI predictions [69.07284335967019]
We train two generative models using two datasets-one we created and made publicly available, and one existing public dataset-to generate synthetic embryo images at various cell stages.<n>These were combined with real images to train classification models for embryo cell stage prediction.<n>Our results demonstrate that incorporating synthetic images alongside real data improved classification performance, with the model achieving 97% accuracy compared to 94.5% when trained solely on real data.
arXiv Detail & Related papers (2024-12-02T08:24:49Z) - Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images [67.66644395272075]
We present first analysis of state-of-the-art semantic segmentation models when faced with geometric out-of-distribution data.
We propose an augmentation technique called "Organ Transplantation" to enhance generalizability.
Our augmentation technique improves SOA model performance by up to 67 % for RGB data and 90 % for HSI data, achieving performance at the level of in-distribution performance on real OOD test data.
arXiv Detail & Related papers (2024-08-27T19:13:15Z) - Evaluating raw waveforms with deep learning frameworks for speech
emotion recognition [0.0]
We represent a model, which feeds raw audio files directly into the deep neural networks without any feature extraction stage.
We use six different data sets, EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS.
The proposed model performs 90.34% of accuracy for EMO-DB with CNN model, 90.42% of accuracy for RAVDESS, 99.48% of accuracy for TESS with LSTM model, 69.72% of accuracy for CREMA with CNN model, 85.76% of accuracy for SAVEE with CNN model in
arXiv Detail & Related papers (2023-07-06T07:27:59Z) - LDEB -- Label Digitization with Emotion Binarization and Machine
Learning for Emotion Recognition in Conversational Dialogues [0.0]
Emotion recognition in conversations (ERC) is vital to the advancements of conversational AI and its applications.
The conversational dialogues present a unique problem where each dialogue depicts nested emotions that entangle the association between the emotional feature descriptors and emotion type (or label)
We proposed a novel approach called Label Digitization with Emotion Binarization (LDEB) that disentangles the twists by utilizing the text normalization and 7-bit digital encoding techniques.
arXiv Detail & Related papers (2023-06-03T20:37:46Z) - A Comparative Study of Data Augmentation Techniques for Deep Learning
Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
arXiv Detail & Related papers (2022-11-09T17:27:03Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.