Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture
- URL: http://arxiv.org/abs/2501.10666v1
- Date: Sat, 18 Jan 2025 06:15:54 GMT
- Title: Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture
- Authors: Qianhe Ouyang
- Abstract summary: This paper processes the initial audio input into waveplot and spectrum for analysis and concentrates on multiple features including MFCC as targets for feature extraction.
The architecture achieved an accuracy of 61.07% comprehensively for the test set, among which the detection of anger and neutral reaches a performance of 75.31% and 71.70% respectively.
- Abstract: Emotion detection techniques have been applied to multiple cases, mainly using facial image features and vocal audio features; the latter remains contested, not only because of the complexity of speech audio processing but also because of the difficulty of extracting appropriate features. Parts of the SAVEE and RAVDESS datasets are selected and combined into a single dataset containing seven common emotions (i.e. happy, neutral, sad, anger, disgust, fear, and surprise) and thousands of samples. Based on the Librosa package, this paper processes the initial audio input into waveplots and spectrograms for analysis and concentrates on multiple features, including MFCC, as targets for feature extraction. A hybrid CNN-LSTM architecture is adopted for its strong capability to handle sequential data and time series; it consists mainly of four convolutional layers and three long short-term memory layers. As a result, the architecture achieved an overall accuracy of 61.07% on the test set, with anger and neutral detected at 75.31% and 71.70% respectively. It can also be concluded that classification accuracy depends to some extent on the properties of each emotion: frequently expressed emotions with distinct features are less likely to be misclassified into other categories. Emotions such as surprise, whose meaning depends on the specific context, are more easily confused with positive or negative emotions, and negative emotions can also be confused with one another.
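Below is a minimal sketch of the pipeline described in the abstract, assuming Librosa for MFCC extraction and Keras for the model; this is not the authors' code. The four-convolutional/three-LSTM structure follows the abstract, while layer widths, kernel sizes, pooling, and the fixed frame length of 200 are illustrative assumptions.

```python
import librosa
import numpy as np
from tensorflow.keras import layers, models

def extract_mfcc(path, n_mfcc=40, max_frames=200):
    """Load an audio file and return a fixed-size MFCC matrix (n_mfcc x max_frames)."""
    y, sr = librosa.load(path, sr=None)                      # waveform (waveplot-style view)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every sample has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc

def build_cnn_lstm(n_mfcc=40, max_frames=200, n_classes=7):
    """Four Conv1D layers followed by three LSTM layers and a softmax over 7 emotions."""
    model = models.Sequential([
        layers.Input(shape=(max_frames, n_mfcc)),            # time steps x MFCC coefficients
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: one SAVEE/RAVDESS clip -> (frames, coefficients) input for the network.
# features = extract_mfcc("sample.wav").T[np.newaxis, ...]   # shape (1, 200, 40)
# model = build_cnn_lstm()
# model.summary()
```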
Related papers
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
- Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition [62.13177498013144]
We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition.
The softmax cross-entropy loss makes features from different emotion categories separable, while the center loss pulls features belonging to the same emotion category toward their class center.
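For reference, such a joint objective is typically written as the softmax cross-entropy term plus a center-loss penalty on the distance between each deep feature x_i and the center c_{y_i} of its emotion class y_i, with lambda balancing the two terms (standard center-loss formulation, stated here as an assumption about the setup rather than a quotation from the paper):

```latex
L = L_{\text{softmax}} + \frac{\lambda}{2} \sum_{i=1}^{m} \bigl\lVert x_i - c_{y_i} \bigr\rVert_2^2
```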
arXiv Detail & Related papers (2025-01-02T06:52:28Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning [55.127202990679976]
We introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories.
This dataset enables models to learn from varied scenarios and generalize to real-world applications.
We propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders.
arXiv Detail & Related papers (2024-06-17T03:01:22Z)
- Construction and Evaluation of Mandarin Multimodal Emotional Speech Database [0.0]
The validity of the dimensional annotation is verified by statistical analysis of the annotation data.
The recognition rate of seven emotions is about 82% when using acoustic data alone.
The database is of high quality and can be used as an important source for speech analysis research.
arXiv Detail & Related papers (2024-01-14T17:56:36Z)
- An Extended Variational Mode Decomposition Algorithm Developed Speech Emotion Recognition Performance [15.919990281329085]
This study proposes VGG-optiVMD, an empowered variational mode decomposition algorithm, to distinguish meaningful speech features.
Various feature vectors were employed to train the VGG16 network on different databases and to assess the reliability of VGG-optiVMD.
Results confirmed a synergistic relationship between the fine-tuning of the signal sample rate and decomposition parameters and the resulting classification accuracy.
arXiv Detail & Related papers (2023-12-18T05:24:03Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
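A rough sketch of this two-stage structure follows; `diarize` and `emotion_cnn` are hypothetical placeholders standing in for the pre-existing diarization pipeline and the CNN classifier, so the code illustrates the data flow rather than the paper's implementation.

```python
import numpy as np

def classify_emotions_per_speaker(waveform, sr, diarize, emotion_cnn, labels):
    """Two-stage pipeline: diarize the recording, then classify each speaker turn.

    `diarize` is a hypothetical callable returning (start_s, end_s, speaker_id) tuples,
    and `emotion_cnn` is a hypothetical per-segment classifier over the emotion `labels`.
    """
    results = []
    for start, end, speaker in diarize(waveform, sr):
        segment = waveform[int(start * sr):int(end * sr)]
        probs = emotion_cnn(segment, sr)               # e.g. MFCC -> CNN -> softmax scores
        results.append({
            "speaker": speaker,
            "start": start,
            "end": end,
            "emotion": labels[int(np.argmax(probs))],
        })
    return results
```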
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation [69.13075715686622]
MuSe 2023 is a set of shared tasks addressing three different contemporary multimodal affect and sentiment analysis problems.
MuSe 2023 seeks to bring together a broad audience from different research communities.
arXiv Detail & Related papers (2023-05-05T08:53:57Z)
- Emotional Expression Detection in Spoken Language Employing Machine Learning Algorithms [0.0]
The human voice has a variety of features that can be classified as pitch, timbre, loudness, and vocal tone.
It has been observed in numerous instances that humans express their feelings through different vocal qualities when speaking.
The primary objective of this research is to recognize different human emotions using several types of features, namely spectral descriptors, periodicity, and harmonicity.
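A minimal sketch of how such descriptors could be computed with Librosa; the specific functions and the harmonic-energy ratio below are illustrative assumptions, not the paper's exact feature set.

```python
import librosa
import numpy as np

def voice_quality_features(path):
    """Summarise a clip with spectral, periodicity and harmonicity style descriptors."""
    y, sr = librosa.load(path, sr=None)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # spectral "brightness"
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)                               # loudness proxy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=2093.0, sr=sr)  # periodicity / pitch
    harmonic = librosa.effects.harmonic(y)                       # harmonic component of the signal
    harmonic_ratio = np.sum(harmonic**2) / (np.sum(y**2) + 1e-10)  # rough harmonic-energy ratio
    return {
        "centroid_mean": float(np.mean(centroid)),
        "rolloff_mean": float(np.mean(rolloff)),
        "rms_mean": float(np.mean(rms)),
        "f0_mean": float(np.nanmean(f0)),                        # pyin marks unvoiced frames as NaN
        "voiced_ratio": float(np.mean(voiced_flag)),
        "harmonic_ratio": float(harmonic_ratio),
    }
```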
arXiv Detail & Related papers (2023-04-20T17:57:08Z)
- Feature Selection Enhancement and Feature Space Visualization for Speech-Based Emotion Recognition [2.223733768286313]
We present a speech feature enhancement strategy that improves speech emotion recognition.
The strategy is compared with the state-of-the-art methods used in the literature.
Our method achieved an average recognition gain of 11.5% for six out of seven emotions for the EMO-DB dataset, and 13.8% for seven out of eight emotions for the RAVDESS dataset.
arXiv Detail & Related papers (2022-08-19T11:29:03Z)
- Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)