Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
- URL: http://arxiv.org/abs/2307.02820v1
- Date: Thu, 6 Jul 2023 07:27:59 GMT
- Title: Evaluating raw waveforms with deep learning frameworks for speech emotion recognition
- Authors: Zeynep Hilal Kilimci, Ulku Bayraktar, Ayhan Kucukmanisa
- Abstract summary: We present a model that feeds raw audio files directly into deep neural networks without any feature extraction stage.
We use six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS.
The proposed model achieves 90.34% accuracy for EMO-DB with the CNN model, 90.42% for RAVDESS, 99.48% for TESS with the LSTM model, 69.72% for CREMA with the CNN model, and 85.76% for SAVEE with the CNN model in speaker-independent audio categorization problems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech emotion recognition is a challenging task in the speech
processing field. For this reason, the feature extraction process is of
crucial importance for representing and processing speech signals. In this
work, we present a model that feeds raw audio files directly into deep neural
networks without any feature extraction stage, recognizing emotions across six
different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To
demonstrate the contribution of the proposed model, the performance of
traditional feature extraction techniques, namely the mel-scale spectrogram
and mel-frequency cepstral coefficients, is evaluated in combination with
machine learning algorithms, ensemble learning methods, and deep and hybrid
deep learning techniques. Support vector machine, decision tree, naive Bayes,
and random forest models are evaluated as machine learning algorithms, while
majority voting and stacking are assessed as ensemble learning techniques.
Moreover, convolutional neural networks, long short-term memory networks, and
a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared
with the machine learning and ensemble learning methods. To demonstrate the
effectiveness of the proposed model, a comparison with state-of-the-art
studies is carried out. Based on the experimental results, the CNN model
surpasses existing approaches with 95.86% accuracy for the TESS+RAVDESS data
set using raw audio files, thereby setting a new state of the art. The
proposed model achieves 90.34% accuracy for EMO-DB with the CNN model, 90.42%
for RAVDESS with the CNN model, 99.48% for TESS with the LSTM model, 69.72%
for CREMA with the CNN model, and 85.76% for SAVEE with the CNN model in
speaker-independent audio categorization problems.
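To make the paper's central idea concrete, the following is a minimal sketch of a raw-waveform classifier in PyTorch. The layer sizes, kernel widths, and the `NUM_EMOTIONS` constant are illustrative assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # e.g. EMO-DB distinguishes 7 emotion classes

class RawWaveformCNN(nn.Module):
    """1D CNN that consumes raw audio samples, with no hand-crafted features."""
    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            # a wide first kernel acts as a learned filterbank over raw samples
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool to a fixed-size embedding
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, num_samples), e.g. 3 s at 16 kHz = 48000 samples
        x = self.features(waveform).squeeze(-1)
        return self.classifier(x)

logits = RawWaveformCNN()(torch.randn(8, 1, 48000))  # -> (8, NUM_EMOTIONS)
```

The wide first convolution plays the role a hand-crafted filterbank (e.g. the mel filters behind a spectrogram or MFCCs) would otherwise play, which is what skipping the feature extraction stage amounts to.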
Related papers
- Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition [0.0]
Speech emotion recognition (SER) classifies human emotions in speech with a computer model.
We propose a 6-layer convolutional neural network (CNN) model with efficient channel attention (ECA) to pursue an efficient model structure.
Experiments on the interactive emotional dyadic motion capture (IEMOCAP) dataset show that increasing the frequency resolution when preprocessing emotional speech can improve emotion recognition performance.
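The summary does not spell out the attention mechanism, so here is a rough sketch of an ECA-style channel attention block in PyTorch; the kernel size and its placement in the network are assumptions (ECA-Net derives the kernel size adaptively from the channel count).

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention: reweights channels via a cheap 1D conv."""
    def __init__(self, kernel_size: int = 3):  # fixed kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width), e.g. a spectrogram feature map
        y = x.mean(dim=(2, 3))          # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1))   # 1D conv across channels -> (B, 1, C)
        y = self.sigmoid(y).squeeze(1)  # per-channel weights in (0, 1)
        return x * y.unsqueeze(-1).unsqueeze(-1)  # rescale each channel

attended = ECABlock()(torch.randn(2, 64, 32, 32))  # output has the same shape
```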
arXiv Detail & Related papers (2024-09-06T03:17:25Z)
- Modeling & Evaluating the Performance of Convolutional Neural Networks for Classifying Steel Surface Defects [0.0]
Convolutional neural networks (CNNs) have recently achieved outstanding identification rates in image classification tasks.
DenseNet201 had the greatest detection rate on the NEU dataset, at 98.37 percent.
arXiv Detail & Related papers (2024-06-19T08:14:50Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- Speech Emotion Recognition Using Quaternion Convolutional Neural Networks [1.776746672434207]
This paper proposes a quaternion convolutional neural network (QCNN) based speech emotion recognition model.
Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain.
The model achieves an accuracy of 77.87%, 70.46%, and 88.78% for the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively.
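As a hedged sketch of the encoding step only (the quaternion convolution itself, built on the Hamilton product, is omitted): one common convention, assumed here, places the three color channels of the RGB mel-spectrogram image in the imaginary parts of a quaternion with a zero real part.

```python
import numpy as np

def rgb_to_quaternion(rgb_spec: np.ndarray) -> np.ndarray:
    """Encode an RGB mel-spectrogram image (H, W, 3) as a 4-channel quaternion array.

    Convention assumed here: q = 0 + R*i + G*j + B*k.
    """
    h, w, _ = rgb_spec.shape
    quat = np.zeros((h, w, 4), dtype=rgb_spec.dtype)
    quat[..., 1:] = rgb_spec  # real part stays zero; R, G, B -> i, j, k
    return quat

quat_input = rgb_to_quaternion(np.random.rand(128, 256, 3).astype(np.float32))
print(quat_input.shape)  # (128, 256, 4)
```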
arXiv Detail & Related papers (2021-10-31T04:06:07Z)
- Multi-Branch Deep Radial Basis Function Networks for Facial Emotion Recognition [80.35852245488043]
We propose a CNN based architecture enhanced with multiple branches formed by radial basis function (RBF) units.
RBF units capture local patterns shared by similar instances using an intermediate representation.
We show that it is the incorporation of local information that makes the proposed model competitive.
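For illustration, a minimal PyTorch sketch of Gaussian RBF units with learnable centers and widths follows; the shapes and initialization are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RBFUnits(nn.Module):
    """Gaussian RBF units: a_j(x) = exp(-||x - c_j||^2 / (2 * sigma_j^2))."""
    def __init__(self, in_features: int, num_units: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_units, in_features))
        self.log_sigma = nn.Parameter(torch.zeros(num_units))  # widths, log-scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features); squared distances to centers: (batch, num_units)
        sq_dists = torch.cdist(x, self.centers) ** 2
        return torch.exp(-sq_dists / (2 * torch.exp(self.log_sigma) ** 2))

activations = RBFUnits(in_features=512, num_units=16)(torch.randn(4, 512))
```

Each unit responds most strongly to inputs near its center, which is how local patterns shared by similar instances get captured.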
arXiv Detail & Related papers (2021-09-07T21:05:56Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
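The summary does not detail the augmentation policy; a SpecAugment-style time/frequency masking sketch in NumPy (mask sizes are assumptions) illustrates the general idea of augmenting spectrograms.

```python
import numpy as np

def augment_spectrogram(spec: np.ndarray,
                        max_freq_mask: int = 8,
                        max_time_mask: int = 20) -> np.ndarray:
    """Zero out one random frequency band and one random time span."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = np.random.randint(0, max_freq_mask + 1)   # frequency mask height
    f0 = np.random.randint(0, n_mels - f + 1)
    spec[f0:f0 + f, :] = 0.0
    t = np.random.randint(0, max_time_mask + 1)   # time mask width
    t0 = np.random.randint(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec

augmented = augment_spectrogram(np.random.rand(80, 300))  # (n_mels, n_frames)
```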
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
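As a loose illustration only, not the ALT-MAS procedure itself: Monte Carlo dropout is one cheap stand-in for a BNN's predictive distribution, and an expected-accuracy estimate can be read off the predictive probabilities when they are reasonably calibrated. All names and shapes below are assumptions.

```python
import torch
import torch.nn as nn

def mc_dropout_accuracy_estimate(model: nn.Module, x: torch.Tensor,
                                 n_samples: int = 30) -> float:
    """Estimate expected accuracy on unlabeled inputs from MC-dropout predictions."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)]).mean(0)
    # expected accuracy = mean confidence of the predicted class, if calibrated
    return probs.max(dim=-1).values.mean().item()

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 5))
estimate = mc_dropout_accuracy_estimate(net, torch.randn(100, 20))
```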
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy [6.2698513174194215]
This paper studies the effect of the number of filters in convolutional layers on the prediction accuracy of CNN+RNN (convolutional networks feeding recurrent networks) models for automatic speech recognition (ASR).
Experimental results show that adding a CNN to an RNN improves the performance of the CNN+RNN speech recognition model only when the number of CNN filters exceeds a certain threshold.
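A minimal sketch of the kind of CNN+RNN stack under study, with the number of convolutional filters exposed as the hyperparameter of interest; all other sizes are assumptions.

```python
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    """Convolutional front end feeding a recurrent layer; num_filters is the knob."""
    def __init__(self, num_filters: int, n_mels: int = 80, num_classes: int = 30):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, num_filters, kernel_size=5, padding=2)
        self.rnn = nn.GRU(num_filters, 128, batch_first=True)
        self.out = nn.Linear(128, num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, frames)
        x = torch.relu(self.conv(spec)).transpose(1, 2)  # (batch, frames, filters)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])  # classify from the final frame's hidden state

for k in (8, 32, 128):  # sweep the filter count, as the paper's experiments do
    logits = CNNRNN(num_filters=k)(torch.randn(4, 80, 100))
```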
arXiv Detail & Related papers (2021-02-03T23:04:38Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
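The oracle rests on the ideal ratio mask; a minimal NumPy sketch of an IRM computed from magnitude spectrograms follows (the paper's exact oracle definition may differ).

```python
import numpy as np

def ideal_ratio_mask(source_mag: np.ndarray, mix_mag: np.ndarray,
                     eps: float = 1e-8) -> np.ndarray:
    """IRM: fraction of the mixture magnitude attributable to one source."""
    return source_mag / (mix_mag + eps)

# stand-in magnitude spectrograms (freq bins x frames)
vocals_mag = np.abs(np.random.randn(513, 100))
drums_mag = np.abs(np.random.randn(513, 100))
mix_mag = vocals_mag + drums_mag
# oracle separation: apply the ideal mask to the mixture
oracle_vocals = ideal_ratio_mask(vocals_mag, mix_mag) * mix_mag
```

Scoring the oracle output against the true source gives an upper bound that serves as the proxy for trained-model performance.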
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a time-delay neural network (TDNN) architecture.
We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
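A TDNN layer can be written as a dilated 1D convolution over frames; the sketch below assumes illustrative context sizes, and indicates the transfer step (reusing ASR-trained layers) only with a frozen front end.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """A TDNN layer expressed as a dilated 1D convolution over the time axis."""
    def __init__(self, in_dim: int, out_dim: int, context: int = 2, dilation: int = 1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2 * context + 1,
                              dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.conv(x))  # x: (batch, features, frames)

# transfer idea (illustrative): freeze layers pretrained on ASR, retrain the head
frontend = nn.Sequential(TDNNLayer(40, 256), TDNNLayer(256, 256, dilation=2))
for p in frontend.parameters():
    p.requires_grad = False
emotion_head = nn.Linear(256, 4)  # e.g. 4 emotion classes, an assumption
feats = frontend(torch.randn(2, 40, 200)).mean(dim=-1)  # average over frames
logits = emotion_head(feats)
```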
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.