Leveraged Mel spectrograms using Harmonic and Percussive Components in
Speech Emotion Recognition
- URL: http://arxiv.org/abs/2312.10949v1
- Date: Mon, 18 Dec 2023 05:55:46 GMT
- Title: Leveraged Mel spectrograms using Harmonic and Percussive Components in
Speech Emotion Recognition
- Authors: David Hason Rudd, Huan Huo, Guandong Xu
- Abstract summary: This work explores the effects of the harmonic and percussive components of Mel spectrograms in Speech Emotion Recognition (SER)
We attempt to leverage the Mel spectrogram by decomposing distinguishable acoustic features for exploitation in our proposed architecture.
This study specifically focuses on effective data augmentation techniques for building an enriched hybrid-based feature map.
- Score: 15.919990281329085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech Emotion Recognition (SER) affective technology enables
intelligent embedded devices to interact with sensitivity. Similarly, call
centre employees recognise customers' emotions from pitch, energy, and tone
of voice, and adjust their own speech accordingly to deliver a high-quality
interaction. This work
explores, for the first time, the effects of the harmonic and percussive
components of Mel spectrograms in SER. We attempt to leverage the Mel
spectrogram by decomposing distinguishable acoustic features for exploitation
in our proposed architecture, which includes a novel feature map generator
algorithm, a CNN-based network feature extractor and a multi-layer perceptron
(MLP) classifier. This study specifically focuses on effective data
augmentation techniques for building an enriched hybrid-based feature map. This
process results in a function that outputs a 2D image so that it can be used as
input data for a pre-trained CNN-VGG16 feature extractor. Furthermore, we also
investigate other acoustic features such as MFCCs, chromagram, spectral
contrast, and the tonnetz to assess our proposed framework. A test accuracy of
92.79% on the Berlin EMO-DB database is achieved. This result surpasses those
reported in previous works using CNN-VGG16.
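The listing itself includes no code, so the following is a minimal sketch of the kind of harmonic/percussive decomposition and hybrid feature map the abstract describes, written with librosa. The file name, sample rate, n_mels, and the channel-stacking scheme are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import librosa

# Illustrative assumptions: the file name, sample rate, and n_mels
# below are not taken from the paper.
y, sr = librosa.load("speech_sample.wav", sr=16000)

# Harmonic/percussive source separation (HPSS) on the waveform.
y_harmonic, y_percussive = librosa.effects.hpss(y)

def mel_db(signal):
    """Log-scaled Mel spectrogram, so each view behaves like an image channel."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)

mel_full = mel_db(y)
mel_h = mel_db(y_harmonic)
mel_p = mel_db(y_percussive)

# Auxiliary features the abstract also evaluates.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
tonnetz = librosa.feature.tonnetz(y=y_harmonic, sr=sr)

# One plausible 2D "image" feature map: stack the full, harmonic, and
# percussive Mel views as three channels, ready to be resized to the
# 224x224x3 input a pre-trained VGG16 feature extractor expects.
feature_map = np.stack([mel_full, mel_h, mel_p], axis=-1)
```

In the paper's pipeline this map is produced by the proposed feature map generator, passed through the pre-trained CNN-VGG16 extractor, and classified by an MLP; the stacking above is only one plausible arrangement of the decomposed views.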
Related papers
- Keypoint Description by Symmetry Assessment -- Applications in
Biometrics [49.547569925407814]
We present a model-based feature extractor to describe neighborhoods around keypoints by finite expansion.
The iso-curves of such functions are highly symmetric w.r.t. the origin (a keypoint) and the estimated parameters have well defined geometric interpretations.
arXiv Detail & Related papers (2023-11-03T00:49:25Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker
Embedding and Vision Transformers [0.0]
This paper develops a new learning solution for Speech Emotion Recognition.
It is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding.
Experiments have been performed on several benchmarks in a cross-corpus setting.
arXiv Detail & Related papers (2022-11-04T10:49:44Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Speech Emotion Recognition with Co-Attention based Multi-level Acoustic
Information [21.527784717450885]
Speech Emotion Recognition aims to help machines understand humans' subjective emotions from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities (a minimal late-fusion sketch appears after this list).
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation (a minimal masking sketch of this kind of augmentation appears after this list).
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Optimizing Speech Emotion Recognition using Manta-Ray Based Feature
Selection [1.4502611532302039]
We show that concatenating features extracted by different existing feature extraction methods can boost classification accuracy.
We also present a novel application of Manta Ray optimization to speech emotion recognition, achieving a state-of-the-art result.
arXiv Detail & Related papers (2020-09-18T16:09:34Z)
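Spectrogram augmentation, named in the transfer learning entry above, is commonly realised as SpecAugment-style time and frequency masking. Below is a minimal NumPy sketch of that idea; the function name, mask counts, and mask widths are illustrative assumptions, not values taken from that paper.

```python
import numpy as np

def mask_spectrogram(spec, num_freq_masks=2, num_time_masks=2,
                     max_freq_width=8, max_time_width=16, rng=None):
    """Zero out random frequency bands and time spans of a
    (n_mels, n_frames) spectrogram, SpecAugment-style."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(num_freq_masks):
        width = rng.integers(1, max_freq_width + 1)
        start = rng.integers(0, max(1, n_mels - width))
        out[start:start + width, :] = 0.0
    for _ in range(num_time_masks):
        width = rng.integers(1, max_time_width + 1)
        start = rng.integers(0, max(1, n_frames - width))
        out[:, start:start + width] = 0.0
    return out

# Example: augment a Mel spectrogram before feeding a CNN extractor.
# augmented = mask_spectrogram(mel_full)
```

Similarly, the late fusion named in the multimodal transfer-learning entry is, in its simplest form, a weighted average of class probabilities from independently trained models. The sketch below shows that generic pattern under an assumed equal weighting; it is not the authors' exact fusion scheme.

```python
import numpy as np

def late_fusion(speech_logits, text_logits, weight=0.5):
    """Blend class probabilities from two independently trained models."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return weight * softmax(speech_logits) + (1 - weight) * softmax(text_logits)
```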