Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition
- URL: http://arxiv.org/abs/2309.12714v1
- Date: Fri, 22 Sep 2023 08:54:06 GMT
- Title: Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition
- Authors: Amirali Soltani Tehrani, Niloufar Faridani, Ramin Toosi
- Abstract summary: This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
- Score: 1.3812010983144798
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Speech Emotion Recognition (SER) plays a pivotal role in enhancing
human-computer interaction by enabling a deeper understanding of emotional
states across a wide range of applications, contributing to more empathetic and
effective communication. This study proposes an innovative approach that
integrates self-supervised feature extraction with supervised classification
for emotion recognition from small audio segments. In the preprocessing step,
to eliminate the need for hand-crafted audio features, we employed a
self-supervised feature extractor, based on the Wav2Vec model, to capture
acoustic features from audio data. The output feature maps of the
preprocessing step are then fed to a custom-designed Convolutional Neural
Network (CNN)-based model to perform emotion classification. Using the ShEMO
dataset as our testing ground, the proposed method surpasses two baseline
methods, i.e., a support vector machine classifier and transfer learning from
a pretrained CNN. Comparing the proposed method to state-of-the-art methods on
the SER task indicates its superiority. Our findings underscore the pivotal
role of deep unsupervised feature learning in advancing SER, offering enhanced
emotional comprehension in human-computer interaction.
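The abstract describes a two-stage pipeline: frozen self-supervised (Wav2Vec-style) feature maps over a short audio segment, followed by a CNN classifier. A minimal sketch of such a classifier head is given below; the feature dimension (512), layer sizes, and the six-class output (matching ShEMO's emotion categories) are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Hypothetical CNN head over frozen self-supervised feature maps.

    Expects feature maps of shape (batch, time, feat_dim), e.g. as
    produced by a Wav2Vec-style extractor on a short audio segment.
    """
    def __init__(self, feat_dim=512, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size vector
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, feats):          # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)      # -> (batch, feat_dim, time)
        x = self.conv(x).squeeze(-1)   # -> (batch, 64)
        return self.fc(x)              # -> (batch, n_classes) logits

# Stand-in for extractor output: batch of 2 segments, 49 frames, 512 dims.
feats = torch.randn(2, 49, 512)
logits = EmotionCNN()(feats)
print(logits.shape)  # torch.Size([2, 6])
```

Pooling over the time axis lets the same head handle variable-length segments, which is one plausible way to realize classification "from small audio segments" as the abstract describes.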
Related papers
- Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT [0.0]
We study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice.
The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB.
arXiv Detail & Related papers (2024-11-05T10:06:40Z)
- Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences [4.740624855896404]
We propose a contrastive learning framework utilizing selective strong augmentation for self-supervised gait-based emotion representation.
Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
arXiv Detail & Related papers (2024-05-08T09:13:10Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performances on identity-free SER and a better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy compared to the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Continuous Emotion Recognition via Deep Convolutional Autoencoder and Support Vector Regressor [70.2226417364135]
It is crucial that the machine should be able to recognize the emotional state of the user with high accuracy.
Deep neural networks have been used with great success in recognizing emotions.
We present a new model for continuous emotion recognition based on facial expression recognition.
arXiv Detail & Related papers (2020-01-31T17:47:16Z)
- Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends [10.176394550114411]
The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning.
Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech.
arXiv Detail & Related papers (2020-01-02T10:12:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.