Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation
- URL: http://arxiv.org/abs/2108.02510v2
- Date: Sun, 8 Aug 2021 19:53:52 GMT
- Title: Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation
- Authors: Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha, Ram D. Sriram
- Abstract summary: Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
- Score: 56.264157127549446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech emotion recognition (SER) is a challenging task that plays a
crucial role in natural human-computer interaction. One of the main challenges
in SER is data scarcity, i.e., insufficient amounts of carefully labeled data
to build and fully explore complex deep learning models for emotion
classification. This paper aims to address this challenge using a transfer
learning strategy combined with spectrogram augmentation. Specifically, we
propose a transfer learning approach that leverages a pre-trained residual
network (ResNet) model including a statistics pooling layer from speaker
recognition trained using large amounts of speaker-labeled data. The statistics
pooling layer enables the model to efficiently process variable-length input,
thereby eliminating the need for sequence truncation which is commonly used in
SER systems. In addition, we adopt a spectrogram augmentation technique to
generate additional training data samples by applying random time-frequency
masks to log-mel spectrograms to mitigate overfitting and improve the
generalization of emotion recognition models. We evaluate the effectiveness of
our proposed approach on the interactive emotional dyadic motion capture
(IEMOCAP) dataset. Experimental results indicate that the transfer learning and
spectrogram augmentation approaches improve the SER performance, and when
combined achieve state-of-the-art results.
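The two mechanisms named in the abstract, random time-frequency masking and statistics pooling, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the choice of zero as the mask fill value, the use of a single mask per axis, and the mask-width parameters are all assumptions made for illustration. In the paper, pooling operates on frame-level ResNet features, whereas here it is applied directly to toy input frames.

```python
import math
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng=None):
    """Return a masked copy of spec (a list of frames, each a list of mel bins).

    One random frequency band and one random time span are set to 0.0.
    The fill value and the single-mask-per-axis policy are assumptions.
    """
    rng = rng or random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0])
    masked = [list(frame) for frame in spec]  # copy so the input is untouched

    # Frequency mask: zero a contiguous band of mel bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in masked:
        for b in range(f_start, f_start + f_width):
            frame[b] = 0.0

    # Time mask: zero a contiguous run of frames.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * num_bins
    return masked


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over time.

    The output length is 2 * feature_dim regardless of how many frames
    are given, which is why no sequence truncation is needed.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds
```

Calling `statistics_pooling` on a 20-frame input and on a 7-frame input with the same feature dimension yields vectors of equal length, which is the property the abstract relies on to handle variable-length utterances.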
Related papers
- Joint-Embedding Masked Autoencoder for Self-supervised Learning of
Dynamic Functional Connectivity from the Human Brain [18.165807360855435]
Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks.
We introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision.
arXiv Detail & Related papers (2024-03-11T04:49:41Z) - EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z) - A Comparative Study of Data Augmentation Techniques for Deep Learning
Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
arXiv Detail & Related papers (2022-11-09T17:27:03Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z) - Adversarial Imitation Learning with Trajectorial Augmentation and
Correction [61.924411952657756]
We introduce a novel augmentation method which preserves the success of the augmented trajectories.
We develop an adversarial data augmented imitation architecture to train an imitation agent using synthetic experts.
Experiments show that our data augmentation strategy can improve accuracy and convergence time of adversarial imitation.
arXiv Detail & Related papers (2021-03-25T14:49:32Z) - A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)