CopyPaste: An Augmentation Method for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2010.14602v2
- Date: Thu, 11 Feb 2021 16:04:35 GMT
- Title: CopyPaste: An Augmentation Method for Speech Emotion Recognition
- Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano
Moro-Velazquez, Najim Dehak
- Abstract summary: CopyPaste is a perceptually motivated novel augmentation procedure for speech emotion recognition.
Three CopyPaste schemes are tested on two deep learning models.
Experiments on noisy test sets suggested that CopyPaste is effective even in noisy test conditions.
- Score: 36.61242392144022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation is a widely used strategy for training robust machine
learning models. It partially alleviates the problem of limited data for tasks
like speech emotion recognition (SER), where collecting data is expensive and
challenging. This study proposes CopyPaste, a perceptually motivated novel
augmentation procedure for SER. Assuming that the presence of emotions other
than neutral dictates a speaker's overall perceived emotion in a recording,
concatenation of an emotional (emotion E) and a neutral utterance can still be
labeled with emotion E. We hypothesize that SER performance can be improved
using these concatenated utterances in model training. To verify this, three
CopyPaste schemes are tested on two deep learning models: one trained
independently and another using transfer learning from an x-vector model, a
speaker recognition model. We observed that all three CopyPaste schemes improve
SER performance on all three datasets considered: MSP-Podcast, Crema-D, and
IEMOCAP. Additionally, CopyPaste performs better than noise augmentation, and
using the two together improves SER performance further. Our experiments on
noisy test sets suggested that CopyPaste is effective even in noisy test
conditions.
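The core CopyPaste operation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's exact pipeline: the function name, the toy waveforms, and the `neutral_first` flag (standing in for the different concatenation orderings the three schemes vary over) are assumptions made for the example.

```python
# Minimal sketch of the CopyPaste idea: concatenating an emotional
# utterance with a neutral one yields an augmented training utterance
# that inherits the (non-neutral) emotion label.
import numpy as np


def copy_paste(emotional_wav: np.ndarray, neutral_wav: np.ndarray,
               emotion_label: str, neutral_first: bool = False):
    """Concatenate an emotional and a neutral waveform; the result keeps
    the emotional label, per the paper's perceptual assumption."""
    parts = (neutral_wav, emotional_wav) if neutral_first else (emotional_wav, neutral_wav)
    return np.concatenate(parts), emotion_label


# Toy 1-D "waveforms" standing in for real utterances (16 kHz samples).
angry = np.ones(16000)     # placeholder for an angry utterance
neutral = np.zeros(8000)   # placeholder for a neutral utterance

augmented, label = copy_paste(angry, neutral, "angry")
print(augmented.shape, label)  # (24000,) angry
```

In practice the augmented utterances would be mixed into the training set alongside the originals; the paper additionally combines this with noise augmentation for a further gain.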
Related papers
- A Comparative Study of Pre-trained Speech and Audio Embeddings for
Speech Emotion Recognition [0.0]
Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings
arXiv Detail & Related papers (2023-04-22T19:56:35Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Recognizing More Emotions with Less Data Using Self-supervised Transfer
Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as low as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z) - A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition [0.0]
We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture.
We achieve significantly higher accuracy compared to the state of the art, using five-fold cross validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z) - x-vectors meet emotions: A study on dependencies between emotion and
speaker recognition [38.181055783134006]
We show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning.
For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models.
We present results on the effect of emotion on speaker verification.
arXiv Detail & Related papers (2020-02-12T15:13:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.