A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2008.02863v2
- Date: Sat, 15 Aug 2020 18:56:24 GMT
- Title: A Transfer Learning Method for Speech Emotion Recognition from Automatic
Speech Recognition
- Authors: Sitong Zhou and Homayoon Beigi
- Abstract summary: We show a transfer learning method in speech emotion recognition based on a Time-Delay Neural Network architecture.
Using five-fold cross-validation, we achieve significantly higher accuracy than the state of the art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a transfer learning method in speech emotion recognition
based on a Time-Delay Neural Network (TDNN) architecture. A major challenge in
the current speech-based emotion detection research is data scarcity. The
proposed method resolves this problem by applying transfer learning techniques
in order to leverage data from the automatic speech recognition (ASR) task for
which ample data is available. Our experiments also show the advantage of
speaker-class adaptation modeling techniques by adopting identity-vector
(i-vector) based features in addition to standard Mel-Frequency Cepstral
Coefficient (MFCC) features. We show that the transfer-learned models
significantly outperform models trained without ASR pretraining. The
experiments were performed on the publicly available IEMOCAP dataset, which
provides 12 hours of emotional speech data. Transfer learning was initialized
using the Ted-Lium v.2 speech dataset, which provides 207 hours of audio with
corresponding transcripts. Using five-fold cross-validation, we achieve
significantly higher accuracy than the state of the art. Using only speech, we
obtain an accuracy of 71.7% over the anger, excitement, sadness, and neutral
emotion classes.
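No code accompanies this listing, so the following is a minimal PyTorch sketch of the described pipeline; the layer sizes, i-vector dimensionality, and checkpoint name are assumptions rather than the paper's exact configuration. Frame-level MFCCs concatenated with a per-utterance i-vector (repeated across frames) pass through dilated 1-D convolutions (the time-delay layers of a TDNN), are pooled over time, and are classified into the four emotion classes; the encoder weights would be initialized from ASR pretraining on Ted-Lium v.2.

    import torch
    import torch.nn as nn

    class EmotionTDNN(nn.Module):
        # Sketch only: 40-dim MFCCs + 100-dim i-vector per frame (assumed sizes).
        def __init__(self, feat_dim=140, hidden=512, n_emotions=4):
            super().__init__()
            # Time-delay layers = 1-D convolutions with growing dilation,
            # widening the temporal context at each layer.
            self.encoder = nn.Sequential(
                nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
            )
            self.head = nn.Linear(hidden, n_emotions)  # anger/excitement/sadness/neutral

        def forward(self, x):                # x: (batch, feat_dim, frames)
            h = self.encoder(x).mean(dim=2)  # temporal pooling -> utterance vector
            return self.head(h)

    model = EmotionTDNN()
    # Transfer step: load ASR-pretrained encoder weights (file name hypothetical),
    # then fine-tune the whole network on IEMOCAP emotion labels.
    # model.encoder.load_state_dict(torch.load("asr_encoder_tedlium.pt"))
    logits = model(torch.randn(2, 140, 300))  # 2 utterances, 300 frames each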
Related papers
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline with an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating its effectiveness in identifying emotional states within speech signals.
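As a rough illustration only (EmoDiarize's actual architecture, input features, and class count are not given here and are assumed), a compact CNN of this kind would score each diarized speaker segment's spectrogram:

    import torch
    import torch.nn as nn

    # Hypothetical CNN emotion identifier applied per diarized segment.
    cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 4),  # number of emotion classes is an assumption
    )
    scores = cnn(torch.randn(1, 1, 64, 200))  # (batch, channel, mel bins, frames)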
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
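A minimal sketch of the contextual-utterance idea, assuming context is supplied by splicing the neighboring utterances' feature frames around the current one (the paper's exact mechanism, and its streaming dual-mode variant, may differ):

    import torch

    def contextual_input(prev_feats, cur_feats, next_feats):
        # Each argument: (frames, feat_dim) features of adjacent utterances.
        # Splicing lets the ASR encoder attend across utterance boundaries.
        return torch.cat([prev_feats, cur_feats, next_feats], dim=0)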
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
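A minimal sketch of decision-level (late) fusion; the mixing weight and posterior averaging are assumptions rather than the paper's exact scheme:

    import torch

    def late_fusion(speech_logits, text_logits, alpha=0.5):
        # Average class posteriors from the fine-tuned speech model and the
        # BERT-based text model; alpha is a hypothetical mixing weight.
        p_speech = torch.softmax(speech_logits, dim=-1)
        p_text = torch.softmax(text_logits, dim=-1)
        return alpha * p_speech + (1.0 - alpha) * p_text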
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER).
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
We generate targets for the sentiment classification using a text-to-sentiment model trained on publicly available data.
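A hedged sketch of the joint objective, assuming a CTC loss for the ASR task, cross-entropy on the text-derived sentiment labels, and an assumed weighting factor:

    import torch.nn.functional as F

    def multitask_loss(asr_log_probs, transcripts, input_lens, target_lens,
                       sent_logits, sent_labels, lam=0.3):
        # lam is a hypothetical task weight; sentiment labels come from a
        # text-to-sentiment model, as the abstract describes.
        asr_loss = F.ctc_loss(asr_log_probs, transcripts, input_lens, target_lens)
        sent_loss = F.cross_entropy(sent_logits, sent_labels)
        return asr_loss + lam * sent_loss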
arXiv Detail & Related papers (2022-01-27T22:20:28Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method reduces the WER on British English speech data from 14.55% to 10.36%, compared to a baseline model trained only on US English data.
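A rough sketch of the self-learning setup, under the assumption that each entry of an N-best ASR hypothesis list serves as a weighted pseudo-label task combined MTL-style (the paper's actual weighting is not given here):

    import torch

    def self_learning_loss(seq2seq_loss, features, nbest_hypotheses, weights):
        # seq2seq_loss(features, hyp) -> scalar loss of decoding hyp from features.
        losses = [w * seq2seq_loss(features, hyp)
                  for hyp, w in zip(nbest_hypotheses, weights)]
        return torch.stack(losses).sum()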
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
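A simplified sketch of the target switch, using cosine similarity as a stand-in for wav2vec 2.0's actual quantized contrastive objective:

    import torch.nn.functional as F

    def switch_loss(c_orig, c_noisy, q_orig, q_noisy):
        # Context vectors of each view predict the *other* view's quantized
        # targets, pushing both toward noise-invariant representations.
        loss_o = -F.cosine_similarity(c_orig, q_noisy, dim=-1).mean()
        loss_n = -F.cosine_similarity(c_noisy, q_orig, dim=-1).mean()
        return loss_o + loss_n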
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
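A minimal sketch of SpecAugment-style augmentation of the kind the abstract suggests, using torchaudio's masking transforms with assumed mask sizes:

    import torch
    import torchaudio.transforms as T

    # Random frequency and time masks applied to each training spectrogram.
    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=15),
        T.TimeMasking(time_mask_param=35),
    )
    spec_aug = augment(torch.randn(1, 64, 200))  # (channel, mel bins, frames)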
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues through an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
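A toy sketch of attribute selection on a factorized latent vector; the node layout (emotion nodes first, speaker-identity nodes second) and the split point are assumptions:

    import torch

    def select_attributes(z, keep="emotion", split=64):
        # Zeroing one node group yields identity-free SER features or
        # emotionless SV features from the same LR-VAE-style latent.
        z = z.clone()
        if keep == "emotion":
            z[..., split:] = 0.0  # drop speaker-identity nodes
        else:
            z[..., :split] = 0.0  # drop emotion nodes
        return z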
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we reach higher accuracy than a strong baseline trained on 8 times more data.
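A hedged sketch of the low-resource recipe the title implies: freeze a pretrained encoder (a GRU stands in for the unnamed self-supervised model) and train only a small emotion head on the few labeled examples:

    import torch.nn as nn

    encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # stand-in
    for p in encoder.parameters():
        p.requires_grad = False   # keep the pretrained representation fixed
    head = nn.Linear(256, 4)      # small trainable emotion classifier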
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.