x-vectors meet emotions: A study on dependencies between emotion and
speaker recognition
- URL: http://arxiv.org/abs/2002.05039v1
- Date: Wed, 12 Feb 2020 15:13:07 GMT
- Title: x-vectors meet emotions: A study on dependencies between emotion and
speaker recognition
- Authors: Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim
Dehak
- Abstract summary: We show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning.
For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models.
We present results on the effect of emotion on speaker verification.
- Score: 38.181055783134006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we explore the dependencies between speaker recognition and
emotion recognition. We first show that knowledge learned for speaker
recognition can be reused for emotion recognition through transfer learning.
Then, we show the effect of emotion on speaker recognition. For emotion
recognition, we show that using a simple linear model is enough to obtain good
performance on the features extracted from pre-trained models such as the
x-vector model. Then, we improve emotion recognition performance by fine-tuning
for emotion classification. We evaluated our experiments on three different
types of datasets: IEMOCAP, MSP-Podcast, and Crema-D. By fine-tuning, we
obtained absolute improvements of 30.40%, 7.99%, and 8.61% on IEMOCAP,
MSP-Podcast, and Crema-D, respectively, over a baseline model with no
pre-training. Finally, we present results on the effect of emotion on speaker
verification. We observed that speaker verification performance is sensitive
to the emotional state of the test speaker. We found that trials with angry
utterances performed worst in all
three datasets. We hope our analysis will initiate a new line of research in
the speaker recognition community.
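As a rough illustration of the "simple linear model on pre-trained features" setup described in the abstract, the sketch below trains a linear emotion classifier on pre-extracted x-vector embeddings. It is not the authors' code: the file names, embedding dimensionality, and label layout are hypothetical placeholders, and the x-vector extraction and fine-tuning steps are not reproduced here.

```python
# Minimal sketch (not the authors' pipeline): linear emotion classifier
# trained on frozen x-vector embeddings extracted elsewhere.
# "xvectors.npy" and "emotion_labels.npy" are hypothetical placeholder files.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical inputs: one fixed-dimensional x-vector per utterance and an
# integer emotion label per utterance (e.g., 0=angry, 1=happy, 2=neutral, 3=sad).
X = np.load("xvectors.npy")          # shape: (num_utterances, embedding_dim)
y = np.load("emotion_labels.npy")    # shape: (num_utterances,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A simple linear model on the frozen embeddings, mirroring the setup in
# which pre-trained speaker-recognition features are reused for emotion.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("emotion accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```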
Related papers
- Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance in various Precision@K metrics.
arXiv Detail & Related papers (2023-10-03T13:06:58Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities (a minimal late-fusion sketch appears after this list).
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset containing 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition [9.856709988128515]
We propose a novel multi-classifier interactive learning (MCIL) method to address ambiguous speech emotions.
MCIL mimics several individuals who have inconsistent perceptions of ambiguous emotions and constructs new ambiguous labels.
Experiments show that MCIL not only improves each classifier's performance but also raises their recognition consistency from moderate to substantial.
arXiv Detail & Related papers (2020-12-10T02:58:34Z)
- Embedded Emotions -- A Data Driven Approach to Learn Transferable Feature Representations from Raw Speech Input for Emotion Recognition [1.4556324908347602]
We investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition.
Our results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
arXiv Detail & Related papers (2020-09-30T09:18:31Z)
- Meta Transfer Learning for Emotion Recognition [42.61707533351803]
We propose a PathNet-based transfer learning method that is able to transfer emotional knowledge learned from one visual/audio emotion domain to another visual/audio emotion domain.
Our proposed system improves emotion recognition performance, substantially outperforming recently proposed transfer learning methods based on fine-tuning or pre-trained models.
arXiv Detail & Related papers (2020-06-23T00:25:28Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance, was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
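The multimodal emotion recognition entry above describes late fusion of speech and text models. As a rough illustration only, and not the cited paper's implementation, the sketch below averages per-class probabilities from two hypothetical modality-specific models; the weighting, class layout, and example scores are assumptions.

```python
# Minimal late-fusion sketch (not the cited paper's code): combine
# utterance-level emotion probabilities from a speech model and a text model
# by weighted averaging. Both probability arrays are hypothetical inputs
# produced elsewhere (e.g., a fine-tuned x-vector model and a BERT model).
import numpy as np

def late_fusion(speech_probs: np.ndarray,
                text_probs: np.ndarray,
                speech_weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities from two modalities."""
    assert speech_probs.shape == text_probs.shape
    fused = speech_weight * speech_probs + (1.0 - speech_weight) * text_probs
    return fused.argmax(axis=-1)  # predicted emotion class per utterance

# Example with made-up scores over 4 emotion classes for 2 utterances.
speech_probs = np.array([[0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1]])
text_probs = np.array([[0.2, 0.5, 0.2, 0.1], [0.1, 0.7, 0.1, 0.1]])
print(late_fusion(speech_probs, text_probs))
```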
This list is automatically generated from the titles and abstracts of the papers in this site.