Speaker Attentive Speech Emotion Recognition
- URL: http://arxiv.org/abs/2104.07288v1
- Date: Thu, 15 Apr 2021 07:59:37 GMT
- Title: Speaker Attentive Speech Emotion Recognition
- Authors: Clément Le Moine, Nicolas Obin and Axel Roebel
- Abstract summary: Speech Emotion Recognition (SER) has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs).
We present novel work based on the idea of teaching the emotion recognition network about speaker identity.
- Score: 11.92436948211501
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech Emotion Recognition (SER) has seen significant improvements in recent
years with the advent of Deep Neural Networks (DNNs). However, even the most
successful methods still fall short when adaptation to specific speakers and
scenarios is needed, inevitably leading to poorer performance than humans. In
this paper, we present novel work based on the idea of teaching the emotion
recognition network about speaker identity. Our system combines two ACRNN
classifiers dedicated to speaker and emotion recognition, respectively. The
former informs the latter through a Self Speaker Attention (SSA) mechanism,
which is shown to considerably help the network focus on the emotional
information in the speech signal. Experiments on the Att-HACK database of
social attitudes and on the IEMOCAP corpus demonstrate the effectiveness of
the proposed method, which achieves state-of-the-art performance in terms of
unweighted average recall.
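As a rough illustration of the idea only (not the authors' implementation), the PyTorch sketch below pairs two recurrent encoders standing in for the ACRNN classifiers and lets the speaker branch weight the emotion branch's frame-level features through an attention module. All layer sizes, the exact attention formulation, and the feature front-end are assumptions.
```python
# Minimal sketch of a speaker-attentive SER setup: a speaker branch and an
# emotion branch share the input features, and a speaker-informed attention
# pools the emotion branch's frames before classification.
import torch
import torch.nn as nn


class RecurrentEncoder(nn.Module):
    """Stands in for an ACRNN branch: frame features -> frame embeddings."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        out, _ = self.rnn(x)               # (batch, frames, 2 * hidden)
        return out


class SpeakerAttentiveSER(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_speakers=20, n_emotions=4):
        super().__init__()
        self.spk_enc = RecurrentEncoder(feat_dim, hidden)
        self.emo_enc = RecurrentEncoder(feat_dim, hidden)
        self.spk_head = nn.Linear(2 * hidden, n_speakers)
        # Attention scores computed from speaker and emotion frame embeddings
        # (assumed formulation; the paper's SSA mechanism may differ).
        self.attn = nn.Linear(4 * hidden, 1)
        self.emo_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):
        spk_frames = self.spk_enc(x)                        # (B, T, 2H)
        emo_frames = self.emo_enc(x)                        # (B, T, 2H)
        spk_logits = self.spk_head(spk_frames.mean(dim=1))  # speaker branch output
        # Speaker-informed attention over the emotion branch's frames.
        scores = self.attn(torch.cat([spk_frames, emo_frames], dim=-1))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)
        pooled = (weights * emo_frames).sum(dim=1)          # (B, 2H)
        emo_logits = self.emo_head(pooled)
        return emo_logits, spk_logits


# Usage on a dummy batch of 40-dim frame features.
model = SpeakerAttentiveSER()
emo_logits, spk_logits = model(torch.randn(8, 200, 40))
print(emo_logits.shape, spk_logits.shape)  # torch.Size([8, 4]) torch.Size([8, 20])
```
Unweighted average recall, the metric cited above, is the mean of the per-class recalls (macro-averaged recall), e.g. sklearn.metrics.recall_score(y_true, y_pred, average="macro").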
Related papers
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive
Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
- Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Embedded Emotions -- A Data Driven Approach to Learn Transferable
Feature Representations from Raw Speech Input for Emotion Recognition [1.4556324908347602]
We investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition.
Our results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
arXiv Detail & Related papers (2020-09-30T09:18:31Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations
for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Multi-Task Learning with Auxiliary Speaker Identification for
Conversational Emotion Recognition [32.439818455554885]
We exploit speaker identification (SI) as an auxiliary task to enhance the utterance representation in conversations.
By this method, we can learn better speaker-aware contextual representations from the additional SI corpus.
Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER.
arXiv Detail & Related papers (2020-03-03T12:25:03Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)