Multi-Task Learning with Auxiliary Speaker Identification for
Conversational Emotion Recognition
- URL: http://arxiv.org/abs/2003.01478v2
- Date: Thu, 5 Mar 2020 01:35:36 GMT
- Title: Multi-Task Learning with Auxiliary Speaker Identification for
Conversational Emotion Recognition
- Authors: Jingye Li, Meishan Zhang, Donghong Ji, Yijiang Liu
- Abstract summary: We exploit speaker identification (SI) as an auxiliary task to enhance the utterance representation in conversations.
By this method, we can learn better speaker-aware contextual representations from the additional SI corpus.
Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER.
- Score: 32.439818455554885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational emotion recognition (CER) has attracted increasing interest
in the natural language processing (NLP) community. Unlike vanilla emotion
recognition, effective speaker-sensitive utterance representation is
one major challenge for CER. In this paper, we exploit speaker identification
(SI) as an auxiliary task to enhance the utterance representation in
conversations. By this method, we can learn better speaker-aware contextual
representations from the additional SI corpus. Experiments on two benchmark
datasets demonstrate that the proposed architecture is highly effective for
CER, obtaining new state-of-the-art results on two datasets.
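The core recipe, a shared utterance encoder trained jointly on emotion labels
(main task) and speaker labels (auxiliary task), can be illustrated with a
minimal sketch. This is not the authors' exact architecture: the BiGRU
encoder, the head sizes, and the auxiliary weight lambda_si are assumptions
for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskCER(nn.Module):
    """Shared utterance encoder with an emotion head (main task)
    and a speaker-identification head (auxiliary task)."""

    def __init__(self, input_dim, hidden_dim, num_emotions, num_speakers):
        super().__init__()
        # Shared contextual encoder over the utterance sequence.
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden_dim, num_emotions)
        self.speaker_head = nn.Linear(2 * hidden_dim, num_speakers)

    def forward(self, utterance_feats):
        # utterance_feats: (batch, num_utterances, input_dim)
        context, _ = self.encoder(utterance_feats)
        return self.emotion_head(context), self.speaker_head(context)

def joint_loss(emo_logits, spk_logits, emo_labels, spk_labels, lambda_si=0.5):
    # The auxiliary SI loss regularizes the shared encoder; lambda_si
    # is an assumed weighting hyperparameter.
    ce = nn.CrossEntropyLoss()
    return (ce(emo_logits.flatten(0, 1), emo_labels.flatten())
            + lambda_si * ce(spk_logits.flatten(0, 1), spk_labels.flatten()))
```

Because the SI head serves only to shape the shared encoder, it can be trained
on a separate speaker-labeled corpus and dropped at inference time, which is
what makes SI attractive as an auxiliary task.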
Related papers
- CKERC : Joint Large Language Models with Commonsense Knowledge for
Emotion Recognition in Conversation [0.0]
Emotion recognition in conversation (ERC) is a task which predicts the emotion of an utterance in the context of a conversation.
We propose CKERC, a novel framework that joins large language models with commonsense knowledge for emotion recognition in conversation.
arXiv Detail & Related papers (2024-03-12T02:37:11Z)
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase their highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- Deep learning of segment-level feature representation for speech emotion
recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies.
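A rough sketch of that two-step pipeline, assuming precomputed
128-dimensional VGGish segment embeddings and a simple additive attention
over the BiGRU states (the layer sizes and the attention form are
assumptions, not the paper's exact model):

```python
import torch
import torch.nn as nn

class AttentiveContextGRU(nn.Module):
    """Bi-directional GRU over per-utterance segment features with
    additive attention pooling (a sketch, not the paper's exact model)."""

    def __init__(self, feat_dim=128, hidden_dim=64, num_emotions=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, segments):
        # segments: (batch, num_segments, feat_dim), e.g. VGGish embeddings
        states, _ = self.gru(segments)
        weights = torch.softmax(self.attn(states), dim=1)  # attend over segments
        pooled = (weights * states).sum(dim=1)             # weighted context vector
        return self.classifier(pooled)
```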
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
- Beyond Isolated Utterances: Conversational Emotion Recognition [33.52961239281893]
Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance.
We propose several approaches for conversational emotion recognition (CER) by treating it as a sequence labeling task.
We investigated a transformer architecture for CER and compared it with ResNet-34 and BiLSTM architectures in both contextual and context-less scenarios.
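Framing CER as sequence labeling means the model emits one emotion label per
utterance while attending over the whole conversation. A minimal
transformer-encoder version, with all hyperparameters assumed for
illustration:

```python
import torch
import torch.nn as nn

class SequenceLabelingCER(nn.Module):
    """Labels every utterance in a conversation jointly, so each
    prediction can draw on the surrounding conversational context."""

    def __init__(self, utt_dim=256, num_emotions=4, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=utt_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.tagger = nn.Linear(utt_dim, num_emotions)

    def forward(self, utterances):
        # utterances: (batch, num_utterances, utt_dim) pre-encoded vectors
        context = self.encoder(utterances)
        return self.tagger(context)  # one emotion logit vector per utterance
```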
arXiv Detail & Related papers (2021-09-13T16:40:35Z)
- Speaker Attentive Speech Emotion Recognition [11.92436948211501]
The Speech Emotion Recognition (SER) task has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs).
We present novel work based on the idea of teaching the emotion recognition network about speaker identity.
arXiv Detail & Related papers (2021-04-15T07:59:37Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
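The summary does not spell out the training loop, so the sketch below shows
only the generic policy-gradient pattern it implies: an emotion discriminator
scores the synthesized speech, and that score is used as a reward. The
`sample` method, the REINFORCE surrogate loss, and all names here are
assumptions.

```python
import torch

def rl_step(tts_model, emotion_discriminator, text, target_emotion, optimizer):
    """One REINFORCE-style update: reward synthesized speech in proportion
    to how confidently a discriminator recognizes the target emotion."""
    audio, log_prob = tts_model.sample(text)      # assumed stochastic decoder
    with torch.no_grad():
        probs = emotion_discriminator(audio)      # (num_emotions,) softmax output
        reward = probs[target_emotion]
    loss = -reward * log_prob                     # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```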
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
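A common minimal way to inject speaker awareness into a PrLM-style encoder,
adding learned speaker-role and turn-position embeddings to the token
embeddings, is sketched below; it illustrates the gap being filled rather
than the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    """Adds speaker-role and turn-position embeddings to token embeddings
    so the downstream encoder can tell who said what, and when."""

    def __init__(self, vocab_size, dim=256, num_speakers=2, max_turns=64):
        super().__init__()
        self.tokens = nn.Embedding(vocab_size, dim)
        self.speakers = nn.Embedding(num_speakers, dim)
        self.turns = nn.Embedding(max_turns, dim)

    def forward(self, token_ids, speaker_ids, turn_ids):
        # All inputs: (batch, seq_len), aligned per token.
        return (self.tokens(token_ids) + self.speakers(speaker_ids)
                + self.turns(turn_ids))
```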
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Speaker-Utterance Dual Attention for Speaker and Utterance Verification [77.2346078109261]
We implement the idea of speaker-utterance dual attention (SUDA) in a unified neural network.
The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams.
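The summary gives only the high-level idea, so the following is a loose
analogue rather than SUDA itself: each stream is gated by a learned sigmoid
mask computed from the other stream. The gating form is an assumption.

```python
import torch
import torch.nn as nn

class DualStreamAttention(nn.Module):
    """Lets a speaker stream and an utterance stream gate each other
    via learned masks (a rough analogue of SUDA's mechanism)."""

    def __init__(self, dim=256):
        super().__init__()
        self.mask_spk = nn.Linear(dim, dim)  # mask computed from utterance stream
        self.mask_utt = nn.Linear(dim, dim)  # mask computed from speaker stream

    def forward(self, spk_feats, utt_feats):
        # Each stream is modulated by a sigmoid mask from the other stream.
        spk_out = spk_feats * torch.sigmoid(self.mask_spk(utt_feats))
        utt_out = utt_feats * torch.sigmoid(self.mask_utt(spk_feats))
        return spk_out, utt_out
```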
arXiv Detail & Related papers (2020-08-20T11:37:57Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
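Those two design points reduce to a shared low-level trunk feeding two
branches, one per factor. The sketch below omits the cross-modal synchrony
losses and assumes all layer sizes:

```python
import torch
import torch.nn as nn

class TwoStreamDisentangler(nn.Module):
    """Shared low-level trunk, then two branches intended to carry
    disentangled factors (e.g. speaker identity vs. linguistic content)."""

    def __init__(self, in_dim=40, trunk_dim=128, emb_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, trunk_dim), nn.ReLU())
        self.identity_head = nn.Linear(trunk_dim, emb_dim)  # slow-varying factor
        self.content_head = nn.Linear(trunk_dim, emb_dim)   # fast-varying factor

    def forward(self, frames):
        shared = self.trunk(frames)
        return self.identity_head(shared), self.content_head(shared)
```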
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
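That adaptation amounts to conditioning the enhancement network on a speaker
embedding computed from the test utterance itself; a minimal mask-based
sketch, with a generic GRU speaker encoder standing in for the paper's
extractor:

```python
import torch
import torch.nn as nn

class SpeakerAdaptiveEnhancer(nn.Module):
    """Conditions a mask-estimating enhancement network on a speaker
    embedding extracted from the (noisy) test utterance itself."""

    def __init__(self, feat_dim=257, spk_dim=128, hidden_dim=256):
        super().__init__()
        self.spk_encoder = nn.GRU(feat_dim, spk_dim, batch_first=True)
        self.enhancer = nn.GRU(feat_dim + spk_dim, hidden_dim, batch_first=True)
        self.mask = nn.Linear(hidden_dim, feat_dim)

    def forward(self, noisy_spec):
        # noisy_spec: (batch, frames, feat_dim) magnitude spectrogram
        _, h = self.spk_encoder(noisy_spec)          # h: (1, batch, spk_dim)
        spk = h[-1].unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        states, _ = self.enhancer(torch.cat([noisy_spec, spk], dim=-1))
        return noisy_spec * torch.sigmoid(self.mask(states))  # masked output
```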
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Deep Representation Learning in Speech Processing: Challenges, Recent
Advances, and Future Trends [10.176394550114411]
The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning.
Recent reviews in speech have covered ASR, SR, and SER; however, none of them has focused on representation learning from speech.
arXiv Detail & Related papers (2020-01-02T10:12:23Z)