Beyond Isolated Utterances: Conversational Emotion Recognition
- URL: http://arxiv.org/abs/2109.06112v1
- Date: Mon, 13 Sep 2021 16:40:35 GMT
- Title: Beyond Isolated Utterances: Conversational Emotion Recognition
- Authors: Raghavendra Pappagari, Piotr \.Zelasko, Jes\'us Villalba, Laureano
Moro-Velazquez, Najim Dehak
- Abstract summary: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance.
We propose several approaches for conversational emotion recognition (CER) by treating it as a sequence labeling task.
We investigated the transformer architecture for CER and compared it with ResNet-34 and BiLSTM architectures in both contextual and context-less scenarios.
- Score: 33.52961239281893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional
state given a recording of their utterance. While most of the current
approaches focus on inferring emotion from isolated utterances, we argue that
this is not sufficient to achieve conversational emotion recognition (CER),
which deals with recognizing emotions in conversations. In this work, we
propose several approaches for CER by treating it as a sequence labeling task.
We investigated the transformer architecture for CER and compared it with
ResNet-34 and BiLSTM architectures in both contextual and context-less
scenarios using the IEMOCAP corpus. Based on the inner workings of the
self-attention mechanism, we proposed DiverseCatAugment (DCA), an augmentation
scheme, which improved the transformer model's performance by an absolute 3.3%
micro-f1 on conversations and 3.6% on isolated utterances. We further enhanced
the performance by introducing an interlocutor-aware transformer model where we
learn a dictionary of interlocutor index embeddings to exploit diarized
conversations.
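
The interlocutor-aware model lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the idea as the abstract describes it: each utterance embedding is summed with a learned interlocutor index embedding drawn from a small dictionary keyed by the diarized speaker index, and a transformer encoder labels every utterance in the conversation with an emotion. All names, dimensions, and the exact way the embeddings are combined are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of an interlocutor-aware transformer for CER
# (conversational emotion recognition as sequence labeling).
# Assumptions: utterance embeddings are precomputed; interlocutor
# indices come from diarization; details differ from the paper's code.
import torch
import torch.nn as nn

class InterlocutorAwareCER(nn.Module):
    def __init__(self, utt_dim=256, n_interlocutors=2, n_emotions=4,
                 n_heads=4, n_layers=4):
        super().__init__()
        # Dictionary of interlocutor index embeddings (one per speaker slot).
        self.interlocutor_emb = nn.Embedding(n_interlocutors, utt_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=utt_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(utt_dim, n_emotions)

    def forward(self, utt_embs, speaker_ids):
        # utt_embs: (batch, n_utts, utt_dim) utterance-level embeddings
        # speaker_ids: (batch, n_utts) diarized interlocutor indices
        x = utt_embs + self.interlocutor_emb(speaker_ids)
        x = self.encoder(x)        # contextualize across the conversation
        return self.classifier(x)  # (batch, n_utts, n_emotions)

model = InterlocutorAwareCER()
logits = model(torch.randn(1, 10, 256), torch.randint(0, 2, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 4])
```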
Related papers
- ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance.
Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations.
We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
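
As a rough illustration of the reasoning-chain idea, the sketch below builds a step-by-step prompt that asks a generative language model to work from the target emotional expression back to its stimulus. The prompt wording and chain stages are invented for illustration; the paper defines its own ECR-Chain stages.

```python
# Hypothetical prompt construction for emotion-cause reasoning chains.
# The chain stages below are illustrative, not the paper's exact ECR-Chain.
def build_ecr_prompt(conversation, target_idx, target_emotion):
    history = "\n".join(f"[{i}] {spk}: {utt}"
                        for i, (spk, utt) in enumerate(conversation))
    return (
        f"Conversation:\n{history}\n\n"
        f"Utterance [{target_idx}] expresses the emotion '{target_emotion}'.\n"
        "Reason step by step:\n"
        "1. Restate the target emotional expression.\n"
        "2. Infer the speaker's inner appraisal behind it.\n"
        "3. Identify the earlier utterance(s) that stimulated this emotion.\n"
        "Answer with the index list of causal utterances."
    )

prompt = build_ecr_prompt(
    [("A", "I lost my keys again."), ("B", "Oh no, that's the third time!")],
    1, "surprise")
print(prompt)
```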
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
- Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Context-Dependent Embedding Utterance Representations for Emotion Recognition in Conversations [1.8126187844654875]
We approach Emotion Recognition in Conversations by leveraging the conversational context.
We propose context-dependent embedding representations of each utterance.
The effectiveness of our approach is validated on the open-domain DailyDialog dataset and on the task-oriented EmoWOZ dataset.
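
One simple way to realize context-dependent utterance representations, sketched below under assumptions of my own (the paper's actual encoder and context mechanism differ), is to embed each utterance jointly with a window of its neighbors rather than in isolation.

```python
# Hypothetical sketch: context-dependent utterance embeddings via a
# sliding context window. `encode` is a deterministic placeholder; a
# real system would use a pretrained sentence encoder here.
import torch

def encode(text: str) -> torch.Tensor:
    torch.manual_seed(hash(text) % (2**31))  # fake but deterministic
    return torch.randn(256)

def contextual_embeddings(utterances, window=1):
    embs = []
    for i in range(len(utterances)):
        lo, hi = max(0, i - window), min(len(utterances), i + window + 1)
        # Each utterance is encoded together with its surrounding context.
        embs.append(encode(" [SEP] ".join(utterances[lo:hi])))
    return torch.stack(embs)  # (n_utts, dim)

print(contextual_embeddings(["Hi!", "How are you?", "Great, thanks."]).shape)
```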
arXiv Detail & Related papers (2023-04-17T12:37:57Z)
- EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task.
Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
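
Of the three components, the identity-masked attention is the easiest to sketch: attention is restricted by speaker identity so that intra- and inter-speaker dependencies can be modeled separately. The sketch below is a guess at the mechanism from the component's name, not the authors' implementation.

```python
# Hypothetical sketch of identity-masked attention: one attention pass
# restricted to same-speaker positions, one to other-speaker positions.
import torch

def identity_masked_attention(x, speaker_ids, same_speaker=True):
    # x: (n_utts, dim); speaker_ids: (n_utts,)
    scores = x @ x.T / x.shape[-1] ** 0.5
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    mask = same if same_speaker else ~same
    scores = scores.masked_fill(~mask, float("-inf"))
    # A row with no allowed positions would yield NaNs; guard for the sketch.
    weights = torch.softmax(scores, dim=-1).nan_to_num(0.0)
    return weights @ x

x = torch.randn(4, 8)
spk = torch.tensor([0, 1, 0, 1])
intra = identity_masked_attention(x, spk, same_speaker=True)
inter = identity_masked_attention(x, spk, same_speaker=False)
print(intra.shape, inter.shape)
```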
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- Deep Learning of Segment-level Feature Representation for Speech Emotion Recognition in Conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and explores intra- and inter-speaker dependencies jointly.
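
A minimal sketch of one plausible reading of that pipeline, assuming segment-level VGGish features are already extracted (128-dimensional, as VGGish outputs), might look like the following; the layer sizes and attention pooling are illustrative choices, not the paper's configuration.

```python
# Hypothetical sketch: attention-pooled bidirectional GRU over
# precomputed segment-level VGGish features (128-dim per segment).
import torch
import torch.nn as nn

class AttentiveBiGRU(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, n_emotions=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scalar score per segment
        self.out = nn.Linear(2 * hidden, n_emotions)

    def forward(self, segments):
        # segments: (batch, n_segments, feat_dim)
        h, _ = self.gru(segments)               # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)  # attention over segments
        utt = (w * h).sum(dim=1)                # attentive pooling
        return self.out(utt)                    # utterance-level logits

model = AttentiveBiGRU()
print(model(torch.randn(2, 20, 128)).shape)  # torch.Size([2, 4])
```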
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
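
Late fusion of the two modalities reduces to combining per-modality predictions; a minimal sketch under assumed embedding shapes (the actual branches are a transfer-learned speaker-recognition model and a fine-tuned BERT-based model) is shown below.

```python
# Hypothetical late-fusion sketch: speech and text branches each produce
# emotion logits; a small fusion layer combines them.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, speech_dim=512, text_dim=768, n_emotions=4):
        super().__init__()
        # Stand-ins for transfer-learned speech / fine-tuned text encoders.
        self.speech_head = nn.Linear(speech_dim, n_emotions)
        self.text_head = nn.Linear(text_dim, n_emotions)
        self.fusion = nn.Linear(2 * n_emotions, n_emotions)

    def forward(self, speech_emb, text_emb):
        s = self.speech_head(speech_emb)  # speech-branch logits
        t = self.text_head(text_emb)      # text-branch logits
        return self.fusion(torch.cat([s, t], dim=-1))

model = LateFusion()
print(model(torch.randn(2, 512), torch.randn(2, 768)).shape)
```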
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Discovering Emotion and Reasoning its Flip in Multi-Party Conversations using Masked Memory Network and Transformer [16.224961520924115]
We introduce a novel task, Emotion Flip Reasoning (EFR).
EFR aims to identify past utterances that have triggered one's emotional state to flip at a certain time.
We propose a masked memory network to address the former and a Transformer-based network for the latter task.
arXiv Detail & Related papers (2021-03-23T07:42:09Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Multi-Task Learning with Auxiliary Speaker Identification for Conversational Emotion Recognition [32.439818455554885]
We exploit speaker identification (SI) as an auxiliary task to enhance the utterance representation in conversations.
By this method, we can learn better speaker-aware contextual representations from the additional SI corpus.
Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER.
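
The multi-task setup can be sketched as a shared utterance encoder with an emotion head and an auxiliary speaker-identification head trained jointly; the loss weighting and layer sizes below are assumptions for illustration.

```python
# Hypothetical multi-task sketch: shared encoder, emotion head, and an
# auxiliary speaker-identification (SI) head with a weighted joint loss.
import torch
import torch.nn as nn

class MultiTaskCER(nn.Module):
    def __init__(self, in_dim=256, hidden=128, n_emotions=4, n_speakers=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.speaker_head = nn.Linear(hidden, n_speakers)  # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        return self.emotion_head(h), self.speaker_head(h)

model = MultiTaskCER()
x = torch.randn(8, 256)
emo_logits, spk_logits = model(x)
# Joint objective: emotion loss plus down-weighted auxiliary SI loss.
loss = nn.functional.cross_entropy(emo_logits, torch.randint(0, 4, (8,))) \
     + 0.5 * nn.functional.cross_entropy(spk_logits, torch.randint(0, 10, (8,)))
loss.backward()
print(float(loss))
```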
arXiv Detail & Related papers (2020-03-03T12:25:03Z)