Multiscale Contextual Learning for Speech Emotion Recognition in
Emergency Call Center Conversations
- URL: http://arxiv.org/abs/2308.14894v1
- Date: Mon, 28 Aug 2023 20:31:45 GMT
- Title: Multiscale Contextual Learning for Speech Emotion Recognition in
Emergency Call Center Conversations
- Authors: Théo Deschamps-Berger, Lori Lamel and Laurence Devillers
- Abstract summary: This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the following tokens.
- Score: 4.297070083645049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition in conversations is essential for ensuring advanced
human-machine interactions. However, creating robust and accurate emotion
recognition systems in real life is challenging, mainly due to the scarcity of
emotion datasets collected in the wild and the inability to take into account
the dialogue context. The CEMO dataset, composed of conversations between
agents and patients during emergency calls to a French call center, fills this
gap. The nature of these interactions highlights the role of the emotional flow
of the conversation in predicting patient emotions, as context can often make a
difference in understanding actual feelings. This paper presents a multi-scale
conversational context learning approach for speech emotion recognition, which
takes advantage of this hypothesis. We investigated this approach on both
speech transcriptions and acoustic segments. Experimentally, our method uses
information preceding or following the targeted segment. In the text domain,
we tested context windows over a wide range of tokens (from 10 to 100) and
at the speech-turn level, considering inputs from both the same and opposing
speakers. According to our tests, the context derived from previous tokens has
a more significant influence on accurate prediction than the following tokens.
Furthermore, including the same speaker's most recent speech turn in the
conversation appears useful. In the acoustic domain, we conducted an in-depth
analysis of the impact of the surrounding emotions on the prediction. While
multi-scale conversational context learning using Transformers can enhance
performance in the textual modality for emergency call recordings,
incorporating acoustic context is more challenging.
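To make the context construction concrete, the sketch below shows one plausible way to assemble the textual context described above: up to a fixed number of preceding tokens, optionally preceded by the same speaker's last turn, fed to a generic Transformer text classifier. This is a minimal illustration under assumed names (the pipeline model, the dialogue fields, and the example utterances are not from the paper), not the authors' implementation.

```python
# Minimal sketch (not the authors' code): build a left-context window of up to
# `max_context_tokens` preceding tokens for a target segment, optionally
# prepending the same speaker's most recent previous turn, then classify the
# emotion with a generic Transformer text classifier. The model name, dialogue
# fields, and example utterances are illustrative assumptions; a real system
# would fine-tune the classifier on emotion-labeled transcriptions.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bert-base-multilingual-cased",  # placeholder, not the paper's model
)

def build_text_context(turns, target_idx, max_context_tokens=50,
                       use_same_speaker_turn=True):
    """turns: list of {"speaker": ..., "text": ...} dicts in conversation order.
    Returns the target utterance preceded by up to `max_context_tokens` tokens
    of left context, optionally prefixed with the same speaker's last turn."""
    target = turns[target_idx]

    # Collect tokens from previous turns, keeping the most recent ones.
    context_tokens = []
    for turn in reversed(turns[:target_idx]):
        remaining = max_context_tokens - len(context_tokens)
        if remaining <= 0:
            break
        context_tokens = turn["text"].split()[-remaining:] + context_tokens
    context = " ".join(context_tokens)

    if use_same_speaker_turn:
        # The paper reports that the same speaker's last turn seems useful;
        # here it is simply prepended (simplified: it may overlap the window).
        same = [t for t in turns[:target_idx]
                if t["speaker"] == target["speaker"]]
        if same:
            context = same[-1]["text"] + " " + context

    return (context + " " + target["text"]).strip()

# Usage: predict the emotion of the last patient turn with 50 tokens of left context.
dialogue = [
    {"speaker": "agent", "text": "Emergency services, what is your situation?"},
    {"speaker": "patient", "text": "My father collapsed and he is not responding."},
    {"speaker": "agent", "text": "Stay calm, is he breathing?"},
    {"speaker": "patient", "text": "I think so, but I am really scared."},
]
print(classifier(build_text_context(dialogue, target_idx=3)))
```

The same windowing idea applies symmetrically to following tokens, which, according to the abstract, were found less informative than preceding ones.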
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can also provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Revealing Emotional Clusters in Speaker Embeddings: A Contrastive
Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z) - Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Dynamic Causal Disentanglement Model for Dialogue Emotion Detection [77.96255121683011]
We propose a Dynamic Causal Disentanglement Model based on hidden variable separation.
This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions.
Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables.
arXiv Detail & Related papers (2023-09-13T12:58:09Z) - Empirical Interpretation of the Relationship Between Speech Acoustic
Context and Emotion Recognition [28.114873457383354]
Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech.
In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration.
This research explores the implication of acoustic context and phone boundaries on local markers for SER using an attention-based approach.
arXiv Detail & Related papers (2023-06-30T09:21:48Z) - Emotion Flip Reasoning in Multiparty Conversations [27.884015521888458]
Instigator-based Emotion Flip Reasoning (EFR) aims to identify the instigator behind a speaker's emotion flip within a conversation.
We present MELD-I, a dataset that includes ground-truth EFR instigator labels, which are in line with emotional psychology.
We propose a novel neural architecture called TGIF, which leverages Transformer encoders and stacked GRUs to capture the dialogue context.
arXiv Detail & Related papers (2023-06-24T13:22:02Z) - Context-Dependent Embedding Utterance Representations for Emotion
Recognition in Conversations [1.8126187844654875]
We approach Emotion Recognition in Conversations by leveraging the conversational context.
We propose context-dependent embedding representations of each utterance.
The effectiveness of our approach is validated on the open-domain DailyDialog dataset and on the task-oriented EmoWOZ dataset.
arXiv Detail & Related papers (2023-04-17T12:37:57Z) - deep learning of segment-level feature representation for speech emotion
recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to capture attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z) - Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on
Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)