Detecting Emotion Carriers by Combining Acoustic and Lexical
Representations
- URL: http://arxiv.org/abs/2112.06603v1
- Date: Mon, 13 Dec 2021 12:39:53 GMT
- Title: Detecting Emotion Carriers by Combining Acoustic and Lexical
Representations
- Authors: Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer and
Giuseppe Riccardi
- Abstract summary: We focus on Emotion Carriers (EC) defined as the segments that best explain the emotional state of the narrator.
EC can provide a richer representation of the user state to improve natural language understanding.
We leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives.
- Score: 7.225325393598648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personal narratives (PN) - spoken or written - are recollections of facts,
people, events, and thoughts from one's own experience. Emotion recognition and
sentiment analysis tasks are usually defined at the utterance or document
level. However, in this work, we focus on Emotion Carriers (EC) defined as the
segments (speech or text) that best explain the emotional state of the narrator
("loss of father", "made me choose"). Once extracted, such EC can provide a
richer representation of the user state to improve natural language
understanding and dialogue modeling. In previous work, it has been shown that
EC can be identified using lexical features. However, spoken narratives should
provide a richer description of the context and the users' emotional state. In
this paper, we leverage word-based acoustic and textual embeddings as well as
early and late fusion techniques for the detection of ECs in spoken narratives.
For the acoustic word-level representations, we use Residual Neural Networks
(ResNet) pretrained on separate speech emotion corpora and fine-tuned to detect
EC. Experiments with different fusion and system combination strategies show
that late fusion leads to significant improvements for this task.
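As a rough sketch of the fusion strategies the abstract contrasts, the Python snippet below combines hypothetical per-token EC posteriors from an acoustic and a lexical tagger. The numbers, the averaging rule, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical per-token EC posteriors for a six-token narrative segment.
# In the paper's setup these would come from the ResNet-based acoustic
# tagger and a lexical tagger; the numbers here are made up.
p_acoustic = np.array([0.10, 0.20, 0.85, 0.90, 0.30, 0.05])
p_lexical  = np.array([0.05, 0.15, 0.70, 0.95, 0.60, 0.10])

def late_fusion(p_a, p_l, w=0.5):
    """Combine the two unimodal posteriors after classification.
    A weighted average is one common rule; the paper's exact
    combination strategy may differ."""
    return w * p_a + (1.0 - w) * p_l

fused = late_fusion(p_acoustic, p_lexical)
print("fused posteriors:", fused.round(2))
print("EC tokens:", fused >= 0.5)   # threshold into EC / non-EC labels

# Early fusion would instead concatenate the word-level acoustic and
# textual embeddings before a single classifier, e.g.:
#   x = np.concatenate([acoustic_emb, text_emb], axis=-1)
```

One practical appeal of late fusion is that each unimodal model can be trained and calibrated independently before their outputs are combined.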
Related papers
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive
Learning Strategy for Speech Emotion Recognition [27.098672790099304] (2024-01-19)
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
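The summary above does not specify the objective, so the sketch below shows a generic NT-Xent contrastive loss of the kind commonly used for such pretraining, written in plain numpy; it is not taken from the cited paper.

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.1):
    """Generic NT-Xent contrastive loss over paired embeddings.
    z1[i] and z2[i] are two views of the same utterance (positives);
    all other batch items act as negatives. This is a standard
    formulation, not necessarily the loss used in the cited paper."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm rows
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# Toy usage with random "two-view" utterance embeddings.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(ntxent_loss(z1, z2))
```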
- Multiscale Contextual Learning for Speech Emotion Recognition in
Emergency Call Center Conversations [4.297070083645049] (2023-08-28)
This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
In our tests, context derived from preceding tokens had a greater influence on prediction accuracy than context from the following tokens.
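A minimal sketch of the left-only context assembly this finding suggests; the window size and separator string are assumptions made for illustration.

```python
# Window size and separator are illustrative assumptions.
def left_context_input(transcripts, target_idx, window=3, sep=" [SEP] "):
    """Concatenate up to `window` preceding utterances with the target
    utterance, ignoring everything that follows it."""
    start = max(0, target_idx - window)
    return sep.join(transcripts[start:target_idx + 1])

call = ["Hello, emergency services.",
        "My father collapsed, please hurry.",
        "Is he breathing?",
        "I don't know, I'm scared."]
print(left_context_input(call, target_idx=3, window=2))
```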
- Mimicking the Thinking Process for Emotion Recognition in Conversation
with Prompts and Paraphrasing [26.043447749659478] (2023-06-11)
We propose a novel framework which mimics the thinking process when modeling complex factors.
We first comprehend the conversational context with a history-oriented prompt to selectively gather information from predecessors of the target utterance.
We then model the speaker's background with an experience-oriented prompt to retrieve similar utterances from all conversations.
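A minimal sketch of the retrieval step described above, using cosine similarity over utterance embeddings; the embedding model and how the retrieved utterances enter the prompt are left open, and every name here is hypothetical.

```python
import numpy as np

def retrieve_similar(query_vec, bank_vecs, k=3):
    """Return the indices of the k nearest utterance embeddings by
    cosine similarity; a stand-in for the experience-oriented
    retrieval step, not the paper's actual procedure."""
    q = query_vec / np.linalg.norm(query_vec)
    b = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
    return np.argsort(-(b @ q))[:k]

rng = np.random.default_rng(1)
bank = rng.normal(size=(100, 32))   # embeddings of utterances from all conversations
query = rng.normal(size=32)         # embedding of the target utterance
print(retrieve_similar(query, bank))
```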
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963] (2023-05-23)
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
- Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656] (2021-11-14)
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
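The three-stage pipeline can be summarized schematically as below; every function is a placeholder standing in for a learned model (unit translation, prosody prediction, neural vocoder), not the authors' actual components.

```python
# Schematic of the three-stage pipeline; all functions are placeholders.
def convert_emotion(content_units, f0, speaker_id, target_emotion):
    translated = translate_units(content_units, target_emotion)  # stage 1
    prosody = predict_prosody(translated, f0, target_emotion)    # stage 2
    return vocode(translated, prosody, speaker_id)               # stage 3

def translate_units(units, emotion):
    # Placeholder: a seq2seq model would map units to the target emotion.
    return [f"{u}|{emotion}" for u in units]

def predict_prosody(units, f0, emotion):
    # Placeholder: a prosody model would predict F0/duration from units.
    return {"f0": f0, "emotion": emotion, "len": len(units)}

def vocode(units, prosody, speaker_id):
    # Placeholder: a neural vocoder would synthesize the waveform.
    return f"waveform({len(units)} units, speaker={speaker_id})"

print(convert_emotion(["u1", "u2", "u3"], f0=[120, 118, 125],
                      speaker_id="spk0", target_emotion="happy"))
```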
- Seen and Unseen emotional style transfer for voice conversion with a new
emotional speech dataset [84.53659233967225] (2020-10-28)
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
- Emotion Carrier Recognition from Personal Narratives [74.24768079275222] (2020-08-17)
Personal Narratives (PNs) are recollections of facts, events, and thoughts from one's own experience.
We propose a novel task for Narrative Understanding: Emotion Carrier Recognition (ECR).
- Annotation of Emotion Carriers in Personal Narratives [69.07034604580214] (2020-02-27)
We are interested in the problem of understanding personal narratives (PN) - spoken or written - recollections of facts, events, and thoughts.
In PN, emotion carriers are the speech or text segments that best explain the emotional state of the user.
This work proposes and evaluates an annotation model for identifying emotion carriers in spoken personal narratives.
- A Deep Neural Framework for Contextual Affect Detection [51.378225388679425] (2020-01-28)
A short, simple text that carries no emotion on its own can convey strong emotion when read together with its context.
We propose a Contextual Affect Detection framework which learns the inter-dependence of words in a sentence.