Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling
- URL: http://arxiv.org/abs/2312.11947v1
- Date: Tue, 19 Dec 2023 08:47:50 GMT
- Title: Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling
- Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
- Abstract summary: Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
- Score: 50.99252242917458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational Speech Synthesis (CSS) aims to accurately express an utterance
with the appropriate prosody and emotional inflection within a conversational
setting. While recognising the significance of CSS task, the prior studies have
not thoroughly investigated the emotional expressiveness problems due to the
scarcity of emotional conversational datasets and the difficulty of stateful
emotion modeling. In this paper, we propose a novel emotional CSS model, termed
ECSS, that includes two main components: 1) to enhance emotion understanding,
we introduce a heterogeneous graph-based emotional context modeling mechanism,
which takes the multi-source dialogue history as input to model the dialogue
context and learn the emotion cues from the context; 2) to achieve emotion
rendering, we employ a contrastive learning-based emotion renderer module to
infer the accurate emotion style for the target utterance. To address the issue
of data scarcity, we meticulously create emotional labels in terms of category
and intensity, and annotate additional emotional information on the existing
conversational dataset (DailyTalk). Both objective and subjective evaluations
suggest that our model outperforms the baseline models in understanding and
rendering emotions. These evaluations also underscore the importance of
comprehensive emotional annotations. Code and audio samples can be found at:
https://github.com/walker-hyf/ECSS.
Related papers
- EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
EmoSphere++ is an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech.
We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation.
We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps.
arXiv Detail & Related papers (2024-11-04T21:33:56Z) - Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised
Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
arXiv Detail & Related papers (2023-06-02T21:02:51Z) - Chat-Capsule: A Hierarchical Capsule for Dialog-level Emotion Analysis [70.98130990040228]
We propose a Context-based Hierarchical Attention Capsule(Chat-Capsule) model, which models both utterance-level and dialog-level emotions and their interrelations.
On a dialog dataset collected from customer support of an e-commerce platform, our model is also able to predict user satisfaction and emotion curve category.
arXiv Detail & Related papers (2022-03-23T08:04:30Z) - Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z) - Contrast and Generation Make BART a Good Dialogue Emotion Recognizer [38.18867570050835]
Long-range contextual emotional relationships with speaker dependency play a crucial part in dialogue emotion recognition.
We adopt supervised contrastive learning to make different emotions mutually exclusive to identify similar emotions better.
We utilize an auxiliary response generation task to enhance the model's ability of handling context information.
arXiv Detail & Related papers (2021-12-21T13:38:00Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Enhancing Cognitive Models of Emotions with Representation Learning [58.2386408470585]
We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions.
Our framework integrates a contextualized embedding encoder with a multi-head probing model.
Our model is evaluated on the Empathetic Dialogue dataset and shows the state-of-the-art result for classifying 32 emotions.
arXiv Detail & Related papers (2021-04-20T16:55:15Z) - Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network
for Emotional Conversation Generation [25.808037796936766]
In a real-world conversation, we instinctively perceive emotions from multi-source information.
We propose a heterogeneous graph-based model for emotional conversation generation.
Experimental results show that our model can effectively perceive emotions from multi-source knowledge.
arXiv Detail & Related papers (2020-12-09T06:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.