Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
- URL: http://arxiv.org/abs/2309.11724v1
- Date: Thu, 21 Sep 2023 01:51:10 GMT
- Title: Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
- Authors: Rui Liu, Bin Liu, Haizhou Li
- Abstract summary: We propose an emotion-aware prosodic phrasing model, termed EmoPP, to accurately mine the emotional cues of an utterance and predict appropriate phrase breaks.
We first conduct objective observations on the ESD dataset to validate the strong correlation between emotion and prosodic phrasing.
Objective and subjective evaluations show that EmoPP outperforms all baselines and achieves remarkable performance in terms of emotion expressiveness.
- Score: 47.02518401347879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prosodic phrasing is crucial to the naturalness and intelligibility of
end-to-end Text-to-Speech (TTS). There exist both linguistic and emotional
prosody in natural speech. As the study of prosodic phrasing has been
linguistically motivated, prosodic phrasing for expressive emotion rendering
has not been well studied. In this paper, we propose an emotion-aware prosodic
phrasing model, termed \textit{EmoPP}, to mine the emotional cues of utterance
accurately and predict appropriate phrase breaks. We first conduct objective
observations on the ESD dataset to validate the strong correlation between
emotion and prosodic phrasing. Then the objective and subjective evaluations
show that the EmoPP outperforms all baselines and achieves remarkable
performance in terms of emotion expressiveness. The audio samples and the code
are available at \url{https://github.com/AI-S2-Lab/EmoPP}.
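The abstract does not spell out EmoPP's architecture. As a rough illustration only, the sketch below shows one way an emotion-aware phrase-break predictor could be wired: a hypothetical BiLSTM tagger that consumes word embeddings together with an utterance-level emotion embedding and emits a break/no-break decision per word. All module names, dimensions, and the five-class emotion set are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an emotion-aware phrase-break predictor (not the
# authors' EmoPP implementation): a BiLSTM tagger over word embeddings,
# conditioned on an utterance-level emotion embedding, that predicts a
# break / no-break label after every word.
import torch
import torch.nn as nn

class EmotionAwareBreakTagger(nn.Module):
    def __init__(self, vocab_size=10000, num_emotions=5,
                 word_dim=256, emo_dim=64, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.emo_emb = nn.Embedding(num_emotions, emo_dim)   # e.g. ESD's 5 emotion classes
        self.encoder = nn.LSTM(word_dim + emo_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)           # break vs. no-break

    def forward(self, word_ids, emotion_id):
        # word_ids: (batch, seq_len), emotion_id: (batch,)
        words = self.word_emb(word_ids)
        emo = self.emo_emb(emotion_id).unsqueeze(1).expand(-1, words.size(1), -1)
        h, _ = self.encoder(torch.cat([words, emo], dim=-1))
        return self.classifier(h)                            # (batch, seq_len, 2) logits

# Usage: per-token break logits for one 6-word utterance with emotion id 1.
model = EmotionAwareBreakTagger()
logits = model(torch.randint(0, 10000, (1, 6)), torch.tensor([1]))
print(logits.shape)  # torch.Size([1, 6, 2])
```

The design choice the abstract implies is that the break decision is conditioned on the utterance's emotion rather than on text alone, which is what the concatenated emotion embedding models here.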
Related papers
- Exploring speech style spaces with language models: Emotional TTS without emotion labels [8.288443063900825]
We propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts.
We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs.
arXiv Detail & Related papers (2024-05-18T23:21:39Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
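A minimal sketch of how heterogeneous graph-based context modeling could look for this task, assuming one text node and one audio node per past dialogue turn with type-specific projections and mean aggregation; this illustrates the idea only and is not the paper's model.

```python
# Minimal sketch of heterogeneous-graph context aggregation for CSS
# (an illustration, not the paper's actual model): each past turn
# contributes a text node and an audio node; node types are aggregated
# separately and fused into the current utterance's context vector.
import torch
import torch.nn as nn

class HeteroContextAggregator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(dim, dim),
            "audio": nn.Linear(dim, dim),
        })
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, current, text_nodes, audio_nodes):
        # current: (dim,); text_nodes / audio_nodes: (num_turns, dim)
        text_msg = self.proj["text"](text_nodes).mean(dim=0)
        audio_msg = self.proj["audio"](audio_nodes).mean(dim=0)
        return self.fuse(torch.cat([current, text_msg, audio_msg], dim=-1))

ctx = HeteroContextAggregator()
out = ctx(torch.randn(256), torch.randn(4, 256), torch.randn(4, 256))
print(out.shape)  # torch.Size([256])
```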
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Speech Emotion Diarization: Which Emotion Appears When? [11.84193589275529]
We propose Speech Emotion Diarization (SED) to reflect the fine-grained nature of speech emotions.
Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?"
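To make the diarization framing concrete, here is a toy sketch of a segment-level output format and a lookup that answers "which emotion appears when?"; the segment structure and label set are assumptions, not the paper's exact annotation protocol.

```python
# Toy illustration of a speech-emotion-diarization output format (an
# assumption about the task, not the paper's exact protocol): an utterance
# is described by time-stamped segments, each carrying one emotion label.
from dataclasses import dataclass

@dataclass
class EmotionSegment:
    start: float   # seconds
    end: float     # seconds
    emotion: str   # e.g. "neutral", "happy", "angry", "sad"

def emotion_at(segments, t):
    """Answer 'which emotion appears when?' for a single time point t."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.emotion
    return None

hypothesis = [EmotionSegment(0.0, 1.2, "neutral"),
              EmotionSegment(1.2, 2.7, "happy"),
              EmotionSegment(2.7, 3.5, "neutral")]
print(emotion_at(hypothesis, 1.5))  # happy
```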
arXiv Detail & Related papers (2023-06-22T15:47:36Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
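One common remedy for the Neutral-heavy imbalance described above is to reweight the loss by inverse class frequency; the sketch below shows that generic variant and is not necessarily the augmentation strategy used in the paper.

```python
# Illustration of one common remedy for a Neutral-heavy class imbalance:
# inverse-frequency class weights in the cross-entropy loss. This is a
# generic technique, not necessarily the paper's augmentation method.
import torch
import torch.nn as nn

# Hypothetical class counts: Neutral dominates the emotion labels.
counts = torch.tensor([8000., 500., 500., 500., 500.])  # neutral, happy, angry, sad, surprise
weights = counts.sum() / (len(counts) * counts)          # inverse-frequency weights
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(16, 5)                # (batch, num_emotions) from any SER model
labels = torch.randint(0, 5, (16,))
loss = criterion(logits, labels)           # minority classes now contribute more
print(loss.item())
```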
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
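A minimal sketch of the run-time control described above, assuming the mixture is formed by blending learned per-emotion embeddings with a manually defined attribute vector; the embedding table and dimensions are illustrative, not the paper's formulation.

```python
# Sketch of run-time emotion mixing via an attribute vector (an illustration
# of the idea in the summary, not the paper's exact formulation): the desired
# mixture weights blend per-emotion embeddings into one conditioning vector.
import torch
import torch.nn as nn

emotions = ["neutral", "happy", "angry", "sad", "surprise"]
emotion_table = nn.Embedding(len(emotions), 128)   # learned per-emotion embeddings

# Manually defined attribute vector: mostly happy with a touch of surprise.
attribute = torch.tensor([0.0, 0.7, 0.0, 0.0, 0.3])
mixture = attribute @ emotion_table.weight          # (128,) conditioning vector
# `mixture` would then replace a single-emotion embedding as the condition
# fed to the sequence-to-sequence TTS decoder.
print(mixture.shape)  # torch.Size([128])
```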
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
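A hypothetical sketch of the "predict emotion from text, then condition on its embedding" idea: a toy GRU classifier produces an emotion distribution from the input text and returns a soft emotion embedding for the synthesizer. Names and sizes are assumptions, not the EMOVIE model.

```python
# Hypothetical sketch: predict an emotion distribution from text alone and
# turn it into a conditioning embedding (no reference audio needed). This is
# an illustration, not the EMOVIE model.
import torch
import torch.nn as nn

class TextEmotionPredictor(nn.Module):
    def __init__(self, vocab=6000, dim=256, num_emotions=5):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.cls = nn.Linear(dim, num_emotions)
        self.emo_table = nn.Embedding(num_emotions, dim)  # conditioning embeddings

    def forward(self, token_ids):
        _, h = self.gru(self.emb(token_ids))
        probs = self.cls(h[-1]).softmax(-1)               # predicted emotion distribution
        return probs @ self.emo_table.weight              # soft emotion embedding for TTS

pred = TextEmotionPredictor()
emo_vec = pred(torch.randint(0, 6000, (1, 12)))           # text tokens only
print(emo_vec.shape)  # torch.Size([1, 256])
```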
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
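The following is a generic, toy REINFORCE-style update illustrating how an emotion recognizer's judgment could serve as a reward signal; it is a sketch under stated assumptions, not the i-ETTS training recipe.

```python
# Toy REINFORCE-style update illustrating the idea of using an emotion
# recognizer's score as a reward for ETTS training (a generic sketch, not
# the i-ETTS training recipe).
import torch
import torch.nn as nn

policy = nn.Linear(10, 4)                    # stand-in for the TTS model's action head
ser_reward = lambda action, target: (action == target).float()  # stand-in SER reward

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
state = torch.randn(8, 10)                   # e.g. encoded text + emotion condition
target_emotion = torch.randint(0, 4, (8,))

logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                       # sampled (discretized) synthesis decision
reward = ser_reward(action, target_emotion)  # 1 if the recognizer "hears" the target emotion
loss = -(dist.log_prob(action) * reward).mean()
loss.backward()
opt.step()
```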
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
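A schematic sketch of the two-stage idea, assuming stage one pre-trains a seq2seq mapping on abundant TTS-style data and stage two adapts it on the limited emotional data; the module choices and what gets frozen are illustrative assumptions, not the paper's exact recipe.

```python
# Schematic two-stage training loop for limited-data emotional voice
# conversion (an illustration of the strategy described above; module names
# and the exact split of trainable parts are assumptions).
import torch
import torch.nn as nn

encoder = nn.GRU(80, 256, batch_first=True)      # stand-in seq2seq components
decoder = nn.GRU(256, 80, batch_first=True)

def train(modules, data, steps):
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(steps):
        src, tgt = data()                         # (batch, frames, 80) mel pairs
        h, _ = encoder(src)
        out, _ = decoder(h)
        loss = nn.functional.l1_loss(out, tgt)
        opt.zero_grad(); loss.backward(); opt.step()

fake_tts = lambda: (torch.randn(4, 100, 80), torch.randn(4, 100, 80))
fake_emotional = lambda: (torch.randn(4, 100, 80), torch.randn(4, 100, 80))

# Stage 1: learn a general spectrum/prosody mapping from abundant TTS data.
train([encoder, decoder], fake_tts, steps=5)
# Stage 2: adapt to the target emotion with the limited emotional data,
# freezing the encoder here purely for illustration.
for p in encoder.parameters():
    p.requires_grad = False
train([decoder], fake_emotional, steps=5)
```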
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.