Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability
- URL: http://arxiv.org/abs/2104.01408v1
- Date: Sat, 3 Apr 2021 13:52:47 GMT
- Title: Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability
- Authors: Rui Liu, Berrak Sisman, Haizhou Li
- Abstract summary: Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
- Score: 82.39099867188547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional text-to-speech synthesis (ETTS) has seen much progress in recent
years. However, the generated voice is often not perceptually identifiable by
its intended emotion category. To address this problem, we propose a new
interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to
directly improve the emotion discriminability by interacting with a speech
emotion recognition (SER) model. Moreover, we formulate an iterative training
strategy with reinforcement learning to ensure the quality of i-ETTS
optimization. Experimental results demonstrate that the proposed i-ETTS
outperforms the state-of-the-art baselines by rendering speech with more
accurate emotion style. To our best knowledge, this is the first study of
reinforcement learning in emotional text-to-speech synthesis.
Related papers
- Exploring speech style spaces with language models: Emotional TTS without emotion labels [8.288443063900825]
We propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts.
We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs.
arXiv Detail & Related papers (2024-05-18T23:21:39Z) - Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Learning Emotional Representations from Imbalanced Speech Data for
Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
arXiv Detail & Related papers (2023-06-09T07:04:56Z) - ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z) - Detecting Emotion Primitives from Speech and their use in discerning
Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.