Emotional Prosody Control for Speech Generation
- URL: http://arxiv.org/abs/2111.04730v1
- Date: Sun, 7 Nov 2021 08:52:04 GMT
- Title: Emotional Prosody Control for Speech Generation
- Authors: Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
- Abstract summary: We propose a text-to-speech (TTS) system where a user can choose the emotion of generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from text in any speaker's style, with fine-grained control of emotion.
- Score: 7.66200737962746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine-generated speech is characterized by its limited or unnatural
emotional variation. Current text-to-speech systems generate speech with
either a flat emotion, an emotion selected from a predefined set, an average
variation learned from the prosody sequences in the training data, or a style
transferred from a source recording. We propose a text-to-speech (TTS) system
where a user can choose the emotion of generated speech from a continuous and
meaningful emotion space (Arousal-Valence space). The proposed TTS system can
generate speech from text in any speaker's style, with fine-grained control of
emotion. We show that the system works for emotions unseen during training and
can scale to previously unseen speakers given a sample of their speech. Our
work extends the state-of-the-art FastSpeech2 backbone to a multi-speaker
setting and gives it much-coveted continuous (and interpretable) affective
control, without any observable degradation in the quality of the synthesized
speech.
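To make the described control interface concrete, the following is a minimal PyTorch sketch of conditioning a FastSpeech2-style variance adaptor on a continuous arousal-valence vector; module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AffectConditionedVarianceAdaptor(nn.Module):
    """Toy stand-in for FastSpeech2's variance adaptor with affect control."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Project the 2-D (arousal, valence) point into the encoder space.
        self.affect_proj = nn.Linear(2, hidden_dim)
        # Stand-ins for FastSpeech2's pitch/energy predictors.
        self.pitch_predictor = nn.Linear(hidden_dim, 1)
        self.energy_predictor = nn.Linear(hidden_dim, 1)

    def forward(self, encoder_out: torch.Tensor, affect: torch.Tensor):
        # encoder_out: (batch, time, hidden); affect: (batch, 2) in [-1, 1].
        h = encoder_out + self.affect_proj(affect).unsqueeze(1)
        return self.pitch_predictor(h), self.energy_predictor(h)

adaptor = AffectConditionedVarianceAdaptor()
enc = torch.randn(1, 50, 256)     # phoneme encoder output
av = torch.tensor([[0.8, -0.3]])  # high arousal, mildly negative valence
pitch, energy = adaptor(enc, av)  # per-frame prosody targets
```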
Related papers
- EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
EmoSphere++ is an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech.
We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation.
We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps.
arXiv Detail & Related papers (2024-11-04T21:33:56Z)
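One plausible reading of an "emotion-adaptive spherical vector" is a change of coordinates in which the radius encodes intensity and the angles encode style. The sketch below converts Cartesian arousal-valence-dominance (AVD) coordinates to spherical ones; it is purely illustrative and not EmoSphere++'s actual formulation.

```python
import math

def avd_to_spherical(a: float, v: float, d: float):
    r = math.sqrt(a * a + v * v + d * d)      # radius ~ emotion intensity
    theta = math.atan2(v, a)                  # style angle in the A-V plane
    phi = math.acos(d / r) if r > 0 else 0.0  # elevation toward dominance
    return r, theta, phi

print(avd_to_spherical(0.7, -0.2, 0.4))       # (intensity, style angles)
```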
- Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis [3.8251125989631674]
We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system.
It derives the conveyed emotion from the text input and synthesizes audio that foregrounds emotion and speaker features for natural and expressive speech.
Our system shows competitive inference-time performance when benchmarked against state-of-the-art TTS models.
arXiv Detail & Related papers (2024-10-24T23:18:02Z)
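A hedged sketch of the text-driven conditioning this entry describes: an emotion label is derived from the input text and its embedding conditions the synthesizer, so no reference audio is needed. The keyword heuristic stands in for a learned text-emotion model and is illustrative only.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "angry", "sad"]

def emotion_from_text(text: str) -> int:
    # Stand-in for a learned text-emotion model (e.g. a fine-tuned encoder).
    return EMOTIONS.index("happy") if "!" in text else EMOTIONS.index("neutral")

emo_embedding = nn.Embedding(len(EMOTIONS), 256)  # conditions the synthesizer
label = emotion_from_text("What wonderful news!")
condition = emo_embedding(torch.tensor([label]))
print(EMOTIONS[label], condition.shape)           # happy torch.Size([1, 256])
```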
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training.
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
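A simple way to expose pleasure-arousal-dominance (PAD) control, sketched below, is a lookup from categorical labels to PAD coordinates plus linear interpolation; the numeric values are rough placeholders, not taken from the paper.

```python
# Illustrative placeholder coordinates, not values from the paper.
PAD = {
    "neutral": (0.0, 0.0, 0.0),
    "happy":   (0.8, 0.5, 0.4),
    "angry":   (-0.6, 0.8, 0.6),
    "sad":     (-0.7, -0.4, -0.5),
    "fearful": (-0.6, 0.7, -0.6),
}

def blend(e1: str, e2: str, t: float):
    """Linearly interpolate between two emotions in PAD space."""
    return tuple((1 - t) * a + t * b for a, b in zip(PAD[e1], PAD[e2]))

print(blend("neutral", "angry", 0.5))  # a mildly angry conditioning point
```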
- EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech [0.0]
State-of-the-art speech models try to get as close as possible to the human voice.
Modelling emotions is an essential part of Text-To-Speech (TTS) research.
EmoSpeech surpasses existing models in both MOS and emotion recognition accuracy for generated speech.
arXiv Detail & Related papers (2023-06-28T19:34:16Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
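A hedged sketch of the zero-shot conditioning interface the entry describes: a speaker embedding extracted from a short neutral clip is combined with an emotion-label embedding to condition the decoder. All modules and sizes are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "angry", "sad"]

class ZeroShotCondition(nn.Module):
    def __init__(self, dim: int = 192):
        super().__init__()
        self.spk_encoder = nn.GRU(80, dim, batch_first=True)  # mel -> speaker
        self.emo_table = nn.Embedding(len(EMOTIONS), dim)

    def forward(self, neutral_mel: torch.Tensor, emotion: str) -> torch.Tensor:
        _, spk = self.spk_encoder(neutral_mel)  # final state: (1, batch, dim)
        emo = self.emo_table(torch.tensor([EMOTIONS.index(emotion)]))
        return spk.squeeze(0) + emo             # joint condition vector

cond = ZeroShotCondition()
mel = torch.randn(1, 120, 80)                   # a short neutral sample
print(cond(mel, "angry").shape)                 # torch.Size([1, 192])
```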
- Emotion Selectable End-to-End Text-based Speech Editing [63.346825713704625]
Emo-CampNet (emotion CampNet) is an emotion-selectable text-based speech editing model.
It can effectively control the emotion of the generated speech in the process of text-based speech editing.
It can also edit unseen speakers' speech.
arXiv Detail & Related papers (2022-12-20T12:02:40Z)
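A sketch, under loose assumptions, of emotion-selectable text-based editing: frames covering the edited region are masked and regenerated under an emotion condition while the rest of the utterance is left intact. The in-painting generator below is a stub, not the CampNet architecture.

```python
import torch
import torch.nn as nn

gen = nn.Linear(80 + 4, 80)                   # stub mel in-painter

def edit(mel, start, end, emotion_onehot):
    masked = mel.clone()
    masked[start:end] = 0.0                   # hide the edited region
    cond = emotion_onehot.expand(masked.shape[0], -1)
    filled = gen(torch.cat([masked, cond], dim=-1))
    out = mel.clone()
    out[start:end] = filled[start:end]        # splice regenerated frames
    return out

mel = torch.randn(100, 80)                    # (frames, mel bins)
happy = torch.tensor([[0.0, 1.0, 0.0, 0.0]])  # 4-way emotion one-hot
edited = edit(mel, 30, 50, happy)
print(edited.shape)                           # torch.Size([100, 80])
```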
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
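The decomposed pipeline reads naturally as three stages, sketched below with stub components: translate discrete content units to the target emotion, predict F0 from the translated units, then vocode. The real system uses learned models at each stage; everything here is a placeholder.

```python
import torch
import torch.nn as nn

class StubTranslator(nn.Module):    # content units -> emotional units
    def forward(self, units, target_emotion_id):
        return units                # identity stand-in

class StubF0Predictor(nn.Module):   # units -> frame-level F0
    def forward(self, units):
        return torch.full((units.shape[0],), 120.0)

class StubVocoder(nn.Module):       # (units, F0, ids) -> waveform
    def forward(self, units, f0, speaker_id, emotion_id):
        return torch.zeros(units.shape[0] * 320)  # placeholder audio

units = torch.randint(0, 100, (50,))  # e.g. quantized self-supervised units
translated = StubTranslator()(units, target_emotion_id=2)
f0 = StubF0Predictor()(translated)
wave = StubVocoder()(translated, f0, speaker_id=0, emotion_id=2)
print(wave.shape)                     # torch.Size([16000])
```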
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
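A minimal sketch of the dataset-validation step mentioned above: train a small classifier on utterance features against the human emotion labels, where above-chance accuracy suggests the annotations carry signal. Features, label-set size, and training setup are illustrative stand-ins.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 5                                  # illustrative label-set size
feats = torch.randn(9724, 128)                    # stand-in utterance embeddings
labels = torch.randint(0, NUM_EMOTIONS, (9724,))  # stand-in human annotations

clf = nn.Linear(128, NUM_EMOTIONS)
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
for _ in range(5):                                # a few full-batch epochs
    loss = nn.functional.cross_entropy(clf(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

acc = (clf(feats).argmax(dim=-1) == labels).float().mean().item()
print(f"train accuracy: {acc:.2%}")
```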
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
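The two-stage strategy can be sketched as two passes of an ordinary training loop: pre-train the sequence-to-sequence model on plentiful neutral TTS data, then fine-tune on the limited emotional data. Model and data below are stubs, not the paper's architecture.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 80)  # stand-in for the seq2seq conversion model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train(batches, epochs):
    for _ in range(epochs):
        for src, tgt in batches:
            loss = nn.functional.l1_loss(model(src), tgt)
            opt.zero_grad()
            loss.backward()
            opt.step()

tts_batches = [(torch.randn(8, 80), torch.randn(8, 80))]        # plentiful neutral data
emotional_batches = [(torch.randn(8, 80), torch.randn(8, 80))]  # limited emotional data
train(tts_batches, epochs=10)       # stage 1: learn the text-to-speech mapping
train(emotional_batches, epochs=3)  # stage 2: adapt to emotional prosody
```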
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
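For orientation, a hedged sketch of the shape of a VAW-GAN generator objective: a VAE reconstruction-plus-KL term combined with a Wasserstein critic term. Weights and tensors are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def vaw_gan_generator_loss(x, x_rec, mu, logvar, critic_score,
                           kl_weight=0.01, adv_weight=1.0):
    rec = F.l1_loss(x_rec, x)  # VAE reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Wasserstein generator term: raise the critic's score on generated data.
    return rec + kl_weight * kl - adv_weight * critic_score.mean()

x = torch.randn(4, 80)
loss = vaw_gan_generator_loss(x, torch.randn(4, 80),
                              torch.zeros(4, 16), torch.zeros(4, 16),
                              critic_score=torch.randn(4, 1))
print(loss.item())
```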
This list is automatically generated from the titles and abstracts of the papers on this site.