Hierarchical prosody modeling and control in non-autoregressive parallel
neural TTS
- URL: http://arxiv.org/abs/2110.02952v1
- Date: Wed, 6 Oct 2021 17:58:42 GMT
- Authors: Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri
- Abstract summary: We train a non-autoregressive parallel neural TTS model hierarchically conditioned on coarse and fine-grained acoustic speech features.
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural text-to-speech (TTS) synthesis can generate speech that is
indistinguishable from natural speech. However, the synthetic speech often
represents the average prosodic style of the database instead of having more
versatile prosodic variation. Moreover, many models lack the ability to control
the output prosody, which does not allow for different styles for the same text
input. In this work, we train a non-autoregressive parallel neural TTS model
hierarchically conditioned on both coarse and fine-grained acoustic speech
features to learn a latent prosody space with intuitive and meaningful
dimensions. Experiments show that a non-autoregressive TTS model hierarchically
conditioned on utterance-wise pitch, pitch range, duration, energy, and
spectral tilt can effectively control each prosodic dimension, generate a wide
variety of speaking styles, and provide word-wise emphasis control, while
maintaining quality equal to or better than that of the baseline model.
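The coarse (utterance-wise) conditioning described above can be sketched in plain Python. This is an illustrative assumption of how the five utterance-level features might be computed and broadcast onto phoneme encodings; the function names, dimensions, and feature definitions are hypothetical, not the paper's exact architecture.

```python
def utterance_prosody_vector(f0, durations, energies, spectral_tilt):
    """Collapse frame- and phone-level tracks into the five utterance-wise
    scalars used as coarse conditioning: pitch, pitch range, duration,
    energy, and spectral tilt (assumed here to be a single scalar)."""
    voiced = [f for f in f0 if f > 0]          # drop unvoiced (zero) frames
    pitch = sum(voiced) / len(voiced)          # mean F0 over voiced frames
    pitch_range = max(voiced) - min(voiced)    # F0 spread within the utterance
    duration = sum(durations) / len(durations) # mean phone duration
    energy = sum(energies) / len(energies)     # mean frame energy
    return [pitch, pitch_range, duration, energy, spectral_tilt]

def condition_encoder_outputs(phoneme_encodings, prosody_vec):
    """Broadcast the utterance-level prosody vector and concatenate it onto
    every phoneme encoding (coarse stage); a fine-grained stage would append
    phone-level features to each position in the same way."""
    return [enc + prosody_vec for enc in phoneme_encodings]

vec = utterance_prosody_vector(
    f0=[100.0, 0.0, 120.0], durations=[5, 7],
    energies=[0.5, 0.7], spectral_tilt=-12.0)
conditioned = condition_encoder_outputs([[1.0, 2.0], [3.0, 4.0]], vec)
```

At inference time, control amounts to overriding entries of `prosody_vec` (e.g., raising the pitch-range scalar) before broadcasting, which is what makes the latent dimensions intuitive to manipulate.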
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Prosody-controllable spontaneous TTS with neural HMMs [11.472325158964646]
We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets.
We add utterance-level prosody control to an existing neural HMM-based TTS system.
We evaluate the system's capability of synthesizing two types of creaky voice.
arXiv Detail & Related papers (2022-11-24T11:06:11Z)
- StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis [23.17929822987861]
StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets.
arXiv Detail & Related papers (2022-05-30T21:34:40Z)
- Emphasis control for parallel neural TTS [8.039245267912511]
The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody.
Recent parallel neural text-to-speech (TTS) methods are able to generate high-fidelity speech while maintaining fast synthesis.
This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis.
arXiv Detail & Related papers (2021-10-06T18:45:39Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we obtain latent linguistic embeddings that perform consistently across TTS and voice conversion.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Controllable neural text-to-speech synthesis using intuitive prosodic features [3.709803838880226]
We train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions.
Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
arXiv Detail & Related papers (2020-09-14T22:37:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.