Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis
- URL: http://arxiv.org/abs/2011.06465v3
- Date: Sat, 1 May 2021 07:59:07 GMT
- Title: Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis
- Authors: Chung-Ming Chien and Hung-yi Lee
- Abstract summary: We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
- Score: 76.39883780990489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prosody modeling is an essential component in modern text-to-speech (TTS)
frameworks. By explicitly providing prosody features to the TTS model, the
style of synthesized utterances can thus be controlled. However, predicting
natural and reasonable prosody at inference time is challenging. In this work,
we analyzed the behavior of non-autoregressive TTS models under different
prosody-modeling settings and proposed a hierarchical architecture, in which
the prediction of phoneme-level prosody features is conditioned on the
word-level prosody features. The proposed method outperforms other competitors
in terms of audio quality and prosody naturalness in our objective and
subjective evaluation.
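The hierarchical conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling, the linear predictors, the function names, and the toy weights are all assumptions chosen for illustration. It only shows the idea that a word-level prosody value is predicted first and then fed, together with each phoneme's own features, into the phoneme-level predictor.

```python
# Hypothetical sketch of hierarchical prosody prediction (not the paper's code).
# Assumptions: scalar prosody values, linear predictors, mean pooling over the
# phonemes of each word, and word ids that run contiguously from 0.

def linear(x, w, b):
    """Return the dot product of w and x, plus the bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def predict_hierarchical(phoneme_feats, word_ids, word_w, word_b, phon_w, phon_b):
    """Predict word-level prosody, then phoneme-level prosody conditioned on it."""
    # Group phoneme feature vectors by the word they belong to.
    groups = [[] for _ in range(max(word_ids) + 1)]
    for feat, wid in zip(phoneme_feats, word_ids):
        groups[wid].append(feat)

    # Step 1: one prosody value per word, from the pooled phoneme features.
    word_prosody = [linear(mean_pool(g), word_w, word_b) for g in groups]

    # Step 2: one prosody value per phoneme; the input is the phoneme's own
    # features concatenated with the prosody value of its parent word.
    phoneme_prosody = [
        linear(feat + [word_prosody[wid]], phon_w, phon_b)
        for feat, wid in zip(phoneme_feats, word_ids)
    ]
    return word_prosody, phoneme_prosody

# Toy input: three phonemes, the first two in word 0 and the last in word 1.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids = [0, 0, 1]
words, phonemes = predict_hierarchical(
    feats, ids, word_w=[1.0, 1.0], word_b=0.0, phon_w=[0.5, 0.5, 1.0], phon_b=0.0
)
print(words)     # [1.0, 2.0]
print(phonemes)  # [1.5, 1.5, 3.0]
```

In this toy run, the third phoneme receives a higher value than the first two only because its parent word's predicted prosody is higher, which is the conditioning effect the sketch is meant to show.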
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [7.531331499935223]
We train a non-autoregressive parallel neural TTS model hierarchically conditioned on coarse and fine-grained acoustic speech features.
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension.
arXiv Detail & Related papers (2021-10-06T17:58:42Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis [6.509758931804479]
We propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Phone Features Improve Speech Translation [69.54616570679343]
End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT).
We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines.
We show that these features improve both architectures, closing the gap between end-to-end models and cascades, and outperforming previous academic work -- by up to 9 BLEU on our low-resource setting.
arXiv Detail & Related papers (2020-05-27T22:05:10Z)