ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in
Paragraph-based TTS
- URL: http://arxiv.org/abs/2209.06484v1
- Date: Wed, 14 Sep 2022 08:34:16 GMT
- Title: ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in
Paragraph-based TTS
- Authors: Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie
- Abstract summary: We propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training.
The model is trained on a storytelling audio-book corpus (4.08 hours) recorded by a female Mandarin Chinese speaker.
The proposed TTS model produces natural, good-quality speech at the paragraph level.
- Score: 19.988974534582205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in neural end-to-end TTS models have yielded high-quality,
natural synthesized speech in conventional sentence-based TTS. However, it remains
challenging to reproduce similarly high quality when a whole paragraph is
synthesized, since a large amount of contextual information must be
modeled in a paragraph-based TTS system. To alleviate the difficulty
in training, we propose to model linguistic and prosodic information by
considering cross-sentence, embedded structure in training. Three sub-modules,
including linguistics-aware, prosody-aware and sentence-position networks, are
trained together with a modified Tacotron2. Specifically, to learn the
information embedded in a paragraph and the relations among the corresponding
component sentences, we utilize linguistics-aware and prosody-aware networks.
The information in a paragraph is captured by encoders and the inter-sentence
information in a paragraph is learned with multi-head attention mechanisms. The
relative sentence position in a paragraph is explicitly exploited by a
sentence-position network. Trained on a storytelling audio-book corpus (4.08
hours) recorded by a female Mandarin Chinese speaker, the proposed TTS model
produces natural, good-quality speech at the paragraph level. Cross-sentence
contextual information, such as breaks and prosodic variations between
consecutive sentences, is predicted and rendered better than with the
sentence-based model. Tested on paragraphs whose lengths are similar to,
longer than, or much longer than the typical paragraph length in the training
data, the speech produced by the new model is consistently preferred over that
of the sentence-based model in subjective tests, and this preference is
confirmed by objective measures.
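To make the cross-sentence modeling concrete, the following is a minimal PyTorch sketch of the mechanism the abstract describes: per-sentence encoder states attend over paragraph-level context states with multi-head attention, and an explicit relative sentence-position embedding is added before decoding. All names, dimensions, and the additive fusion are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of ParaTTS-style cross-sentence context fusion.
# Module names, sizes, and the additive fusion are assumptions for
# illustration; they are not taken from the paper's released code.
import torch
import torch.nn as nn


class CrossSentenceContext(nn.Module):
    """Fuses paragraph-level context into per-sentence encoder states.

    Mirrors the three sub-modules named in the abstract: paragraph states
    (e.g., from linguistics-aware / prosody-aware encoders) are read via
    multi-head attention, and a relative sentence-position embedding
    stands in for the sentence-position network.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 max_sentences: int = 32):
        super().__init__()
        # Inter-sentence information: sentence states query the paragraph.
        self.context_attn = nn.MultiheadAttention(d_model, n_heads,
                                                  batch_first=True)
        # Explicit relative sentence position within the paragraph.
        self.pos_emb = nn.Embedding(max_sentences, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self,
                sent_states: torch.Tensor,   # (B, T_sent, d_model)
                para_states: torch.Tensor,   # (B, T_para, d_model)
                sent_index: torch.Tensor     # (B,) sentence index in paragraph
                ) -> torch.Tensor:
        ctx, _ = self.context_attn(query=sent_states,
                                   key=para_states,
                                   value=para_states)
        pos = self.pos_emb(sent_index).unsqueeze(1)  # (B, 1, d_model)
        # Additive fusion; the context-augmented states would feed a
        # Tacotron2-style decoder in place of plain encoder outputs.
        return self.norm(sent_states + ctx + pos)


if __name__ == "__main__":
    fuser = CrossSentenceContext()
    sent = torch.randn(2, 50, 256)   # encoder states for one sentence
    para = torch.randn(2, 400, 256)  # paragraph-level context states
    idx = torch.tensor([0, 3])       # where each sentence sits in its paragraph
    print(fuser(sent, para, idx).shape)  # torch.Size([2, 50, 256])
```

Additive fusion keeps the decoder interface unchanged; concatenation followed by a projection would be an equally plausible reading of the paper's attention-based context modeling.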
Related papers
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model either explicitly, via Text-To-Speech (TTS) conversion, or implicitly, by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute relative to i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)