An investigation of speaker independent phrase break models in
End-to-End TTS systems
- URL: http://arxiv.org/abs/2304.04157v2
- Date: Fri, 21 Apr 2023 05:03:27 GMT
- Title: An investigation of speaker independent phrase break models in
End-to-End TTS systems
- Authors: Anandaswarup Vadapalli
- Abstract summary: We evaluate the utility and effectiveness of phrase break prediction models in an end-to-end TTS system.
We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our work on phrase break prediction in the context of
end-to-end TTS systems, motivated by the following questions: (i) Is there any
utility in incorporating an explicit phrasing model in an end-to-end TTS
system? and (ii) How do you evaluate the effectiveness of a phrasing model in
an end-to-end TTS system? In particular, the utility and effectiveness of
phrase break prediction models are evaluated in the context of children's
story synthesis, using listener comprehension. We show by means of perceptual
listening evaluations that there is a clear preference for stories synthesized
after predicting the location of phrase breaks using a trained phrasing model,
over stories directly synthesized without predicting the location of phrase
breaks.
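To make the role of such a phrasing model concrete, here is a minimal sketch of how predicted phrase breaks can be spliced into the input text before it is handed to an end-to-end TTS front end. The tag set, the `dummy_phrasing_model` stub, and the comma used as a break marker are illustrative assumptions; the summary above does not specify the architecture or markup scheme of the paper's trained phrasing model.

```python
# Illustrative sketch only: the tags, stub model, and break marker below are
# assumptions, not the paper's actual phrasing model or markup scheme.
from typing import Callable, List

BREAK_TAG = "B"      # hypothetical label: insert a phrase break after this word
NO_BREAK_TAG = "NB"  # hypothetical label: no break after this word

def insert_phrase_breaks(
    words: List[str],
    phrasing_model: Callable[[List[str]], List[str]],
    break_marker: str = ",",
) -> str:
    """Tag each word with a (hypothetical) phrasing model and splice a break
    marker into the text wherever a phrase break is predicted. The marked-up
    text is then passed to the end-to-end TTS system as usual."""
    tags = phrasing_model(words)
    out = []
    for word, tag in zip(words, tags):
        out.append(word)
        if tag == BREAK_TAG:
            out.append(break_marker)
    return " ".join(out)

# Stub standing in for a trained phrasing model.
def dummy_phrasing_model(words: List[str]) -> List[str]:
    return [BREAK_TAG if w.lower() == "forest" else NO_BREAK_TAG for w in words]

text = "Once upon a time a fox lived in a forest near a small village"
print(insert_phrase_breaks(text.split(), dummy_phrasing_model))
```

Any sequence-labelling model producing break/no-break tags per word could stand in for the stub; the point is simply that phrasing is decided explicitly before synthesis rather than left implicit in the end-to-end model.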
Related papers
- Word-wise intonation model for cross-language TTS systems [0.0]
The proposed model is suitable for automatic data markup and can be extended to text-to-speech systems.
The key idea is a partial elimination of the variability connected with different placements of a stressed syllable in a word.
The proposed model could be used as a tool for intonation research or as a backbone for prosody description in text-to-speech systems.
arXiv Detail & Related papers (2024-09-30T15:09:42Z)
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z)
- ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS [19.988974534582205]
We propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training.
We trained on a storytelling audio-book corpus (4.08 hours), recorded by a female Mandarin Chinese speaker.
The proposed TTS model is shown to produce rather natural, good-quality speech at the paragraph level.
arXiv Detail & Related papers (2022-09-14T08:34:16Z)
- BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate sparsity and its effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility (an illustrative magnitude-pruning sketch appears after this list).
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning [89.64620296557177]
We propose to incorporate the syntactic structures of the sentences into the deep learning models for targeted opinion word extraction.
We also introduce a novel regularization technique to improve the performance of the deep learning models.
The proposed model is extensively analyzed and achieves the state-of-the-art performance on four benchmark datasets.
arXiv Detail & Related papers (2020-10-26T07:13:17Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) synthesis.
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
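As a side note on the sparsity paper listed above, the following is a generic magnitude-pruning sketch, not the cited paper's procedure: it applies PyTorch's built-in unstructured L1 pruning to a small stand-in network and reports the achieved sparsity. The toy architecture and the 80% sparsity target are arbitrary choices for illustration.

```python
# Generic magnitude-pruning sketch (not the cited paper's exact method):
# prune a stand-in network to a target sparsity and report the result.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Tiny stand-in for an acoustic model; real end-to-end TTS models are far larger.
model = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 80),  # e.g. 80-dim mel-spectrogram frames
)

# Prune the globally smallest-magnitude 80% of all Linear weights.
params_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.8,
)

# Fold the pruning masks into the weights so the zeros become permanent.
for module, name in params_to_prune:
    prune.remove(module, name)

zeros = sum(int((m.weight == 0).sum()) for m, _ in params_to_prune)
total = sum(m.weight.numel() for m, _ in params_to_prune)
print(f"overall weight sparsity: {zeros / total:.1%}")
```

In practice one would prune a trained TTS acoustic model and then re-evaluate naturalness and intelligibility of the synthesized speech, which is the kind of comparison the cited study reports.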
This list is automatically generated from the titles and abstracts of the papers on this site.