An investigation of speaker independent phrase break models in
End-to-End TTS systems
- URL: http://arxiv.org/abs/2304.04157v2
- Date: Fri, 21 Apr 2023 05:03:27 GMT
- Title: An investigation of speaker independent phrase break models in
End-to-End TTS systems
- Authors: Anandaswarup Vadapalli
- Abstract summary: We evaluate the utility and effectiveness of phrase break prediction models in an end-to-end TTS system.
We show by means of perceptual listening evaluations that there is a clear preference for stories synthesized after predicting the location of phrase breaks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our work on phrase break prediction in the context of
end-to-end TTS systems, motivated by the following questions: (i) Is there any
utility in incorporating an explicit phrasing model in an end-to-end TTS
system?, and (ii) How do you evaluate the effectiveness of a phrasing model in
an end-to-end TTS system? In particular, the utility and effectiveness of
phrase break prediction models are evaluated in in the context of childrens
story synthesis, using listener comprehension. We show by means of perceptual
listening evaluations that there is a clear preference for stories synthesized
after predicting the location of phrase breaks using a trained phrasing model,
over stories directly synthesized without predicting the location of phrase
breaks.
Related papers
- Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present a novel two-stage prediction pipeline, named TAP-FM, proposed in this paper.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which hammers at acquiring richer insights via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z) - Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic
Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z) - Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic
Token Prediction [14.661123738628772]
We introduce a text-to-speech(TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints.
arXiv Detail & Related papers (2023-11-06T06:13:39Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Duration-aware pause insertion using pre-trained language model for
multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z) - Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z) - A Complementary Joint Training Approach Using Unpaired Speech and Text
for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pair and SynthesizedAudio-text pair, we propose a complementary joint training(CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.