One TTS Alignment To Rule Them All
- URL: http://arxiv.org/abs/2108.10447v1
- Date: Mon, 23 Aug 2021 23:45:48 GMT
- Title: One TTS Alignment To Rule Them All
- Authors: Rohan Badlani, Adrian {\L}ancucki, Kevin J. Shih, Rafael Valle, Wei
Ping, Bryan Catanzaro
- Abstract summary: Speech-to-text alignment is a critical component of neural textto-speech (TTS) models.
In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework.
The framework combines forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior.
- Score: 26.355019468082247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-to-text alignment is a critical component of neural textto-speech
(TTS) models. Autoregressive TTS models typically use an attention mechanism to
learn these alignments on-line. However, these alignments tend to be brittle
and often fail to generalize to long utterances and out-of-domain text, leading
to missing or repeating words. Most non-autoregressive endto-end TTS models
rely on durations extracted from external sources. In this paper we leverage
the alignment mechanism proposed in RAD-TTS as a generic alignment learning
framework, easily applicable to a variety of neural TTS models. The framework
combines forward-sum algorithm, the Viterbi algorithm, and a simple and
efficient static prior. In our experiments, the alignment learning framework
improves all tested TTS architectures, both autoregressive (Flowtron, Tacotron
2) and non-autoregressive (FastPitch, FastSpeech 2, RAD-TTS). Specifically, it
improves alignment convergence speed of existing attention-based mechanisms,
simplifies the training pipeline, and makes the models more robust to errors on
long utterances. Most importantly, the framework improves the perceived speech
synthesis quality, as judged by human evaluators.
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard- Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech [9.982121768809854]
We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
arXiv Detail & Related papers (2024-10-29T16:17:01Z) - SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z) - Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic
Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z) - Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic
Token Prediction [14.661123738628772]
We introduce a text-to-speech(TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints.
arXiv Detail & Related papers (2023-11-06T06:13:39Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale
Text-to-Image Synthesis [54.39789900854696]
StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
It significantly improves over previous GANs and outperforms distilled diffusion models in terms of sample quality and speed.
arXiv Detail & Related papers (2023-01-23T16:05:45Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech
Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech synthesis (TTS) is one of such successful implementations.
We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
arXiv Detail & Related papers (2020-10-23T14:14:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.