Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
- URL: http://arxiv.org/abs/2410.22179v1
- Date: Tue, 29 Oct 2024 16:17:01 GMT
- Title: Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
- Authors: Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao
- Abstract summary: We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
- Score: 9.982121768809854
- Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
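To make the relative-location idea concrete, the sketch below shows single-head cross-attention whose logits are biased by the distance between each text position and a scalar alignment position. This is a minimal illustration under assumptions, not the paper's implementation: the Gaussian-shaped bias, the names `relative_location_bias` and `biased_cross_attention`, and the single-head, single-step setup are all illustrative. In the paper the alignment position is a latent quantity learned via backprop and used across interleaved multi-head self- and cross-attention layers.

```python
# Minimal NumPy sketch (not the authors' implementation) of cross-attention whose
# logits receive relative location information: the distance between each encoder
# (text) position and a scalar alignment position that the decoder advances
# monotonically. The Gaussian-shaped bias and all names are illustrative assumptions.
import numpy as np

def relative_location_bias(alignment_pos, num_text_positions, width=3.0):
    """Bias favoring text positions near the current alignment position."""
    positions = np.arange(num_text_positions)
    rel = positions - alignment_pos            # relative location information
    return -(rel ** 2) / (2.0 * width ** 2)    # peaked at the alignment position

def biased_cross_attention(query, keys, values, alignment_pos):
    """Single-head cross-attention with a relative-location bias added to the logits."""
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d)                        # (T_text,)
    logits += relative_location_bias(alignment_pos, len(keys))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                                   # softmax over text positions
    return weights @ values, weights

# Toy usage: one decoder step attending over 6 text positions.
rng = np.random.default_rng(0)
T_text, d_model = 6, 8
keys = rng.normal(size=(T_text, d_model))
values = rng.normal(size=(T_text, d_model))
query = rng.normal(size=(d_model,))

context, weights = biased_cross_attention(query, keys, values, alignment_pos=2.4)
print(np.round(weights, 3))   # attention mass concentrated near text position ~2
```

Because the bias depends only on distances from the alignment position, not on absolute indices, the same mechanism applies unchanged to utterances longer than any seen in training, which is the intuition behind the unbounded length generalization claimed above.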
Related papers
- Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [19.48653924804823]
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers.
However, LLM-based TTS models are not robust, as the generated output can contain repeated words, missing words, and misaligned speech.
We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text.
arXiv Detail & Related papers (2024-06-25T22:18:52Z)
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms combined with the prediction of the total length of the speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis [7.865191493201841]
Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning.
We propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues.
Our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size.
arXiv Detail & Related papers (2024-06-06T19:48:17Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models [86.75729087623259]
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new type of self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better syntactic generalization and are 3-5x more sample-efficient.
arXiv Detail & Related papers (2023-10-29T17:27:18Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition [26.79184118279807]
We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR.
Word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs.
We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
arXiv Detail & Related papers (2023-04-15T18:34:29Z)
- Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss [71.30589161727967]
We introduce Regotron, a regularized version of Tacotron2, which aims to alleviate the training issues and at the same time produce monotonic alignments.
Our method augments the vanilla Tacotron2 objective function with an additional term, which penalizes non-monotonic alignments in the location-sensitive attention mechanism (a toy sketch of such a penalty appears after this list).
arXiv Detail & Related papers (2022-04-28T12:08:53Z)
- One TTS Alignment To Rule Them All [26.355019468082247]
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models.
In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework.
The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior.
arXiv Detail & Related papers (2021-08-23T23:45:48Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables when paired with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
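Referenced from the Regotron entry above, the sketch below shows one way a monotonicity penalty on attention can be written: the expected attended text position should not move backwards across decoder steps. This is an assumed formulation for illustration only, not Regotron's exact loss term; the name `monotonicity_penalty` and the expected-position construction are illustrative assumptions.

```python
# Toy sketch (assumed form, not Regotron's exact objective) of a penalty that
# discourages non-monotonic attention: the expected attended text position
# should never move backwards as decoder steps advance.
import numpy as np

def monotonicity_penalty(attention, margin=0.0):
    """attention: (T_dec, T_text); each row is a softmax over text positions."""
    positions = np.arange(attention.shape[1])
    expected = attention @ positions          # expected text position per decoder step
    backwards = np.maximum(0.0, expected[:-1] - expected[1:] - margin)
    return backwards.sum()                    # zero when the alignment only moves forward

# Usage: a mostly monotonic alignment with one backward jump at the last step.
att = np.full((4, 5), 1e-3)
for t, p in enumerate([0, 1, 3, 2]):          # last step jumps back from position 3 to 2
    att[t, p] = 1.0
att /= att.sum(axis=1, keepdims=True)
print(round(float(monotonicity_penalty(att)), 3))   # > 0 because of the backward jump
```

Adding such a term to the usual reconstruction loss only requires that the attention weights be differentiable, so it can be applied to location-sensitive attention without any external alignment supervision.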