BASE TTS: Lessons from building a billion-parameter Text-to-Speech model
on 100K hours of data
- URL: http://arxiv.org/abs/2402.08093v2
- Date: Thu, 15 Feb 2024 18:57:26 GMT
- Title: BASE TTS: Lessons from building a billion-parameter Text-to-Speech model
on 100K hours of data
- Authors: Mateusz {\L}ajszczak, Guillermo C\'ambara, Yang Li, Fatih Beyhan,
Arent van Korlaar, Fan Yang, Arnaud Joly, \'Alvaro Mart\'in-Cortinas, Ammar
Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszy\'nska, Haohan
Guo, Bartosz Putrycz, Soledad L\'opez Gambino, Kayeon Yoo, Elena Sokolova,
Thomas Drugman
- Abstract summary: BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data.
We show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
- Score: 15.447206120523356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for
$\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with
$\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date,
trained on 100K hours of public domain speech data, achieving a new
state-of-the-art in speech naturalness. It deploys a 1-billion-parameter
autoregressive Transformer that converts raw texts into discrete codes
("speechcodes") followed by a convolution-based decoder which converts these
speechcodes into waveforms in an incremental, streamable manner. Further, our
speechcodes are built using a novel speech tokenization technique that features
speaker ID disentanglement and compression with byte-pair encoding. Echoing the
widely-reported "emergent abilities" of large language models when trained on
increasing volume of data, we show that BASE TTS variants built with 10K+ hours
and 500M+ parameters begin to demonstrate natural prosody on textually complex
sentences. We design and share a specialized dataset to measure these emergent
abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE
TTS by evaluating against baselines that include publicly available large-scale
text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated
by the model can be heard at https://amazon-ltts-paper.com/.
Related papers
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio-based TTS model to adapt context features with multiple enhancements.
Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer.
Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z) - TextrolSpeech: A Text Style Control Speech Corpus With Codec Language
Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with
Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z) - Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised
Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.