Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias
- URL: http://arxiv.org/abs/2306.03509v1
- Date: Tue, 6 Jun 2023 08:54:49 GMT
- Title: Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias
- Authors: Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang,
Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
- Abstract summary: Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
- Score: 71.94109664001952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling text-to-speech to a large and wild dataset has been proven to be
highly effective in achieving timbre and speech style generalization,
particularly in zero-shot TTS. However, previous works usually encode speech
into latents using an audio codec and use autoregressive language models or
diffusion models to generate them, which ignores the intrinsic nature of speech
and may lead to inferior or uncontrollable results. We argue that speech can be
decomposed into several attributes (e.g., content, timbre, prosody, and phase)
and each of them should be modeled using a module with appropriate inductive
biases. From this perspective, we carefully design a novel and large zero-shot
TTS system called Mega-TTS, which is trained with large-scale wild data and
models different attributes in different ways: 1) Instead of using latents
encoded by an audio codec as the intermediate feature, we choose the spectrogram,
as it separates phase from the other attributes very well; the phase can be
appropriately reconstructed by a GAN-based vocoder and does not need to be
modeled by the language model. 2) We model timbre using global vectors
since timbre is a global attribute that changes slowly over time. 3) We further
use a VQGAN-based acoustic model to generate the spectrogram and a latent code
language model to fit the distribution of prosody, since prosody changes
quickly over time in a sentence, and language models can capture both local and
long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K
hours of speech and evaluate its performance on unseen speakers. Experimental
results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on
zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior
naturalness, robustness, and speaker similarity due to the proper inductive
bias of each module. Audio samples are available at
https://mega-tts.github.io/demo-page.
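The decomposition described in the abstract maps naturally onto separate modules. Below is a minimal, hypothetical sketch of that layout in PyTorch; all class names, dimensions, and interfaces are illustrative assumptions rather than the authors' released implementation, and the VQGAN acoustic model and GAN vocoder are reduced to simple placeholders.
```python
# Hypothetical sketch of the attribute-decomposed design described above.
# Names, dimensions, and interfaces are assumptions for illustration only.
import torch
import torch.nn as nn


class TimbreEncoder(nn.Module):
    """Collapses a reference mel-spectrogram into one global vector,
    reflecting the view that timbre is a slowly changing, global attribute."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        return self.proj(ref_mel).mean(dim=1)                  # (B, dim)


class ProsodyLanguageModel(nn.Module):
    """Autoregressive model over discrete prosody codes (e.g. a VQ codebook),
    standing in for the latent-code language model that fits the prosody
    distribution and captures local and long-range dependencies."""
    def __init__(self, n_codes: int = 1024, dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(n_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, prev_codes: torch.Tensor) -> torch.Tensor:  # (B, T)
        return self.head(self.backbone(self.embed(prev_codes)))   # next-code logits


class MegaTTSSketch(nn.Module):
    """Fuses content, prosody, and a global timbre vector into a
    mel-spectrogram (a stand-in for the VQGAN-based acoustic model).
    Phase is intentionally absent: it is left to a separate GAN vocoder."""
    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.timbre_enc = TimbreEncoder(n_mels, dim)
        self.prosody_lm = ProsodyLanguageModel(dim=dim)
        self.spec_decoder = nn.Linear(3 * dim, n_mels)

    def forward(self, content, prosody, ref_mel):  # content/prosody: (B, T, dim)
        # At inference, prosody codes would be sampled from self.prosody_lm
        # and embedded into `prosody` (not shown here).
        timbre = self.timbre_enc(ref_mel)          # one vector per utterance
        timbre = timbre.unsqueeze(1).expand_as(content)
        feats = torch.cat([content, prosody, timbre], dim=-1)
        return self.spec_decoder(feats)            # predicted mel-spectrogram
```
The point of the sketch is the division of labor: only content and prosody are treated as sequences, timbre is a single vector per utterance, and phase never enters the model at all, which mirrors the inductive-bias argument made in the abstract.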
Related papers
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms along with the prediction of the total length of speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)