NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers
- URL: http://arxiv.org/abs/2304.09116v3
- Date: Tue, 30 May 2023 16:09:10 GMT
- Title: NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers
- Authors: Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao
Qin, Sheng Zhao, Jiang Bian
- Abstract summary: We develop NaturalSpeech 2, a TTS system that leverages a neural audio predictor with residual vectorizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
- Score: 90.83782600932567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild
datasets is important to capture the diversity in human speech such as speaker
identities, prosodies, and styles (e.g., singing). Current large TTS systems
usually quantize speech into discrete tokens and use language models to
generate these tokens one by one, which suffer from unstable prosody, word
skipping/repeating issue, and poor voice quality. In this paper, we develop
NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual
vector quantizers to get the quantized latent vectors and uses a diffusion
model to generate these latent vectors conditioned on text input. To enhance
the zero-shot capability that is important to achieve diverse speech synthesis,
we design a speech prompting mechanism to facilitate in-context learning in the
diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to
large-scale datasets with 44K hours of speech and singing data and evaluate its
voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS
systems by a large margin in terms of prosody/timbre similarity, robustness,
and voice quality in a zero-shot setting, and performs novel zero-shot singing
synthesis with only a speech prompt. Audio samples are available at
https://speechresearch.github.io/naturalspeech2.
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.