NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level
Quality
- URL: http://arxiv.org/abs/2205.04421v2
- Date: Tue, 10 May 2022 15:25:20 GMT
- Title: NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level
Quality
- Authors: Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi
Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao,
Tie-Yan Liu
- Abstract summary: We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
- Score: 123.97136358092585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in
recent years. Some questions naturally arise: whether a TTS system can
achieve human-level quality, how to define and judge such quality, and how to
achieve it. In this paper, we answer these questions by first defining the
human-level quality based on the statistical significance of a subjective measure
and introducing appropriate guidelines to judge it, and then developing a TTS
system called NaturalSpeech that achieves human-level quality on a benchmark
dataset. Specifically, we leverage a variational autoencoder (VAE) for
end-to-end text to waveform generation, with several key modules to enhance the
capacity of the prior from text and reduce the complexity of the posterior from
speech, including phoneme pre-training, differentiable duration modeling,
bidirectional prior/posterior modeling, and a memory mechanism in VAE.
Experimental evaluations on the popular LJSpeech dataset show that our proposed
NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human
recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p
>> 0.05, which demonstrates no statistically significant difference from human
recordings for the first time on this dataset.
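As a rough illustration of the significance criterion described in the abstract (not the authors' evaluation code), the sketch below checks whether per-sentence CMOS judgments differ significantly from zero with a Wilcoxon signed-rank test; the score values are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): testing whether per-sentence
# comparative scores between synthesized and recorded speech differ
# significantly from zero, per the human-level-quality guideline above.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-sentence CMOS judgments (synthesized minus recording),
# averaged over raters; values near 0 mean "no perceived difference".
cmos_per_sentence = np.array([0.05, -0.10, 0.00, 0.02, -0.03, 0.01, -0.04, 0.06])

# Wilcoxon signed-rank test against the null hypothesis of zero median difference.
stat, p_value = wilcoxon(cmos_per_sentence)
print(f"CMOS = {cmos_per_sentence.mean():+.2f}, Wilcoxon p = {p_value:.3f}")

# Under the paper's guideline, p >> 0.05 means the synthesized speech is
# not statistically distinguishable from the human recordings.
if p_value > 0.05:
    print("No statistically significant difference detected.")
```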
Related papers
- TTSDS -- Text-to-Speech Distribution Score [9.380879437204277]
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech.
We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility.
We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score, computed as an unweighted average of these factors, strongly correlates with human evaluations (a minimal sketch of this scoring idea follows below).
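A minimal sketch of the unweighted-average scoring described above, correlated against subjective ratings with Spearman's rho; the factor names, scores, and ratings are hypothetical and this is not the TTSDS implementation.

```python
# Sketch only: combine per-factor scores into an unweighted average and
# correlate with human evaluations. All numbers are made-up placeholders.
import numpy as np
from scipy.stats import spearmanr

# Rows: TTS systems; columns: factor scores (e.g. prosody, speaker identity,
# intelligibility), each assumed to be normalized to a common scale.
factor_scores = np.array([
    [0.82, 0.75, 0.90],
    [0.64, 0.70, 0.85],
    [0.91, 0.88, 0.95],
    [0.55, 0.60, 0.70],
    [0.78, 0.72, 0.88],
])
human_mos = np.array([3.9, 3.4, 4.3, 3.0, 3.7])  # hypothetical subjective ratings

overall = factor_scores.mean(axis=1)   # unweighted average over factors
rho, p = spearmanr(overall, human_mos)
print(f"System scores: {overall.round(3)}, Spearman rho = {rho:.2f}")
```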
arXiv Detail & Related papers (2024-07-17T16:30:27Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
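To make the residual vector quantizers mentioned above concrete, here is a toy residual vector quantization (RVQ) sketch with random codebooks; it is an illustration of the general technique, not the NaturalSpeech 2 codec.

```python
# Toy residual vector quantization (RVQ): each stage quantizes the residual
# left over from the previous stage with its own codebook. Codebooks here are
# random; a real neural codec learns them jointly with an encoder/decoder.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, num_stages = 8, 16, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_stages)]

def rvq_encode(latent, codebooks):
    """Return per-stage code indices and the final quantized approximation."""
    residual = latent.copy()
    quantized = np.zeros_like(latent)
    indices = []
    for codebook in codebooks:
        # Nearest codebook entry to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        quantized += codebook[idx]
        residual -= codebook[idx]
    return indices, quantized

latent = rng.normal(size=dim)  # stand-in for one frame of codec latents
codes, approx = rvq_encode(latent, codebooks)
print("codes:", codes, "reconstruction error:", float(np.linalg.norm(latent - approx)))
```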
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in
Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
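A rough sketch (not ProsoSpeech's model) of the word-level quantization idea described above: pool the low-frequency mel bands over each word's frames and snap the result to the nearest entry of a prosody codebook. The shapes, band count, word spans, and codebook are all assumptions for illustration.

```python
# Illustrative sketch of word-level prosody quantization: average the
# low-frequency bands of a mel-spectrogram over each word's frames,
# then look up the nearest codebook vector as the latent prosody vector (LPV).
import numpy as np

rng = np.random.default_rng(1)
mel = rng.normal(size=(120, 80))             # (frames, mel bins), hypothetical
word_spans = [(0, 40), (40, 75), (75, 120)]  # frame ranges per word (assumed)
low_bands = 20                               # keep only the low-frequency bins
codebook = rng.normal(size=(64, low_bands))  # prosody codebook (toy, random)

lpv_indices = []
for start, end in word_spans:
    # Word-level prosody feature: mean of the low-frequency bands over the word.
    feat = mel[start:end, :low_bands].mean(axis=0)
    # Quantize to the nearest codebook entry (the word's LPV index).
    lpv_indices.append(int(np.argmin(np.linalg.norm(codebook - feat, axis=1))))

print("LPV indices per word:", lpv_indices)
```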
arXiv Detail & Related papers (2022-02-16T01:42:32Z) - SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)