DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard
Challenge 2021
- URL: http://arxiv.org/abs/2110.12612v1
- Date: Mon, 25 Oct 2021 02:47:59 GMT
- Title: DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard
Challenge 2021
- Authors: Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan,
Jinzhu Li, Lei He, Sheng Zhao
- Abstract summary: This paper describes the Microsoft end-to-end neural text to speech (TTS) system: DelightfulTTS for Blizzard Challenge 2021.
The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal from two perspectives.
- Score: 31.750875486806184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the Microsoft end-to-end neural text to speech (TTS)
system: DelightfulTTS for Blizzard Challenge 2021. The goal of this challenge
is to synthesize natural and high-quality speech from text, and we approach
this goal from two perspectives: the first is to directly model and generate the
waveform at a 48 kHz sampling rate, which brings higher perceptual quality than
previous systems with a 16 kHz or 24 kHz sampling rate; the second is to model
the variation information in speech through a systematic design, which improves
the prosody and naturalness. Specifically, for 48 kHz modeling, we predict a 16
kHz mel-spectrogram in the acoustic model, and propose a vocoder called HiFiNet
to directly generate a 48 kHz waveform from the predicted 16 kHz mel-spectrogram,
which better trades off training efficiency, modeling stability, and voice
quality. We model variation information systematically from both explicit
(speaker ID, language ID, pitch and duration) and implicit (utterance-level and
phoneme-level prosody) perspectives: 1) For speaker and language ID, we use
lookup embedding in training and inference; 2) For pitch and duration, we
extract the values from paired text-speech data in training and use two
predictors to predict the values in inference; 3) For utterance-level and
phoneme-level prosody, we use two reference encoders to extract the values in
training, and use two separate predictors to predict the values in inference.
Additionally, we introduce an improved Conformer block to better model the
local and global dependencies in the acoustic model. For task SH1, DelightfulTTS
achieves a mean score of 4.17 in the MOS test and 4.35 in the SMOS test, which
indicates the effectiveness of our proposed system.
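
The 48 kHz strategy above keeps the acoustic model at 16 kHz mel-spectrogram resolution and delegates bandwidth extension to the HiFiNet vocoder, whose internals are not given in this abstract. As a minimal sketch of the sample-rate arithmetic only, assuming a 12.5 ms hop between mel frames (a common TTS choice, not a value stated in the paper), each predicted mel frame must be expanded into 600 samples of 48 kHz audio:

```python
# Sample-rate arithmetic for a 16 kHz mel-spectrogram -> 48 kHz waveform vocoder.
# The 12.5 ms hop is an assumed value for illustration, not taken from the paper.
MEL_SAMPLE_RATE = 16_000      # rate at which mel-spectrograms are extracted/predicted
WAVE_SAMPLE_RATE = 48_000     # rate of the waveform the vocoder generates
HOP_SECONDS = 0.0125          # assumed hop between consecutive mel frames

hop_16k = round(MEL_SAMPLE_RATE * HOP_SECONDS)                 # 200 samples per frame at 16 kHz
samples_per_frame_48k = round(WAVE_SAMPLE_RATE * HOP_SECONDS)  # 600 samples per frame at 48 kHz
bandwidth_extension = WAVE_SAMPLE_RATE // MEL_SAMPLE_RATE      # 3x more samples than 16 kHz audio

# A GAN-style vocoder could realize the 600x expansion as cascaded transposed
# convolutions, e.g. 5 * 5 * 4 * 3 * 2 = 600 (an arbitrary factorization).
print(hop_16k, samples_per_frame_48k, bandwidth_extension)     # -> 200 600 3
```

Predicting mels at 16 kHz keeps the acoustic model's output sequence short and stable, while the vocoder absorbs the extra 3x upsampling needed for 48 kHz output, which is consistent with the trade-off the abstract describes.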
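
The variation-information design combines explicit factors (speaker ID, language ID, pitch, duration) with implicit ones (utterance-level and phoneme-level prosody). Below is an illustrative PyTorch-style sketch of that pattern, not the authors' code: all module names and dimensions (e.g. `VarianceAdaptor`, `hidden_dim=256`) are assumptions. Lookup embeddings serve speaker and language ID in both training and inference; pitch, duration, and the two prosody vectors use ground-truth or reference-encoder values in training and predictor outputs in inference.

```python
import torch
import torch.nn as nn


class VarianceAdaptor(nn.Module):
    """Illustrative sketch of explicit + implicit variation modeling;
    not the DelightfulTTS implementation, and all sizes are assumptions."""

    def __init__(self, hidden_dim=256, n_speakers=128, n_languages=8):
        super().__init__()
        # Explicit factors: lookup embeddings, used in both training and inference.
        self.speaker_emb = nn.Embedding(n_speakers, hidden_dim)
        self.language_emb = nn.Embedding(n_languages, hidden_dim)
        # Explicit factors: predictors for pitch and duration, used at inference;
        # training uses values extracted from paired text-speech data.
        self.pitch_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.duration_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.pitch_proj = nn.Linear(1, hidden_dim)
        # Implicit factors: predictors that stand in for the reference encoders at inference.
        self.utt_prosody_predictor = nn.Linear(hidden_dim, hidden_dim)
        self.phone_prosody_predictor = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, phone_hidden, speaker_id, language_id,
                gt_pitch=None, ref_utt_prosody=None, ref_phone_prosody=None):
        # phone_hidden: (batch, n_phones, hidden_dim) from the text encoder.
        x = phone_hidden
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)    # broadcast over phonemes
        x = x + self.language_emb(language_id).unsqueeze(1)

        # Pitch: ground-truth values in training, predictor output in inference.
        pitch = gt_pitch if gt_pitch is not None else self.pitch_predictor(x)
        x = x + self.pitch_proj(pitch)

        # Utterance-level prosody: one vector per utterance, from a reference
        # encoder in training and from a predictor in inference.
        utt = ref_utt_prosody if ref_utt_prosody is not None \
            else self.utt_prosody_predictor(x.mean(dim=1))
        x = x + utt.unsqueeze(1)

        # Phoneme-level prosody: one vector per phoneme, same train/infer switch.
        phone = ref_phone_prosody if ref_phone_prosody is not None \
            else self.phone_prosody_predictor(x)
        x = x + phone

        # Predicted log-durations feed a length regulator (not shown) that
        # expands phoneme-level features to frame level for mel prediction.
        log_duration = self.duration_predictor(x)
        return x, log_duration


# Example usage with random inputs (inference mode: no ground-truth values given).
adaptor = VarianceAdaptor()
hidden = torch.randn(2, 37, 256)                 # (batch, n_phones, hidden_dim)
out, log_dur = adaptor(hidden, torch.tensor([0, 3]), torch.tensor([0, 1]))
```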
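
The abstract also mentions an improved Conformer block for modeling local and global dependencies in the acoustic model; the specific improvements are not described here, so the following sketch shows only a standard Conformer block (macaron feed-forward layers around self-attention and a depthwise-convolution module) as a reference point, not the paper's variant.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv -> GLU -> depthwise conv -> pointwise conv."""

    def __init__(self, dim, kernel_size=7, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                   # x: (batch, time, dim)
        x = self.norm(x).transpose(1, 2)                    # -> (batch, dim, time)
        x = nn.functional.glu(self.pointwise1(x), dim=1)    # gated pointwise expansion
        x = nn.functional.silu(self.bn(self.depthwise(x)))  # local context via depthwise conv
        x = self.dropout(self.pointwise2(x))
        return x.transpose(1, 2)                            # -> (batch, time, dim)


class ConformerBlock(nn.Module):
    """Standard Conformer block (macaron feed-forward layers around self-attention
    and a convolution module); a generic reference, not the paper's improved variant."""

    def __init__(self, dim=256, heads=4, kernel_size=7, ff_mult=4, dropout=0.1):
        super().__init__()
        def ff():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                                 nn.Dropout(dropout), nn.Linear(ff_mult * dim, dim), nn.Dropout(dropout))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, kernel_size, dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                           # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # global dependency (self-attention)
        x = x + self.conv(x)                                # local dependency (depthwise conv)
        x = x + 0.5 * self.ff2(x)                           # second half-step feed-forward
        return self.final_norm(x)


# Example: a small stack over a phoneme-level hidden sequence.
encoder = nn.Sequential(*[ConformerBlock(dim=256) for _ in range(4)])
out = encoder(torch.randn(2, 37, 256))                      # shape preserved: (2, 37, 256)
```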
Related papers
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series
Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)
- A Comparison of Discrete Latent Variable Models for Speech
Representation Learning [46.52258734975676]
This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal.
Results show that future time-step prediction with vq-wav2vec achieves better performance.
arXiv Detail & Related papers (2020-10-24T01:22:14Z)
- Vector-quantized neural networks for acoustic unit discovery in the
ZeroSpeech 2020 challenge [26.114011076658237]
We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC).
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
arXiv Detail & Related papers (2020-05-19T13:06:17Z)