FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- URL: http://arxiv.org/abs/2006.04558v8
- Date: Mon, 8 Aug 2022 01:53:05 GMT
- Title: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- Authors: Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
- Abstract summary: Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, and 2) the duration extracted from the teacher model is not accurate enough, while the target mel-spectrograms distilled from the teacher model suffer from information loss.
We propose FastSpeech 2, which 1) directly trains the model with ground-truth targets instead of the simplified outputs from the teacher, and 2) introduces more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
- Score: 189.05831125931053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive text to speech (TTS) models such as FastSpeech can
synthesize speech significantly faster than previous autoregressive models with
comparable quality. The training of FastSpeech model relies on an
autoregressive teacher model for duration prediction (to provide more
information as input) and knowledge distillation (to simplify the data
distribution in output), which can ease the one-to-many mapping problem (i.e.,
multiple speech variations correspond to the same text) in TTS. However,
FastSpeech has several disadvantages: 1) the teacher-student distillation
pipeline is complicated and time-consuming, 2) the duration extracted from the
teacher model is not accurate enough, and the target mel-spectrograms distilled
from the teacher model suffer from information loss due to data simplification,
both of which limit the voice quality. In this paper, we propose FastSpeech 2,
which addresses the issues in FastSpeech and better solves the one-to-many
mapping problem in TTS by 1) directly training the model with the ground-truth
target instead of the simplified output from the teacher, and 2) introducing more
variation information of speech (e.g., pitch, energy and more accurate
duration) as conditional inputs. Specifically, we extract duration, pitch and
energy from speech waveform and directly take them as conditional inputs in
training and use predicted values in inference. We further design FastSpeech
2s, which is the first attempt to directly generate speech waveform from text
in parallel, enjoying the benefit of fully end-to-end inference. Experimental
results show that 1) FastSpeech 2 achieves a 3x training speed-up over
FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech
2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even
surpass autoregressive models. Audio samples are available at
https://speechresearch.github.io/fastspeech2/.
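To make the conditioning scheme in the abstract concrete, below is a minimal PyTorch sketch of a FastSpeech 2-style variance adaptor: ground-truth duration, pitch, and energy condition the model during training, and the predictors' own outputs take their place at inference. This is a hedged reconstruction from the abstract, not the authors' code; the linear predictors (conv stacks in the paper), the log-domain duration target, the bin ranges, and the bucketize-and-embed treatment of pitch/energy are all assumptions.
```python
import torch
import torch.nn as nn

class VarianceAdaptor(nn.Module):
    """Minimal FastSpeech 2-style variance adaptor (sketch, not the paper's code)."""

    def __init__(self, d_model=256, n_bins=256,
                 pitch_range=(0.0, 800.0), energy_range=(0.0, 200.0)):
        super().__init__()
        # The paper uses small conv stacks for the predictors; single
        # linear layers stand in for them here for brevity.
        self.duration_predictor = nn.Linear(d_model, 1)  # predicts log-duration
        self.pitch_predictor = nn.Linear(d_model, 1)
        self.energy_predictor = nn.Linear(d_model, 1)
        # Scalar pitch/energy values are bucketized and embedded, then
        # added to the hidden sequence as conditional inputs.
        self.register_buffer("pitch_bins", torch.linspace(*pitch_range, n_bins - 1))
        self.register_buffer("energy_bins", torch.linspace(*energy_range, n_bins - 1))
        self.pitch_embedding = nn.Embedding(n_bins, d_model)
        self.energy_embedding = nn.Embedding(n_bins, d_model)

    @staticmethod
    def length_regulate(x, durations):
        # Expand each phoneme hidden state to its duration in mel frames.
        expanded = [h.repeat_interleave(d, dim=0) for h, d in zip(x, durations)]
        return nn.utils.rnn.pad_sequence(expanded, batch_first=True)

    def forward(self, x, gt_duration=None, gt_pitch=None, gt_energy=None):
        # x: (batch, phonemes, d_model) encoder output.
        log_dur_pred = self.duration_predictor(x).squeeze(-1)
        if gt_duration is not None:   # training: ground-truth durations
            duration = gt_duration
        else:                         # inference: predicted durations
            duration = torch.clamp(torch.round(torch.exp(log_dur_pred) - 1), min=0).long()
        x = self.length_regulate(x, duration)  # now (batch, frames, d_model)

        pitch_pred = self.pitch_predictor(x).squeeze(-1)
        energy_pred = self.energy_predictor(x).squeeze(-1)
        pitch = gt_pitch if gt_pitch is not None else pitch_pred
        energy = gt_energy if gt_energy is not None else energy_pred
        x = x + self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        x = x + self.energy_embedding(torch.bucketize(energy, self.energy_bins))
        # Predictions are returned so ground-truth values can supervise
        # them with regression losses during training.
        return x, log_dur_pred, pitch_pred, energy_pred
```
In training, the ground-truth values would also supervise the three predictors (e.g., with MSE losses on log_dur_pred, pitch_pred, and energy_pred), which is what replaces the teacher-student distillation pipeline.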
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - FlashSpeech: Efficient Zero-Shot Speech Synthesis [37.883762387219676]
FlashSpeech is a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work.
We show that FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity.
arXiv Detail & Related papers (2024-04-23T02:57:46Z) - DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z) - Joint Pre-Training with Speech and Bilingual Text for Direct Speech to
Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
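The masking-and-prediction loop this entry describes follows the general mask-predict recipe for non-autoregressive decoding. Below is a minimal generic sketch of such a loop, not TranSpeech's actual code; the model interface, the linear re-masking schedule, and all names are assumptions.
```python
import torch

def mask_predict_decode(model, src, tgt_len, mask_id, n_iters=10):
    """Generic mask-predict decoding loop over discrete units (sketch)."""
    batch = src.size(0)
    # Start from a fully masked target sequence of speech units.
    units = torch.full((batch, tgt_len), mask_id, dtype=torch.long, device=src.device)
    probs = torch.zeros(batch, tgt_len, device=src.device)
    for t in range(n_iters):
        # Hypothetical interface: logits over the unit vocabulary, (B, T, V).
        logits = model(src, units)
        new_probs, new_units = logits.softmax(dim=-1).max(dim=-1)
        # Only positions that are currently masked get re-predicted.
        masked = units.eq(mask_id)
        units = torch.where(masked, new_units, units)
        probs = torch.where(masked, new_probs, probs)
        # Linear schedule: re-mask a shrinking number of low-confidence units.
        n_mask = (tgt_len * (n_iters - 1 - t)) // n_iters
        if n_mask == 0:
            break
        low_conf = probs.topk(n_mask, dim=-1, largest=False).indices
        units.scatter_(-1, low_conf, mask_id)
        probs.scatter_(-1, low_conf, 0.0)
    return units
```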
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.