ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in
Text-to-Speech
- URL: http://arxiv.org/abs/2202.07816v1
- Date: Wed, 16 Feb 2022 01:42:32 GMT
- Title: ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in
Text-to-Speech
- Authors: Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie
Yan, Zhou Zhao
- Abstract summary: We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into the latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence; the predictor is pre-trained on large-scale text and low-quality speech data and fine-tuned on the high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
- Score: 96.0009517132463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressive text-to-speech (TTS) has become a hot research topic recently,
mainly focusing on modeling prosody in speech. Prosody modeling has several
challenges: 1) the pitch extracted in previous prosody modeling works has
inevitable errors, which hurt prosody modeling; 2) different attributes of
prosody (e.g., pitch, duration and energy) are dependent on each other and
produce the natural prosody together; and 3) due to high variability of prosody
and the limited amount of high-quality data for TTS training, the distribution
of prosody cannot be fully shaped. To tackle these issues, we propose
ProsoSpeech, which enhances the prosody using quantized latent vectors
pre-trained on large-scale unpaired and low-quality text and speech data.
Specifically, we first introduce a word-level prosody encoder, which quantizes
the low-frequency band of the speech and compresses prosody attributes into the
latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts
LPV given word sequence. We pre-train the LPV predictor on large-scale text and
low-quality speech data and fine-tune it on the high-quality TTS dataset.
Finally, our model can generate expressive speech conditioned on the predicted
LPV. Experimental results show that ProsoSpeech can generate speech with richer
prosody compared with baseline methods.
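The word-level prosody encoder can be pictured as follows. This is a minimal, illustrative PyTorch sketch, not the paper's actual implementation: class and parameter names such as WordLevelProsodyEncoder, n_low_bins, and codebook_size are assumptions, and the real model also needs VQ training losses (e.g., a commitment loss) and the autoregressive LPV predictor.

```python
import torch
import torch.nn as nn

class WordLevelProsodyEncoder(nn.Module):
    """Sketch: summarize the low-frequency mel band within each word span,
    then snap each summary to its nearest codebook entry (a discrete LPV)."""
    def __init__(self, n_low_bins=20, hidden=256, codebook_size=128):
        super().__init__()
        self.rnn = nn.GRU(n_low_bins, hidden, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, hidden)  # LPV codebook

    def forward(self, low_band_mel, word_boundaries):
        # low_band_mel: (T, n_low_bins) low-frequency mel frames
        # word_boundaries: list of (start_frame, end_frame), one per word
        h, _ = self.rnn(low_band_mel.unsqueeze(0))
        h = h.squeeze(0)                                     # (T, hidden)
        lpv_ids = []
        for s, e in word_boundaries:
            w = h[s:e].mean(dim=0)                           # word-level summary
            dist = torch.cdist(w[None], self.codebook.weight)
            lpv_ids.append(dist.argmin().item())             # nearest LPV code
        return lpv_ids  # one discrete prosody token per word
```

The LPV predictor would then be a sequence model that maps word/text embeddings to these discrete tokens, pre-trained on the large-scale low-quality data and fine-tuned on the TTS corpus, as described above.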
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
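A rough sketch of the factorized quantization idea mentioned above: project each encoder frame into separate attribute subspaces and quantize each against its own codebook. Subspace names and sizes here are assumptions; the actual codec is considerably more involved.

```python
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    """Sketch: one projection and one codebook per speech attribute, so each
    attribute gets its own discrete token stream."""
    def __init__(self, dim=256, sub_dim=64, codebook_size=256):
        super().__init__()
        self.subspaces = ["content", "prosody", "timbre", "detail"]
        self.proj = nn.ModuleDict(
            {k: nn.Linear(dim, sub_dim) for k in self.subspaces})
        self.codebooks = nn.ModuleDict(
            {k: nn.Embedding(codebook_size, sub_dim) for k in self.subspaces})

    def forward(self, z):                     # z: (T, dim) encoder frames
        codes = {}
        for k in self.subspaces:
            q = self.proj[k](z)               # (T, sub_dim) subspace embedding
            dist = torch.cdist(q, self.codebooks[k].weight)
            codes[k] = dist.argmin(dim=-1)    # (T,) tokens for this attribute
        return codes
```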
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- DPP-TTS: Diversifying prosodic features of speech via determinantal point processes [16.461724709212863]
We propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a prosody diversifying module.
Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity in each sample and among multiple samples.
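As a hedged illustration of how a DPP favors mutually diverse samples, here is a generic greedy log-determinant selection over candidate prosody feature vectors; this is a textbook DPP MAP approximation, not DPP-TTS's actual prosody diversifying module.

```python
import numpy as np

def greedy_dpp_select(features, k):
    """Sketch: greedily pick k mutually diverse candidates by maximizing the
    log-determinant of the DPP kernel L = F F^T over the selected subset."""
    L = features @ features.T                 # similarity kernel over candidates
    selected, remaining = [], list(range(len(features)))
    while len(selected) < k and remaining:
        best, best_gain = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx))
            gain = np.linalg.slogdet(sub)[1]  # log-det rewards diversity
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected
```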
arXiv Detail & Related papers (2023-10-23T07:59:46Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
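Residual vector quantization, the mechanism such neural audio codecs use to discretize latents, can be sketched as follows; codebook sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Sketch: quantize a latent in stages; each codebook encodes the residual
    left over by the previous stage, progressively refining the approximation."""
    def __init__(self, dim=256, n_stages=8, codebook_size=1024):
        super().__init__()
        self.books = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_stages))

    def forward(self, z):                     # z: (T, dim) latent frames
        residual, quantized, codes = z, torch.zeros_like(z), []
        for book in self.books:
            dist = torch.cdist(residual, book.weight)
            ids = dist.argmin(dim=-1)         # nearest code per frame
            q = book(ids)
            quantized = quantized + q
            residual = residual - q           # next stage refines what is left
            codes.append(ids)
        return quantized, codes
```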
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
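The mask-predict style of non-autoregressive decoding referred to above can be sketched as follows. The model interface here is an assumption: a callable mapping a unit sequence to per-position logits.

```python
import torch

@torch.no_grad()
def mask_predict(model, length, mask_id, n_iters=4):
    """Sketch: start from a fully masked unit sequence, then iteratively
    re-mask and re-predict the least confident positions (linear schedule)."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for it in range(n_iters):
        logits = model(tokens)                      # (length, vocab) unit logits
        conf, preds = logits.softmax(-1).max(-1)    # per-position confidence
        tokens = preds.clone()                      # accept all predictions
        n_mask = int(length * (1 - (it + 1) / n_iters))
        if n_mask > 0:
            worst = conf.topk(n_mask, largest=False).indices
            tokens[worst] = mask_id                 # re-mask least confident
    return tokens
```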
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
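A minimal reparameterized-VAE skeleton of the kind such a system builds on is sketched below; the linear layers are stand-ins for the real posterior encoder and waveform decoder, and all names are assumptions.

```python
import torch
import torch.nn as nn

class WaveformVAE(nn.Module):
    """Sketch: a posterior encoder maps speech frames to a latent z via the
    reparameterization trick; a decoder reconstructs from z, with a KL term
    regularizing the latent (the text-conditioned prior is omitted here)."""
    def __init__(self, frame_dim=80, latent_dim=192):
        super().__init__()
        self.post = nn.Linear(frame_dim, 2 * latent_dim)  # q(z | speech)
        self.decoder = nn.Linear(latent_dim, frame_dim)   # stand-in decoder

    def forward(self, frames):                # frames: (T, frame_dim)
        mu, logvar = self.post(frames).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return recon, kl
```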
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data [20.132799566988826]
We propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 improve prosody, especially for structurally complex sentences.
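One plausible way to wire a BERT front-end into a FastSpeech 2-style acoustic model is to project word-level BERT embeddings and add them to the phoneme encoder outputs before the variance predictors; this sketch is an assumption about the fusion, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BertAugmentedFrontend(nn.Module):
    """Sketch: broadcast word-level contextual embeddings down to phoneme
    positions and inject them into the acoustic model's hidden states."""
    def __init__(self, phone_dim=256, bert_dim=768):
        super().__init__()
        self.proj = nn.Linear(bert_dim, phone_dim)

    def forward(self, phone_hidden, bert_word_emb, word_to_phone):
        # phone_hidden: (P, phone_dim) phoneme encoder outputs
        # bert_word_emb: (W, bert_dim) word embeddings from fine-tuned BERT
        # word_to_phone: (P,) index of the word each phoneme belongs to
        expanded = self.proj(bert_word_emb)[word_to_phone]  # (P, phone_dim)
        return phone_hidden + expanded        # linguistic features injected
```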
arXiv Detail & Related papers (2021-11-15T05:58:29Z)