ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to
Speech
- URL: http://arxiv.org/abs/2212.14518v1
- Date: Fri, 30 Dec 2022 02:31:35 GMT
- Title: ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to
Speech
- Authors: Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan,
Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic
- Abstract summary: DDPMs are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
- Score: 37.29193613404699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Denoising Diffusion Probabilistic Models (DDPMs) are emerging in
text-to-speech (TTS) synthesis because of their strong capability of generating
high-fidelity samples. However, their iterative refinement process in
high-dimensional data space results in slow inference speed, which restricts
their application in real-time systems. Previous works have explored speeding
up inference by minimizing the number of inference steps, but at the cost of
sample quality. In this work, to improve the inference speed of DDPM-based TTS models
while achieving high sample quality, we propose ResGrad, a lightweight
diffusion model which learns to refine the output spectrogram of an existing
TTS model (e.g., FastSpeech 2) by predicting the residual between the model
output and the corresponding ground-truth speech. ResGrad has several
advantages: 1) Compared with other acceleration methods for DDPMs, which need to
synthesize speech from scratch, ResGrad reduces the complexity of the task by
changing the generation target from the ground-truth mel-spectrogram to the
residual, resulting in a more lightweight model and thus a smaller real-time
factor. 2) ResGrad is employed in the inference process of the existing TTS
model in a plug-and-play way, without re-training this model. We verify ResGrad
on the single-speaker dataset LJSpeech and two more challenging datasets with
multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental
results show that, compared with other DDPM speed-up methods: 1) ResGrad
achieves better sample quality at the same inference speed, as measured by the
real-time factor; and 2) at similar speech quality, ResGrad synthesizes speech
more than 10 times faster than baseline methods. Audio samples are available
at https://resgrad1.github.io/.
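To make the method summarized above concrete, the following is a minimal, self-contained PyTorch sketch of residual-diffusion refinement: a small denoiser is trained on the residual between a ground-truth mel-spectrogram and the output of an existing TTS model (e.g., FastSpeech 2), and at inference the sampled residual is simply added back to the coarse output. The network `ResidualDenoiser`, the 50-step schedule, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of residual-diffusion refinement in the spirit of ResGrad.
# Hypothetical architecture and hyperparameters; not the authors' implementation.
import torch
import torch.nn as nn

T = 50                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.05, T)    # noise schedule (assumed)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class ResidualDenoiser(nn.Module):
    """Tiny conv net predicting the noise added to the residual,
    conditioned on the coarse mel-spectrogram (hypothetical architecture)."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * n_mels + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, n_mels, 3, padding=1),
        )
    def forward(self, r_t, mel_coarse, t):
        # broadcast the normalized step index as an extra channel
        t_emb = (t.float() / T).view(-1, 1, 1).expand(-1, 1, r_t.size(-1))
        return self.net(torch.cat([r_t, mel_coarse, t_emb], dim=1))

def training_step(model, mel_coarse, mel_gt):
    """One DDPM training step on the residual instead of the full spectrogram."""
    r0 = mel_gt - mel_coarse                         # generation target = residual
    t = torch.randint(0, T, (r0.size(0),))
    noise = torch.randn_like(r0)
    a_bar = alpha_bar[t].view(-1, 1, 1)
    r_t = a_bar.sqrt() * r0 + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(r_t, mel_coarse, t), noise)

@torch.no_grad()
def refine(model, mel_coarse):
    """Plug-and-play inference: sample a residual and add it to the coarse mel."""
    r = torch.randn_like(mel_coarse)
    for t in reversed(range(T)):
        t_b = torch.full((mel_coarse.size(0),), t, dtype=torch.long)
        eps = model(r, mel_coarse, t_b)
        a, a_bar = alphas[t], alpha_bar[t]
        r = (r - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
        if t > 0:
            r = r + betas[t].sqrt() * torch.randn_like(r)
    return mel_coarse + r                            # refined mel-spectrogram
```

Since the residual carries far less information than a full spectrogram, a small network and a short schedule can plausibly suffice, which is the intuition behind the smaller real-time factor claimed in the abstract.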
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
However, it suffers from slow generation speed because it requires a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
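To make the dual-discriminator setup described above more concrete, here is a hedged sketch of how a spectrogram discriminator and a reverse-process (diffusion) discriminator could be combined with a denoising loss. The architectures, the hinge-loss formulation, and the weight `lam` are assumptions for illustration, not details taken from that paper.

```python
# Hypothetical sketch of adversarial training with two discriminators
# alongside a denoising diffusion generator (formulation assumed).
import torch
import torch.nn as nn

class SpectrogramDiscriminator(nn.Module):
    """Scores whether a final mel-spectrogram looks real (assumed architecture)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 3, padding=1),
        )
    def forward(self, mel):
        return self.net(mel).mean(dim=(1, 2))        # one score per utterance

class DiffusionDiscriminator(nn.Module):
    """Scores whether a reverse-process transition (x_t, x_{t-1}) is real."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * n_mels, 128, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 3, padding=1),
        )
    def forward(self, x_t, x_prev):
        return self.net(torch.cat([x_t, x_prev], dim=1)).mean(dim=(1, 2))

def generator_loss(diff_loss, d_spec_fake, d_diff_fake, lam=0.1):
    """Denoising (diffusion) loss plus adversarial terms from both discriminators."""
    adv = -(d_spec_fake.mean() + d_diff_fake.mean())
    return diff_loss + lam * adv

def discriminator_loss(d_real, d_fake):
    """Hinge loss applied to each discriminator (assumed choice)."""
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
```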
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously achieve fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data, avoiding significant quality degradation when sampling is accelerated.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
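The clean-data parameterization that the ProDiff summary refers to amounts to changing the regression target of the denoiser from the added noise to the clean spectrogram itself. A minimal sketch, with the `denoiser` network and noise schedule left as placeholders (everything below is illustrative, not the paper's code):

```python
# Sketch contrasting standard noise (epsilon) prediction with clean-data (x0)
# prediction as the regression target; `denoiser` and the schedule are placeholders.
import torch
import torch.nn.functional as F

def ddpm_corrupt(x0, t, alpha_bar):
    """Forward process q(x_t | x_0) for a batch of step indices t."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

def epsilon_prediction_loss(denoiser, x0, t, alpha_bar):
    x_t, noise = ddpm_corrupt(x0, t, alpha_bar)
    return F.mse_loss(denoiser(x_t, t), noise)   # regress the added noise

def x0_prediction_loss(denoiser, x0, t, alpha_bar):
    x_t, _ = ddpm_corrupt(x0, t, alpha_bar)
    return F.mse_loss(denoiser(x_t, t), x0)      # regress the clean target directly

# Toy usage with a trivial "denoiser" that echoes its input (illustration only).
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.05, 4), dim=0)
x0 = torch.randn(2, 80, 100)
t = torch.randint(0, 4, (2,))
print(x0_prediction_loss(lambda x, step: x, x0, t, alpha_bar))
```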
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech [14.231478930274058]
We propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis.
Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps.
We verify that Diff-TTS generates speech 28 times faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2021-04-03T13:53:19Z)
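Several of the results above are quoted as multiples of real time. For reference, the real-time factor (RTF) used in such comparisons is simply wall-clock synthesis time divided by the duration of the generated audio, so "N times faster than real time" corresponds to an RTF of about 1/N. The numbers below are made up purely for illustration, not measurements from any of the papers.

```python
# Real-time factor (RTF): wall-clock synthesis time divided by the duration of
# the generated audio (illustrative values only).
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(rtf(0.25, 7.0))  # ~0.036, i.e. roughly 28x faster than real time
print(1 / 24)          # "24x faster than real time" corresponds to an RTF of ~0.042
```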
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.