ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech
- URL: http://arxiv.org/abs/2207.06389v1
- Date: Wed, 13 Jul 2022 17:45:43 GMT
- Title: ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech
- Authors: Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren
- Abstract summary: We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data, avoiding the distinct quality degradation that comes with accelerated sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff achieves a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU.
- Score: 63.780196620966905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved
leading performances in many generative tasks. However, the cost of their
inherent iterative sampling process hinders their application to
text-to-speech deployment.
Through a preliminary study on diffusion model parameterization, we find that
previous gradient-based TTS models require hundreds or thousands of iterations
to guarantee high sample quality, which poses a challenge for accelerating
sampling. In this work, we propose ProDiff, a progressive fast diffusion model
for high-quality text-to-speech. Unlike previous work that estimates the
gradient of the data density, ProDiff parameterizes the denoising model by
directly predicting clean data, avoiding the distinct quality degradation that
comes with accelerated sampling. To tackle the model convergence challenge
with decreased diffusion iterations, ProDiff reduces the data variance on the
target side via knowledge
distillation. Specifically, the denoising model uses the generated
mel-spectrogram from an N-step DDIM teacher as the training target and distills
the behavior into a new model with N/2 steps. As such, it allows the TTS model
to make sharp predictions and further reduces the sampling time by orders of
magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to
synthesize high-fidelity mel-spectrograms, while it maintains sample quality
and diversity competitive with state-of-the-art models using hundreds of steps.
ProDiff achieves a sampling speed 24x faster than real time on a single
NVIDIA 2080Ti GPU, making diffusion models practically applicable to
text-to-speech synthesis deployment for the first time. Our extensive ablation
studies demonstrate that each design choice in ProDiff is effective, and we further
show that ProDiff can be easily extended to the multi-speaker setting. Audio
samples are available at \url{https://ProDiff.github.io/}.
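To make the two mechanisms in the abstract concrete, here is a minimal sketch of clean-data (x0) prediction combined with DDIM teacher distillation. All names, shapes, and the L1 loss are illustrative assumptions, not the authors' actual code; text conditioning is omitted for brevity.

```python
# Sketch: a student predicting clean data x0 is trained to match the result
# of two DDIM steps taken by a frozen teacher (one distillation round).
import torch
import torch.nn.functional as F

def q_sample(x0, abar_t, noise):
    # Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * noise

def ddim_step(x0_pred, x_t, abar_t, abar_s):
    # Deterministic DDIM update from time t to an earlier time s, expressed
    # through the model's clean-data prediction.
    eps = (x_t - abar_t.sqrt() * x0_pred) / (1 - abar_t).sqrt()
    return abar_s.sqrt() * x0_pred + (1 - abar_s).sqrt() * eps

def distill_loss(student, teacher, mel, t, abar):
    # `student` and `teacher` map (x_t, t) -> predicted clean mel x0.
    # `t` is a Python int with t >= 2; `abar` is a 1-D tensor of cumulative
    # alpha products.
    noise = torch.randn_like(mel)
    x_t = q_sample(mel, abar[t], noise)
    with torch.no_grad():
        # Two teacher DDIM steps: t -> t-1 -> t-2.
        x_tm1 = ddim_step(teacher(x_t, t), x_t, abar[t], abar[t - 1])
        x_tm2 = ddim_step(teacher(x_tm1, t - 1), x_tm1, abar[t - 1], abar[t - 2])
        # Invert one big DDIM step (t -> t-2) to recover the clean-data target
        # the student must hit in a single prediction.
        a_t, b_t = abar[t].sqrt(), (1 - abar[t]).sqrt()
        a_s, b_s = abar[t - 2].sqrt(), (1 - abar[t - 2]).sqrt()
        x0_target = (x_tm2 - (b_s / b_t) * x_t) / (a_s - (b_s / b_t) * a_t)
    # Regressing a teacher-generated target (instead of the ground-truth mel)
    # reduces target-side variance, so the student converges with far fewer
    # diffusion iterations.
    return F.l1_loss(student(x_t, t), x0_target)
```

Repeating this round, with each N-step teacher distilled into an N/2-step student, is what eventually yields the 2-iteration sampler reported above.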
Related papers
- Directly Denoising Diffusion Models [6.109141407163027]
We present Directly Denoising Diffusion Model (DDDM), a simple and generic approach for generating realistic images with few-step sampling.
Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models.
For ImageNet 64x64, our approach stands as a competitive contender against leading models.
arXiv Detail & Related papers (2024-05-22T11:20:32Z)
- Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner [84.97253871387028]
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed.
We propose a timestep aligner that helps find a more accurate integral direction for a particular interval at minimal cost.
Experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods.
arXiv Detail & Related papers (2023-10-14T02:19:07Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation that simultaneously achieves fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Parallel Sampling of Diffusion Models [76.3124029406809]
Diffusion models are powerful generative models but suffer from slow sampling.
We present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel.
arXiv Detail & Related papers (2023-05-25T17:59:42Z)
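ParaDiGMS's parallel-in-time idea can be caricatured as fixed-point refinement of a guessed trajectory. The sketch below is a heavily simplified, Jacobi-style stand-in (the actual method uses Picard iterations over a sliding window with batched model calls), and every name in it is an assumption.

```python
# Simplified sketch of parallel-in-time sampling: refine a guessed denoising
# trajectory with sweeps whose model evaluations are mutually independent and
# can therefore be issued as one large batch instead of sequentially.
import torch

@torch.no_grad()
def parallel_sample(step_fn, x_T, num_steps, sweeps):
    # step_fn(x, i) performs the i-th deterministic denoising update.
    traj = [x_T.clone() for _ in range(num_steps + 1)]
    for _ in range(sweeps):
        # Each evaluation depends only on the previous sweep's states, so all
        # of them can run in parallel on a GPU.
        updates = [step_fn(traj[i], i) for i in range(num_steps)]
        traj[1:] = updates
    return traj[-1]  # exact once sweeps >= num_steps; close much sooner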
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech [37.29193613404699]
DDPMs are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
arXiv Detail & Related papers (2022-12-30T02:31:35Z)
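ResGrad's residual idea is simple to state: the diffusion model only needs to generate the gap between the base TTS model's spectrogram and the ground truth, which carries much less variance than a full spectrogram. A hypothetical sketch, with every signature assumed:

```python
# Hypothetical sketch of residual refinement: a lightweight diffusion model is
# trained on residuals (ground truth minus the base TTS output), which are far
# easier to model in a few steps than full mel-spectrograms.
import torch

def residual_training_target(gt_mel: torch.Tensor, coarse_mel: torch.Tensor):
    # The quantity the lightweight diffusion model learns to generate.
    return gt_mel - coarse_mel

def refine(coarse_mel, sample_residual):
    # sample_residual: a trained few-step diffusion sampler over residuals,
    # conditioned on the coarse spectrogram (signature is an assumption).
    return coarse_mel + sample_residual(coarse_mel)
```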
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech [14.231478930274058]
We propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis.
Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps.
We verify that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2021-04-03T13:53:19Z)
- Denoising Diffusion Implicit Models [117.03720513930335]
We present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs.
DDIMs can produce high-quality samples $10\times$ to $50\times$ faster in wall-clock time than DDPMs.
arXiv Detail & Related papers (2020-10-06T06:15:51Z)
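Since ProDiff's teacher and several of the systems above rely on DDIM, a generic sketch of the deterministic (eta = 0) sampler may help. The strided schedule and the eps-parameterization are assumptions chosen for clarity, not a specific paper's implementation.

```python
# Generic eta = 0 DDIM sampling loop: because each update is deterministic,
# the sampler can stride over the training timesteps instead of visiting all
# of them, which is what enables 10x-50x fewer network evaluations.
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, abar, steps):
    # abar: 1-D tensor of cumulative alpha products over the full schedule.
    ts = torch.linspace(len(abar) - 1, 0, steps + 1).long()  # strided schedule
    x = torch.randn(shape)
    for t, s in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        # Recover the clean-data estimate, then jump directly to timestep s.
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()
        x = abar[s].sqrt() * x0 + (1 - abar[s]).sqrt() * eps
    return x
```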
This list is automatically generated from the titles and abstracts of the papers on this site.