InferGrad: Improving Diffusion Models for Vocoder by Considering
Inference in Training
- URL: http://arxiv.org/abs/2202.03751v1
- Date: Tue, 8 Feb 2022 09:40:58 GMT
- Authors: Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng
Zhao
- Abstract summary: InferGrad is a diffusion vocoder that incorporates the inference process into training.
InferGrad achieves better voice quality than the baseline WaveGrad under the same conditions.
- Score: 33.91980890184044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Denoising diffusion probabilistic models (diffusion models for short) require
a large number of iterations in inference to achieve a generation quality
that matches or surpasses state-of-the-art generative models, which
invariably results in slow inference speed. Previous approaches aim to optimize
the choice of inference schedule over a few iterations to speed up inference.
However, this reduces generation quality, mainly because the
inference process is optimized separately rather than jointly with the
training process. In this paper, we propose InferGrad, a diffusion model for
vocoder that incorporates the inference process into training, to reduce the
number of inference iterations while maintaining high generation quality. More
specifically, during training we generate data from random noise through a
reverse process under an inference schedule with a few iterations, and impose a
loss to minimize the gap between the generated and ground-truth data samples.
Thus, unlike existing approaches, the training of InferGrad takes the
inference process into account. The advantages of InferGrad are demonstrated through
experiments on the LJSpeech dataset, which show that InferGrad achieves better
voice quality than the baseline WaveGrad under the same conditions, and
matches the baseline's voice quality with a 3x speedup (2 iterations
for InferGrad vs. 6 iterations for WaveGrad).
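The training procedure above lends itself to a short sketch. Below is a minimal, illustrative PyTorch rendering of the idea, not the paper's exact formulation: run the few-step reverse process used at inference time starting from random noise, then penalize the gap to the ground-truth waveform. The network interface `model(x, t)`, the schedule `betas_infer`, and the plain L1 gap loss are all assumptions; the paper's actual gap loss, conditioning, and sampler details may differ.

```python
import torch

def infergrad_gap_loss(model, x0, betas_infer, lambda_gap=1.0):
    """Sketch of an InferGrad-style training objective (illustrative):
    run the short inference-time reverse process and penalize the gap
    to the ground-truth waveform x0. Mel conditioning is omitted."""
    # Standard DDPM quantities for the short inference schedule.
    alphas = 1.0 - betas_infer                    # e.g. 2-6 noise levels
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start the reverse process from pure Gaussian noise.
    x = torch.randn_like(x0)
    for t in reversed(range(len(betas_infer))):
        eps_hat = model(x, t)                     # predicted noise at step t
        # DDPM posterior mean computed from the predicted noise.
        coef = betas_infer[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas_infer[t]) * torch.randn_like(x)

    # Gap term pulling the few-step sample toward the ground truth; the
    # paper combines such a term with the usual denoising loss.
    return lambda_gap * torch.mean(torch.abs(x - x0))
```

Because the reverse process is differentiable end to end, gradients of this gap term flow back through every inference step, which is how training comes to "see" the exact few-step sampler used at inference.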
Related papers
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
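As a rough illustration of the patch-based processing mentioned above, the helpers below partition a batch of waveforms into fixed-length patches and back; the patch size of 64 and the function names are our own, not taken from the LinDiff paper.

```python
import torch

def to_patches(wav: torch.Tensor, patch_size: int = 64) -> torch.Tensor:
    """Partition waveforms (B, T) into non-overlapping patches
    (B, T // patch_size, patch_size); patch_size is illustrative."""
    b, t = wav.shape
    pad = (-t) % patch_size                  # right-pad to a multiple
    wav = torch.nn.functional.pad(wav, (0, pad))
    return wav.reshape(b, -1, patch_size)

def from_patches(patches: torch.Tensor, length: int) -> torch.Tensor:
    """Invert to_patches, trimming padding back to the original length."""
    b = patches.shape[0]
    return patches.reshape(b, -1)[:, :length]
```

A sequence model then operates over the patch axis, so the effective sequence length (and any quadratic attention cost) shrinks by a factor of the patch size.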
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Can Diffusion Model Achieve Better Performance in Text Generation? Bridging the Gap between Training and Inference! [14.979893207094221]
Diffusion models have been successfully adapted to text generation tasks by mapping the discrete text into the continuous space.
There exist non-negligible gaps between training and inference, owing to the absence of the forward process during inference.
We propose two simple yet effective methods to bridge the gaps mentioned above, named Distance Penalty and Adaptive Decay Sampling.
arXiv Detail & Related papers (2023-05-08T05:32:22Z)
- ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval [68.7008281316644]
ReDi is a learning-free Retrieval-based Diffusion sampling framework.
We show that ReDi improves model inference efficiency with a 2x speedup.
arXiv Detail & Related papers (2023-02-05T03:01:28Z)
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech [37.29193613404699]
DDPMs are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
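A minimal sketch of the residual idea, under our own naming and the standard epsilon-prediction DDPM loss: the diffusion model is trained on the residual between the existing TTS model's spectrogram and the ground truth, conditioned on the coarse TTS output. The interface `residual_model(noisy, t, cond=...)` is an assumption.

```python
import torch

def resgrad_style_loss(residual_model, tts_mel, gt_mel, alpha_bars):
    """Illustrative training step for a ResGrad-style refiner: the
    diffusion model learns the residual between a TTS model's mel output
    and the ground truth, not the spectrogram itself. Mels are (B, M, F)."""
    residual = gt_mel - tts_mel                   # diffusion target
    t = torch.randint(0, len(alpha_bars), (residual.shape[0],))
    a = alpha_bars[t].view(-1, 1, 1)              # broadcast over (B, 1, 1)
    noise = torch.randn_like(residual)
    noisy = torch.sqrt(a) * residual + torch.sqrt(1.0 - a) * noise
    # Condition on the coarse TTS output so the refiner knows what to fix.
    eps_hat = residual_model(noisy, t, cond=tts_mel)
    return torch.mean((eps_hat - noise) ** 2)     # eps-prediction loss
```

At inference, one would sample a residual with a few reverse steps and add it to the TTS output; the residual is far simpler than a full spectrogram, which is what keeps the refiner lightweight.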
arXiv Detail & Related papers (2022-12-30T02:31:35Z)
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
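The "directly predicting clean data" parameterization can be contrasted with noise prediction in a few lines. This is a generic sketch of that parameterization, not ProDiff's exact model or loss; the name `model` and the MSE objective are illustrative.

```python
import torch

def x0_prediction_loss(model, x0, alpha_bars):
    """Sketch of the clean-data (x0) parameterization: the network
    outputs an estimate of x0 itself, rather than the added noise."""
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],))
    a = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * noise
    x0_hat = model(x_t, t)        # predict clean data, not noise
    return torch.mean((x0_hat - x0) ** 2)
```

Per the summary above, keeping the network's target in the data domain is what avoids the distinct quality degradation of few-step noise-prediction samplers when sampling is accelerated down to 2 iterations.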
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
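For flavor, here is a toy waveform encoder-decoder with skip connections in the spirit described above; all layer sizes are arbitrary and this is not the paper's architecture (in particular, the causality needed for real-time streaming is omitted).

```python
import torch
from torch import nn

class TinyEncDec(nn.Module):
    """Toy waveform encoder-decoder with skip connections.
    Input x is (B, 1, T) with T a multiple of 16."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.enc1 = nn.Conv1d(1, ch, kernel_size=4, stride=4)
        self.enc2 = nn.Conv1d(ch, 2 * ch, kernel_size=4, stride=4)
        self.dec2 = nn.ConvTranspose1d(2 * ch, ch, kernel_size=4, stride=4)
        self.dec1 = nn.ConvTranspose1d(ch, 1, kernel_size=4, stride=4)
        self.act = nn.ReLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))          # (B, ch, T/4)
        h2 = self.act(self.enc2(h1))         # (B, 2ch, T/16)
        d2 = self.act(self.dec2(h2)) + h1    # skip connection from encoder
        return self.dec1(d2) + x             # residual skip to the input
```

The skip connections carry fine-grained waveform detail past the bottleneck, so the decoder only has to learn what to remove, which suits the denoising task.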
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.