FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech
Synthesis
- URL: http://arxiv.org/abs/2204.09934v1
- Date: Thu, 21 Apr 2022 07:49:09 GMT
- Title: FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech
Synthesis
- Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou
Zhao
- Abstract summary: This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
- Score: 90.3069686272524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved
leading performance in many generative tasks. However, the cost of their iterative
sampling process has hindered their application to speech synthesis. This
paper proposes FastDiff, a fast conditional diffusion model for high-quality
speech synthesis. FastDiff employs a stack of time-aware location-variable
convolutions of diverse receptive field patterns to efficiently model long-term
time dependencies with adaptive conditions. A noise schedule predictor is also
adopted to reduce the number of sampling steps without sacrificing generation
quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer,
FastDiff-TTS, which generates high-fidelity speech waveforms without any
intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff
demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech
samples. FastDiff also synthesizes speech 58x faster than real-time
on a V100 GPU, making diffusion models practically applicable to speech
synthesis deployment for the first time. We further show that FastDiff
generalized well to the mel-spectrogram inversion of unseen speakers, and
FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech
synthesis. Audio samples are available at \url{https://FastDiff.github.io/}.
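To make the sampling-cost point concrete, here is a minimal sketch of a generic DDPM reverse (ancestral) sampling loop conditioned on a mel-spectrogram. It is not FastDiff's implementation, and `denoiser` is a hypothetical noise-prediction network; the key observation is that the loop length equals the length of the noise schedule, which is why a learned schedule predictor that outputs only a few beta values directly shortens synthesis.

```python
# Minimal sketch (not FastDiff's code): generic DDPM ancestral sampling of a
# waveform conditioned on a mel-spectrogram. `denoiser` is a hypothetical
# network that predicts the noise added at step t.
import torch

@torch.no_grad()
def sample(denoiser, mel, betas, waveform_len):
    """betas: 1-D tensor of noise-schedule values. A schedule predictor, as in
    FastDiff, would supply a much shorter schedule here than the hundreds of
    steps typically used during training, so the loop runs only a few times."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, waveform_len)               # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, mel, torch.tensor([t]))  # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                       # generated waveform
```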
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
However, it suffers from slow generation speed due to the large number of time steps required.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to achieve both fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
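A minimal sketch of the patch-based idea (illustrative only, not LinDiff's code): partitioning a 1-D waveform into fixed-size patches lets the network operate on a much shorter sequence of patch tokens.

```python
# Hedged sketch of patch-based processing (not LinDiff's actual code):
# split a 1-D waveform into non-overlapping patches and restore it afterwards.
import torch

def patchify(wav: torch.Tensor, patch_size: int) -> torch.Tensor:
    """wav: (batch, samples) -> (batch, num_patches, patch_size); zero-pads the end."""
    b, n = wav.shape
    pad = (-n) % patch_size                      # samples needed to reach a multiple
    wav = torch.nn.functional.pad(wav, (0, pad))
    return wav.view(b, -1, patch_size)

def unpatchify(patches: torch.Tensor, length: int) -> torch.Tensor:
    """(batch, num_patches, patch_size) -> (batch, length), dropping the padding."""
    b = patches.shape[0]
    return patches.reshape(b, -1)[:, :length]

wav = torch.randn(2, 22050)                      # e.g. one second at 22.05 kHz
patches = patchify(wav, patch_size=64)           # shape (2, 345, 64)
assert torch.allclose(unpatchify(patches, 22050), wav)
```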
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech [37.29193613404699]
DDPMs are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
Previous works have explored speeding up inference by reducing the number of inference steps, but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
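A minimal sketch of the residual-refinement idea (illustrative names, not ResGrad's API): the diffusion model only needs to learn the residual between a coarse mel-spectrogram from an existing TTS model and the ground truth, and inference adds a sampled residual back onto the coarse output.

```python
# Hedged sketch of residual refinement (illustrative, not ResGrad's API).
import torch

def residual_target(gt_mel: torch.Tensor, coarse_mel: torch.Tensor) -> torch.Tensor:
    """Training target for the residual diffusion model: what must be added to
    the coarse TTS output to reach the ground-truth mel-spectrogram."""
    return gt_mel - coarse_mel

def refine(coarse_mel: torch.Tensor, residual_sampler) -> torch.Tensor:
    """Inference: sample a residual conditioned on the coarse mel and add it back.
    `residual_sampler` is a hypothetical callable wrapping the reverse diffusion."""
    return coarse_mel + residual_sampler(coarse_mel)
```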
arXiv Detail & Related papers (2022-12-30T02:31:35Z)
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
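The clean-data parameterization can be contrasted with noise prediction in a single reverse step. The sketch below is a generic illustration under standard DDPM definitions, not ProDiff's code; `model` is a hypothetical network that predicts the clean sample x0 directly.

```python
# Hedged sketch (not ProDiff's code): one reverse step when the network predicts
# the clean sample x0 directly, using the standard DDPM posterior q(x_{t-1} | x_t, x0).
import torch

def reverse_step_x0(model, x_t, cond, t, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)

    x0_hat = model(x_t, cond, torch.tensor([t]))        # predicted clean data

    # Posterior mean of q(x_{t-1} | x_t, x0), with x0 replaced by the prediction.
    coef_x0 = torch.sqrt(alpha_bar_prev) * betas[t] / (1.0 - alpha_bars[t])
    coef_xt = torch.sqrt(alphas[t]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])
    mean = coef_x0 * x0_hat + coef_xt * x_t

    var = betas[t] * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(var) * noise
```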
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech [14.231478930274058]
We propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis.
Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps.
We verify that Diff-TTS generates speech 28 times faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2021-04-03T13:53:19Z)