Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS
- URL: http://arxiv.org/abs/2308.01573v1
- Date: Thu, 3 Aug 2023 07:22:04 GMT
- Title: Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS
- Authors: Myeongjin Ko and Yong-Hoon Choi
- Abstract summary: The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The diffusion model is capable of generating high-quality data through a
probabilistic approach. However, it suffers from the drawback of slow
generation speed due to the requirement of a large number of time steps. To
address this limitation, recent models such as denoising diffusion implicit
models (DDIM) focus on generating samples without directly modeling the
probability distribution, while models like denoising diffusion generative
adversarial networks (GANs) combine diffusion processes with GANs. In the field
of speech synthesis, a recent diffusion speech synthesis model called
DiffGAN-TTS, utilizing the structure of GANs, has been introduced and
demonstrates superior performance in both speech quality and generation speed.
In this paper, to further enhance the performance of DiffGAN-TTS, we propose a
speech synthesis model with two discriminators: a diffusion discriminator for
learning the distribution of the reverse process and a spectrogram
discriminator for learning the distribution of the generated data. Objective
metrics such as structural similarity index measure (SSIM), mel-cepstral
distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective
intelligibility (STOI), perceptual evaluation of speech quality (PESQ), as well
as subjective metrics like mean opinion score (MOS), are used to evaluate the
performance of the proposed model. The evaluation results show that the
proposed model outperforms recent state-of-the-art models such as FastSpeech2
and DiffGAN-TTS in various metrics. Our implementation and audio samples are
located on GitHub.
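Two of the objective metrics named in the abstract, MCD and F0 RMSE, can be sketched as follows. This is a minimal illustration under common conventions, not the authors' evaluation code: it assumes the reference and synthesized features are already time-aligned, that the 0th (energy) cepstral coefficient is excluded from MCD, and that F0 RMSE is computed only over frames voiced in both signals. The function names `mel_cepstral_distortion` and `f0_rmse` are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB between two time-aligned mel-cepstra, shape (frames, coeffs).

    Assumption: inputs are already aligned (e.g. by DTW) and the 0th
    coefficient is dropped, which is a common but not universal convention.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and equally shaped")
    diff = ref[:, 1:] - syn[:, 1:]  # skip the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both contours."""
    ref = np.asarray(ref_f0, dtype=float)
    syn = np.asarray(syn_f0, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and equally shaped")
    voiced = (ref > 0) & (syn > 0)  # assumption: unvoiced frames are coded as 0
    if not voiced.any():
        return float("nan")
    return float(np.sqrt(np.mean((ref[voiced] - syn[voiced]) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(50, 25))               # toy mel-cepstra, 50 frames
    syn = ref + 0.05 * rng.normal(size=ref.shape)
    print("MCD (dB):", mel_cepstral_distortion(ref, syn))
    print("F0 RMSE (Hz):", f0_rmse([100.0, 0.0, 200.0], [110.0, 50.0, 190.0]))
```

Lower values indicate a closer match for both metrics. The toy inputs above are synthetic and say nothing about the results reported in the paper.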
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
(2024-01-30T09:17:57Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
(2023-10-25T17:59:12Z)
- ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation [21.335983674309475]
Diffusion models suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation.
We introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query.
We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space.
(2023-09-19T16:36:33Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
(2023-07-28T11:20:23Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
(2023-06-09T07:02:43Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
(2023-05-26T16:38:48Z)
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables sampling 24x faster than real time on a single NVIDIA 2080Ti GPU.
(2022-07-13T17:45:43Z)
- DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs [39.388599580262614]
We introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity speech synthesis.
Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
(2022-01-28T07:41:10Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
(2021-07-25T19:23:18Z)