DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
- URL: http://arxiv.org/abs/2201.11972v1
- Date: Fri, 28 Jan 2022 07:41:10 GMT
- Title: DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
- Authors: Songxiang Liu, Dan Su, Dong Yu
- Abstract summary: We introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity speech synthesis.
Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
- Score: 39.388599580262614
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Denoising diffusion probabilistic models (DDPMs) are expressive generative
models that have been used to solve a variety of speech synthesis problems.
However, because of their high sampling costs, DDPMs are difficult to use in
real-time speech processing applications. In this paper, we introduce
DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving
high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising
diffusion generative adversarial networks (GANs), which adopt an
adversarially-trained expressive model to approximate the denoising
distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can
generate high-fidelity speech samples within only 4 denoising steps. We present
an active shallow diffusion mechanism to further speed up inference. A
two-stage training scheme is proposed, with a basic TTS acoustic model trained
at stage one providing valuable prior information for a DDPM trained at stage
two. Our experiments show that DiffGAN-TTS can achieve high synthesis
performance with only 1 denoising step.
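The core idea, replacing the Gaussian denoising posterior with an adversarially trained generator so that each of the few denoising steps can take a large, multimodal jump, can be illustrated with a short training-step sketch. The PyTorch code below is a hypothetical, minimal illustration of a denoising-diffusion-GAN step, not the authors' implementation: the noise schedule, network shapes, toy data, and the absence of text conditioning are all placeholder assumptions.

```python
# Hypothetical sketch of the denoising-diffusion-GAN idea behind DiffGAN-TTS:
# a generator predicts x0 from (x_t, t), a tractable Gaussian posterior turns
# that into a sample of x_{t-1}, and a discriminator judges (x_{t-1}, x_t, t).
import torch
import torch.nn as nn
import torch.nn.functional as F

T, DIM = 4, 80                                   # 4 denoising steps; 80-bin "mel" vector
betas = torch.linspace(0.1, 0.5, T)              # placeholder schedule for few-step diffusion
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)              # \bar{alpha}_t
abar_prev = torch.cat([torch.ones(1), abar[:-1]])

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return abar[t].sqrt()[:, None] * x0 + (1 - abar[t]).sqrt()[:, None] * eps

def posterior_sample(x0, xt, t):
    """Sample the tractable Gaussian posterior q(x_{t-1} | x_t, x_0)."""
    coef0 = abar_prev[t].sqrt() * betas[t] / (1 - abar[t])
    coeft = alphas[t].sqrt() * (1 - abar_prev[t]) / (1 - abar[t])
    mean = coef0[:, None] * x0 + coeft[:, None] * xt
    var = betas[t] * (1 - abar_prev[t]) / (1 - abar[t])
    return mean + var.sqrt()[:, None] * torch.randn_like(xt)

class Generator(nn.Module):                      # stand-in for the acoustic model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + T, 256), nn.ReLU(), nn.Linear(256, DIM))
    def forward(self, xt, t):                    # predicts x0 directly from (x_t, t)
        return self.net(torch.cat([xt, F.one_hot(t, T).float()], dim=-1))

class Discriminator(nn.Module):                  # judges (x_{t-1}, x_t, t) triples
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * DIM + T, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward(self, x_prev, xt, t):
        return self.net(torch.cat([x_prev, xt, F.one_hot(t, T).float()], dim=-1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

x0 = torch.randn(8, DIM)                         # stand-in for ground-truth mel frames
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))

# Discriminator step: real x_{t-1} comes from the true posterior, fake from G.
real_prev = posterior_sample(x0, xt, t)
fake_prev = posterior_sample(G(xt, t).detach(), xt, t)
loss_d = F.softplus(-D(real_prev, xt, t)).mean() + F.softplus(D(fake_prev, xt, t)).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool D, plus a reconstruction term on the predicted x0.
x0_pred = G(xt, t)
fake_prev = posterior_sample(x0_pred, xt, t)
loss_g = F.softplus(-D(fake_prev, xt, t)).mean() + F.l1_loss(x0_pred, x0)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()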
Related papers
- CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models [30.68516200579894]
We introduce CM-TTS, a novel architecture grounded in consistency models (CMs).
CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies.
We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations.
arXiv Detail & Related papers (2024-03-31T05:38:08Z)
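The consistency-model mechanism CM-TTS builds on can be illustrated with the standard multistep consistency sampling loop (Song et al., 2023): a learned function maps any noisy input directly to a clean estimate, so one step suffices and further steps only refine. The sketch below is a hedged illustration; `consistency_fn` is a placeholder for a trained network, not CM-TTS's architecture, and the re-noising scale is simplified.

```python
# Hypothetical sketch of multistep consistency sampling: f(x, sigma) maps a
# noisy sample straight to a clean estimate; extra steps re-noise and re-apply.
import torch

def consistency_fn(x, sigma):
    # Placeholder for a trained consistency model over mel-spectrograms;
    # a real TTS model would also condition on the input text.
    return x / (1.0 + sigma)

def multistep_consistency_sample(shape, sigmas):
    """sigmas: decreasing noise levels, e.g. [80.0, 10.0, 1.0]."""
    x = sigmas[0] * torch.randn(shape)           # start from pure noise
    x0 = consistency_fn(x, sigmas[0])            # one-step estimate of clean data
    for sigma in sigmas[1:]:                     # optional refinement steps
        # (the exact scale in the paper is sqrt(sigma^2 - sigma_min^2); simplified here)
        x = x0 + sigma * torch.randn(shape)      # re-noise the current estimate
        x0 = consistency_fn(x, sigma)            # map back to a clean estimate
    return x0

mel = multistep_consistency_sample((1, 80, 200), sigmas=[80.0, 10.0, 1.0])
```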
- Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis [35.16243386407448]
Bridge-TTS is a novel TTS system that substitutes the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one.
Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram.
arXiv Detail & Related papers (2023-12-06T13:31:55Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model can generate high-quality data through a probabilistic approach, but it suffers from slow generation speed because it requires a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
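The parameterization choice the ProDiff summary describes, predicting the clean sample x0 directly rather than the added noise eps, changes only how the denoising mean is formed. Below is a hedged sketch of the two routes with placeholder schedules and stand-in network outputs; it is not ProDiff's implementation.

```python
# Hypothetical sketch of the parameterization ProDiff argues for: the denoiser
# predicts the clean sample x0 directly instead of the added noise eps.
# Both routes feed the same Gaussian posterior mean; schedules are placeholders.
import torch

T = 4
betas = torch.linspace(0.1, 0.5, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)
abar_prev = torch.cat([torch.ones(1), abar[:-1]])

def x0_from_eps(xt, eps, t):
    # eps-parameterization: recover x0 from the predicted noise.
    return (xt - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()

def posterior_mean(x0, xt, t):
    # Mean of q(x_{t-1} | x_t, x0), shared by both parameterizations.
    coef0 = abar_prev[t].sqrt() * betas[t] / (1 - abar[t])
    coeft = alphas[t].sqrt() * (1 - abar_prev[t]) / (1 - abar[t])
    return coef0 * x0 + coeft * xt

xt, t = torch.randn(80), 2
eps_pred = torch.randn(80)                       # stand-in for an eps-predicting net
mean_from_eps = posterior_mean(x0_from_eps(xt, eps_pred, t), xt, t)

x0_pred = torch.randn(80)                        # stand-in for an x0-predicting net
mean_from_x0 = posterior_mean(x0_pred, xt, t)    # ProDiff-style: no division by sqrt(abar_t)
```

One intuition for the quality gap at aggressive step counts: the eps route divides by sqrt(abar_t), which amplifies prediction error when abar_t is small, while the x0 route does not.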
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
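The mask-and-predict decoding the TranSpeech summary refers to is the standard non-autoregressive iterative refinement loop (as in conditional masked language models): predict all units in parallel, keep the confident ones, and re-mask and re-predict the rest. The sketch below uses a dummy scorer and is not TranSpeech's actual decoder.

```python
# Hypothetical sketch of mask-predict decoding for non-autoregressive unit
# generation: all positions start masked, each iteration predicts them in
# parallel, and the least confident ones are re-masked for the next pass.
# `predict` is a placeholder; a real model conditions on the source speech.
import torch

VOCAB, MASK, LEN, ITERS = 1000, 0, 50, 4

def predict(tokens):
    # Placeholder for the unit decoder: returns per-position log-probs.
    return torch.log_softmax(torch.randn(tokens.shape[0], VOCAB), dim=-1)

tokens = torch.full((LEN,), MASK)                # start fully masked
scores = torch.zeros(LEN)
for it in range(ITERS):
    logp = predict(tokens)
    conf, pred = logp.max(dim=-1)                # parallel prediction + confidence
    masked = tokens == MASK
    tokens[masked], scores[masked] = pred[masked], conf[masked]
    n_mask = int(LEN * (1 - (it + 1) / ITERS))   # linearly decaying mask ratio
    if n_mask == 0:
        break
    remask = scores.topk(n_mask, largest=False).indices
    tokens[remask] = MASK                        # re-mask least confident units
```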
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech [14.231478930274058]
We propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis.
Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps.
We verify that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2021-04-03T13:53:19Z)
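The diffusion decoding loop Diff-TTS runs at inference, starting from Gaussian noise and iteratively denoising toward a mel-spectrogram, is the standard DDPM ancestral sampler. A minimal sketch under placeholder assumptions: `eps_model` stands in for the trained, text-conditioned noise predictor, and the schedule is generic.

```python
# Hypothetical sketch of the standard DDPM ancestral sampler used by
# Diff-TTS-style models: start from noise in mel-spectrogram shape and
# denoise step by step. `eps_model` is a placeholder network.
import torch

T = 400                                          # placeholder number of steps
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)

def eps_model(x, t):
    # Placeholder for a trained, text-conditioned noise predictor.
    return torch.zeros_like(x)

x = torch.randn(1, 80, 200)                      # noise in mel-spectrogram shape
for t in reversed(range(T)):
    eps = eps_model(x, t)
    mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t > 0:
        x = mean + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t variant
    else:
        x = mean                                 # final step is deterministic
mel = x
```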
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.