GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from
Diffusion Models
- URL: http://arxiv.org/abs/2210.05271v1
- Date: Tue, 11 Oct 2022 09:12:29 GMT
- Title: GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from
Diffusion Models
- Authors: Matthew Baas and Herman Kamper
- Abstract summary: AudioStyleGAN (ASGAN) is a new generative adversarial network (GAN) for unconditional speech synthesis.
ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset.
- Score: 23.822788597966646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose AudioStyleGAN (ASGAN), a new generative adversarial network (GAN)
for unconditional speech synthesis. As in the StyleGAN family of image
synthesis models, ASGAN maps sampled noise to a disentangled latent vector
which is then mapped to a sequence of audio features so that signal aliasing is
suppressed at every layer. To successfully train ASGAN, we introduce a number
of new techniques, including a modification to adaptive discriminator
augmentation to probabilistically skip discriminator updates. ASGAN achieves
state-of-the-art results in unconditional speech synthesis on the Google Speech
Commands dataset. It is also substantially faster than the top-performing
diffusion models. Through a design that encourages disentanglement, ASGAN is
able to perform voice conversion and speech editing without being explicitly
trained to do so. ASGAN demonstrates that GANs are still highly competitive
with diffusion models. Code, models, samples:
https://github.com/RF5/simple-asgan/.
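As a rough illustration of the recipe described in the abstract, the minimal PyTorch sketch below wires together a StyleGAN-style mapping network (noise z to latent w), a small convolutional generator (w to a sequence of audio features), and a GAN training step in which the discriminator update is probabilistically skipped. All module names, layer sizes, losses, and the fixed skip_prob are illustrative assumptions and are not taken from the simple-asgan repository; in the actual method the skip decision comes from the modified adaptive discriminator augmentation, and the generator layers are designed to suppress aliasing.

```python
# Minimal sketch (not the authors' code) of the ideas named in the abstract:
# mapping network -> disentangled latent w -> audio-feature generator, trained with a
# discriminator whose update is probabilistically skipped. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    """Maps sampled noise z to a (hopefully disentangled) latent vector w."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers += [nn.Linear(z_dim if i == 0 else w_dim, w_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class Generator(nn.Module):
    """Maps w to a sequence of audio features (here: 80-dim frames, 128 time steps)."""
    def __init__(self, w_dim=512, n_frames=128, n_feats=80):
        super().__init__()
        self.n_frames = n_frames
        self.proj = nn.Linear(w_dim, n_frames * 32)
        self.convs = nn.Sequential(
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(64, n_feats, kernel_size=5, padding=2),
        )

    def forward(self, w):
        x = self.proj(w).view(w.shape[0], 32, self.n_frames)
        return self.convs(x)  # (batch, n_feats, n_frames)

class Discriminator(nn.Module):
    def __init__(self, n_feats=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feats, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def train_step(M, G, D, opt_g, opt_d, real, skip_prob=0.2):
    """One GAN step; the discriminator update is skipped with probability skip_prob
    (a fixed stand-in for the adaptive skip rule described in the abstract)."""
    z = torch.randn(real.shape[0], 512)
    fake = G(M(z))

    # Discriminator update, probabilistically skipped.
    if torch.rand(()).item() >= skip_prob:
        d_loss = (F.softplus(D(fake.detach())) + F.softplus(-D(real))).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # Generator and mapping network are always updated.
    g_loss = F.softplus(-D(fake)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return g_loss.item()

if __name__ == "__main__":
    M, G, D = MappingNetwork(), Generator(), Discriminator()
    opt_g = torch.optim.Adam(list(M.parameters()) + list(G.parameters()), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    real = torch.randn(4, 80, 128)  # placeholder batch of real audio features
    print(train_step(M, G, D, opt_g, opt_d, real))
```

Skipping some discriminator updates weakens a discriminator that would otherwise overfit a small dataset, which is the same failure mode adaptive discriminator augmentation targets in the full model.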
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Disentanglement in a GAN for Unconditional Speech Synthesis [28.998590651956153]
We propose AudioStyleGAN -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space.
ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.
We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis.
arXiv Detail & Related papers (2023-07-04T12:06:07Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to achieve both fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models at a faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)
- TransFusion: Transcribing Speech with Multinomial Diffusion [20.165433724198937]
We propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features.
We demonstrate performance comparable to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark.
We also propose new techniques for effectively sampling and decoding multinomial diffusion models.
arXiv Detail & Related papers (2022-10-14T10:01:43Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (a generic reverse-diffusion sketch of this idea follows this list).
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
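Several of the diffusion-based papers above (DiffSinger, FastDiff, DiffuSE and others) share the same sampling-time structure: start from Gaussian noise and repeatedly apply a conditional denoiser until a mel-spectrogram or waveform emerges. The sketch below is a standard DDPM reverse loop illustrating that structure; the denoiser, conditioning input, and noise schedule are placeholders, not any specific paper's implementation.

```python
# Generic DDPM-style reverse diffusion loop (an illustrative sketch, not taken from any
# of the papers listed above). A trained noise-prediction network would replace `denoiser`.
import torch

def ddpm_sample(denoiser, cond, shape, T=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)    # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                   # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = denoiser(x, t_batch, cond)                     # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise              # one reverse step
    return x  # e.g. a (batch, n_mels, n_frames) mel-spectrogram

if __name__ == "__main__":
    dummy = lambda x, t, cond: torch.zeros_like(x)           # stand-in for a trained model
    mel = ddpm_sample(dummy, cond=None, shape=(1, 80, 128), T=50)
    print(mel.shape)
```

Because this loop runs the denoiser T times per sample, such models are typically much slower at inference than a single forward pass through a GAN generator, which is the speed gap the ASGAN abstract refers to.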
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.