Disentanglement in a GAN for Unconditional Speech Synthesis
- URL: http://arxiv.org/abs/2307.01673v2
- Date: Thu, 25 Jan 2024 13:44:10 GMT
- Title: Disentanglement in a GAN for Unconditional Speech Synthesis
- Authors: Matthew Baas and Herman Kamper
- Abstract summary: We propose AudioStyleGAN -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space.
ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.
We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis.
- Score: 28.998590651956153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can we develop a model that can synthesize realistic speech directly from a
latent space, without explicit conditioning? Despite several efforts over the
last decade, previous adversarial and diffusion-based approaches still struggle
to achieve this, even on small-vocabulary datasets. To address this, we propose
AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional
speech synthesis tailored to learn a disentangled latent space. Building upon
the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a
disentangled latent vector which is then mapped to a sequence of audio features
so that signal aliasing is suppressed at every layer. To successfully train
ASGAN, we introduce a number of new techniques, including a modification to
adaptive discriminator augmentation which probabilistically skips discriminator
updates. We apply it on the small-vocabulary Google Speech Commands digits
dataset, where it achieves state-of-the-art results in unconditional speech
synthesis. It is also substantially faster than existing top-performing
diffusion models. We confirm that ASGAN's latent space is disentangled: we
demonstrate how simple linear operations in the space can be used to perform
several tasks unseen during training. Specifically, we perform evaluations in
voice conversion, speech enhancement, speaker verification, and keyword
classification. Our work indicates that GANs are still highly competitive in
the unconditional speech synthesis landscape, and that disentangled latent
spaces can be used to aid generalization to unseen tasks. Code, models,
samples: https://github.com/RF5/simple-asgan/
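The abstract names two concrete mechanisms without giving implementation detail: a modified adaptive discriminator augmentation that probabilistically skips discriminator updates, and downstream tasks performed through simple linear operations in the disentangled latent space. The sketches below are rough, assumption-laden illustrations of these two ideas only; the non-saturating loss, the fixed skip probability, and the function names are placeholders, not the authors' implementation (which is available at the repository linked above).

```python
import random

import torch
import torch.nn.functional as F


def gan_train_step_with_skip(G, D, opt_g, opt_d, real_batch, z_dim, skip_prob=0.2):
    """One GAN training step in which the discriminator update is
    probabilistically skipped. This only illustrates the idea described in
    the abstract; the actual skip rule is tied to the adaptive discriminator
    augmentation heuristic, which is not modelled here."""
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)

    # Discriminator update: skipped with probability `skip_prob`, slowing D
    # relative to G when D would otherwise overpower the generator.
    if random.random() >= skip_prob:
        fake = G(z).detach()
        d_loss = F.softplus(D(fake)).mean() + F.softplus(-D(real_batch)).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # Generator update: always performed (non-saturating GAN loss).
    fake = G(z)
    g_loss = F.softplus(-D(fake)).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return g_loss.item()
```

In the actual method the skip decision would presumably be driven by the same overfitting statistic used for adaptive discriminator augmentation rather than a constant probability as above. Likewise, the "simple linear operations" in the latent space could, for example, amount to linear interpolation between latent codes; the snippet below is purely illustrative and does not reproduce the paper's specific editing procedures.

```python
import torch


def linear_latent_edit(w_src: torch.Tensor, w_ref: torch.Tensor, alpha: float) -> torch.Tensor:
    # Move the source latent along the straight line toward a reference latent.
    # In a disentangled W space, such a linear edit can change one factor
    # (e.g. speaker identity) while leaving others largely intact.
    return (1.0 - alpha) * w_src + alpha * w_ref
```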
Related papers
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization [21.591831983223997]
We propose an exemplar-based style synthesis pipeline to improve domain generalization in semantic segmentation.
Our method is based on a novel masked noise encoder for StyleGAN2 inversion.
We achieve up to 12.4% mIoU improvements on driving-scene semantic segmentation under different types of data shifts.
arXiv Detail & Related papers (2023-07-02T19:56:43Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models [23.822788597966646]
AudioStyleGAN (ASGAN) is a new generative adversarial network (GAN) for unconditional speech synthesis.
ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset.
arXiv Detail & Related papers (2022-10-11T09:12:29Z)
- GANtron: Emotional Speech Synthesis with Generative Adversarial Networks [0.0]
We propose a text-to-speech model where the inferred speech can be tuned with the desired emotions.
We use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism.
arXiv Detail & Related papers (2021-10-06T10:44:30Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.