SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and
Music Synthesis
- URL: http://arxiv.org/abs/2402.01753v1
- Date: Tue, 30 Jan 2024 09:17:57 GMT
- Title: SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and
Music Synthesis
- Authors: Teysir Baoueb (IP Paris, LTCI, IDS, S2A), Haocheng Liu (IP Paris,
LTCI, IDS, S2A), Mathieu Fontaine (IP Paris, LTCI, IDS, S2A), Jonathan Le
Roux (MERL), Gael Richard (IP Paris, LTCI, IDS, S2A)
- Abstract summary: We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative adversarial network (GAN) models can synthesize high-quality
audio signals while ensuring fast sample generation. However, they are difficult to
train and are prone to several issues including mode collapse and divergence.
In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN,
which was initially devised for speech synthesis from mel spectrograms. In our
model, training stability is enhanced by means of a forward diffusion
process, which consists of injecting Gaussian noise into both
real and fake samples before feeding them to the discriminator. We further
improve the model by exploiting a spectrally-shaped noise distribution, with the
aim of making the discriminator's task more challenging. We then show the merits
of our proposed model for speech and music synthesis on several datasets. Our
experiments confirm that our model compares favorably with several baselines
in terms of audio quality and efficiency.
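As a rough illustration of the training procedure the abstract describes, the sketch below injects (optionally spectrally-shaped) Gaussian noise into waveforms via a standard forward-diffusion step before they reach the discriminator. All names, the noise schedule, and the filter interface are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def diffuse(x, t, alphas_cumprod, spectral_filter=None):
    """Forward-diffuse a batch of waveforms x (batch, samples) to step t (batch,).

    Plain Gaussian noise is used when spectral_filter is None; passing a
    filter that colors the noise gives the spectrally-shaped variant.
    """
    noise = torch.randn_like(x)
    if spectral_filter is not None:
        noise = spectral_filter(noise)        # shape the noise spectrum
    a_bar = alphas_cumprod[t].view(-1, 1)     # (batch, 1)
    return a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * noise

# During discriminator training, real and fake audio would be diffused with
# the same timestep so the discriminator must tell them apart under matched
# noise, e.g.:
#   d_real = discriminator(diffuse(x_real, t, alphas_cumprod, f))
#   d_fake = discriminator(diffuse(x_fake.detach(), t, alphas_cumprod, f))
```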
Related papers
- Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
However, it suffers from slow generation speed due to the large number of time steps required.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
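The two-discriminator setup summarized above could be combined in a generator objective along the following lines; the least-squares GAN form, the weights, and all names here are assumptions for illustration, not the paper's actual API.

```python
def generator_loss(d_diff, d_spec, x_t_fake, t, mel_fake,
                   w_diff=1.0, w_spec=1.0):
    """Least-squares GAN objective over both discriminator heads.

    d_diff judges a noisy intermediate state of the reverse process,
    d_spec judges the spectrogram of the generated sample.
    """
    loss_diff = ((d_diff(x_t_fake, t) - 1.0) ** 2).mean()
    loss_spec = ((d_spec(mel_fake) - 1.0) ** 2).mean()
    return w_diff * loss_diff + w_spec * loss_spec
```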
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
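LinDiff's patch-based processing, mentioned above, amounts to reshaping the waveform into short fixed-size tokens. A minimal sketch follows; the patch size and helper names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def patchify(wav, patch_size=64):
    """(batch, samples) -> (batch, n_patches, patch_size), zero-padding the tail."""
    batch, n = wav.shape
    pad = (-n) % patch_size
    wav = F.pad(wav, (0, pad))
    return wav.view(batch, -1, patch_size)

def unpatchify(patches, n_samples):
    """Invert patchify and trim the padding."""
    return patches.reshape(patches.shape[0], -1)[:, :n_samples]
```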
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff achieves a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
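The clean-data parameterization that the ProDiff summary describes replaces the usual noise-prediction target. Below is a sketch of one reverse step under that parameterization, using standard DDPM posterior algebra; the model signature and names are placeholders, not the paper's actual code.

```python
import torch

def reverse_step_x0(model, x_t, t, cond, betas, alphas_cumprod):
    """One reverse-diffusion step where the network predicts x0 directly.

    t is a Python int timestep; betas and alphas_cumprod are (T,) tensors.
    """
    x0_hat = model(x_t, t, cond)                       # clean-data estimate
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    beta_t = betas[t]
    # Posterior mean of q(x_{t-1} | x_t, x0_hat), standard DDPM algebra:
    mean = (a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
         + ((1 - beta_t).sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    if t == 0:
        return mean
    var = beta_t * (1 - a_bar_prev) / (1 - a_bar_t)    # posterior variance
    return mean + var.sqrt() * torch.randn_like(x_t)
```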
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise is processed in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
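The adaptive spectral shaping SpecGrad describes can be approximated by filtering white noise in the STFT domain with an envelope lifted from the mel spectrogram. The mel pseudo-inverse and all parameters below are assumptions for illustration, not the paper's actual filter design.

```python
import torch

def shape_noise(noise, mel, mel_basis, n_fft=1024, hop=256, eps=1e-5):
    """Color white noise (batch, samples) with an envelope from mel (batch, M, frames).

    mel_basis is the (M, n_fft // 2 + 1) mel filterbank matrix.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(noise, n_fft, hop, window=window, return_complex=True)
    # Lift the mel envelope back to the linear-frequency axis (rough inverse):
    env = torch.matmul(mel_basis.pinverse(), mel).clamp_min(eps)
    frames = min(spec.shape[-1], env.shape[-1])
    shaped = spec[..., :frames] * env[..., :frames]
    return torch.istft(shaped, n_fft, hop, window=window,
                       length=noise.shape[-1])
```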
- SpecSinGAN: Sound Effect Variation Synthesis Using Single-Image GANs [0.0]
Single-image generative adversarial networks learn from the internal distribution of a single training example to generate variations of it.
SpecSinGAN takes a single one-shot sound effect and produces novel variations of it, as if they were different takes from the same recording session.
arXiv Detail & Related papers (2021-10-14T12:25:52Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts noise into a mel spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)