BigVGAN: A Universal Neural Vocoder with Large-Scale Training
- URL: http://arxiv.org/abs/2206.04658v1
- Date: Thu, 9 Jun 2022 17:56:10 GMT
- Title: BigVGAN: A Universal Neural Vocoder with Large-Scale Training
- Authors: Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon
- Abstract summary: We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and an anti-aliased representation into the generator, which bring the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale to date, up to 112M parameters, which is unprecedented in the literature.
- Score: 49.16254684584935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent progress in generative adversarial network (GAN)-based
vocoders, where the model generates raw waveform conditioned on mel
spectrogram, it is still challenging to synthesize high-fidelity audio for
numerous speakers across varied recording environments. In this work, we
present BigVGAN, a universal vocoder that generalizes well under various unseen
conditions in a zero-shot setting. We introduce periodic nonlinearities and
anti-aliased representation into the generator, which brings the desired
inductive bias for waveform synthesis and significantly improves audio quality.
Based on our improved generator and the state-of-the-art discriminators, we
train our GAN vocoder at the largest scale up to 112M parameters, which is
unprecedented in the literature. In particular, we identify and address the
training instabilities specific to such scale, while maintaining high-fidelity
output without over-regularization. Our BigVGAN achieves the state-of-the-art
zero-shot performance for various out-of-distribution scenarios, including new
speakers, novel languages, singing voices, music and instrumental audio in
unseen (even noisy) recording environments. We will release our code and model
at: https://github.com/NVIDIA/BigVGAN
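The "periodic nonlinearities" named in the abstract refer to Snake-style activations, which add a periodic term to the identity so the generator is biased toward oscillatory (waveform-like) outputs. A minimal NumPy sketch is shown below; it assumes the common form x + sin²(αx)/α with a fixed α, whereas in practice α is typically a learned per-channel parameter:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake periodic activation: x + (1/alpha) * sin^2(alpha * x).

    A sketch of the periodic nonlinearity used in BigVGAN-style
    generators; alpha sets the frequency of the periodic component
    (learned per channel in practice, fixed here for illustration).
    """
    return x + np.sin(alpha * x) ** 2 / alpha

x = np.linspace(-3.0, 3.0, 7)
y = snake(x)
# The output tracks the identity on average but carries a periodic
# ripple, giving the network an inductive bias toward oscillations.
```

In the paper this activation is paired with low-pass-filtered (anti-aliased) up/downsampling, since the sin² term introduces harmonics that would otherwise alias at the layer's sampling rate.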
Related papers
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders [14.222389985736422]
VNet is a GAN-based neural vocoder network that incorporates full-band spectral information.
We demonstrate that the VNet model is capable of generating high-fidelity speech.
arXiv Detail & Related papers (2024-08-13T14:00:02Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise shaping is processed in the time-frequency domain, keeping the computational cost almost the same as conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains [1.8047694351309207]
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
The MelGAN-based structure is trained on a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is a singing voice synthesis (SVS) system targeting high-fidelity singing voices.
It consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder.
Experimental results show that HiFiSinger synthesizes singing voices of much higher quality than prior systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
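Several entries above (Universal MelGAN, and BigVGAN's "state-of-the-art discriminators") rely on spectrogram discriminators that view the waveform at multiple STFT resolutions. The sketch below shows only the feature-extraction step, i.e. magnitude spectrograms at several window/hop sizes that would each feed a separate discriminator; the resolution values and the plain-NumPy STFT are illustrative assumptions, not any paper's exact configuration:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window (minimal NumPy implementation)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)

def multi_resolution_mags(x, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectrogram magnitudes at several (n_fft, hop) resolutions.

    In a multi-resolution spectrogram discriminator setup, each
    resolution would feed its own discriminator network; the
    (n_fft, hop) pairs here are illustrative, not canonical.
    """
    return [stft_mag(x, n_fft, hop) for n_fft, hop in resolutions]

audio = np.random.randn(8192)  # stand-in for a generated waveform
mags = multi_resolution_mags(audio)
```

Small windows localize transients while large windows resolve fine harmonic structure, which is why operating at several resolutions sharpens the generated spectra.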
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.