Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform
Generation in Multiple Domains
- URL: http://arxiv.org/abs/2011.09631v2
- Date: Thu, 4 Mar 2021 02:00:12 GMT
- Title: Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform
Generation in Multiple Domains
- Authors: Won Jang, Dan Lim, Jaesam Yoon
- Abstract summary: We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
The MelGAN-based structure is trained with a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
- Score: 1.8047694351309207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech
in multiple domains. To preserve sound quality when the MelGAN-based structure
is trained with a dataset of hundreds of speakers, we added multi-resolution
spectrogram discriminators to sharpen the spectral resolution of the generated
waveforms. This enables the model to generate realistic waveforms for multiple
speakers by alleviating the over-smoothing problem in the high-frequency band
of the large-footprint model. Our structure generates signals close to
ground-truth data without reducing the inference speed, by discriminating the
waveform and spectrogram during training. The model achieved the best mean
opinion score (MOS) in most scenarios using ground-truth mel-spectrogram as an
input. In particular, it showed superior performance in unseen domains with
regard to speaker, emotion, and language. Moreover, in a multi-speaker
text-to-speech scenario using mel-spectrograms generated by a Transformer
model, it synthesized high-fidelity speech with a MOS of 4.22. These results,
achieved without external
domain information, highlight the potential of the proposed model as a
universal vocoder.
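As a rough illustration of the multi-resolution idea, the sketch below computes magnitude spectrograms of a single waveform at three STFT resolutions; during training, each resolution would feed its own spectrogram discriminator. The FFT sizes, hop lengths, and sample rate here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram via a Hann-windowed STFT (framing + real FFT)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # shape: (n_frames, n_fft // 2 + 1)

def multi_resolution_spectrograms(wave,
                                  resolutions=((512, 128), (1024, 256), (2048, 512))):
    """One magnitude spectrogram per (n_fft, hop) pair -- each would be
    the input to a separate spectrogram discriminator during training."""
    return [stft_magnitude(wave, n_fft, hop) for n_fft, hop in resolutions]

wave = np.random.randn(22050)  # 1 s of stand-in audio at an assumed 22.05 kHz
specs = multi_resolution_spectrograms(wave)
```

Time resolution (small hop) and frequency resolution (large FFT) trade off against each other, which is why discriminating at several resolutions at once helps suppress over-smoothing that a single spectral view would miss.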
Related papers
- VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders [14.222389985736422]
VNet is a GAN-based neural vocoder network that incorporates full-band spectral information.
We demonstrate that the VNet model is capable of generating high-fidelity speech.
arXiv Detail & Related papers (2024-08-13T14:00:02Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and an anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
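For context on the periodic nonlinearity mentioned above: BigVGAN builds on the Snake activation, x + (1/α)·sin²(αx), whose sin² term adds a periodic component on top of the identity. A minimal sketch with α fixed at 1.0 for illustration (in BigVGAN, α is a learned per-channel parameter):

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).
    The periodic sin^2 term supplies the inductive bias that helps a
    GAN generator model quasi-periodic raw waveforms."""
    return x + np.sin(alpha * x) ** 2 / alpha

x = np.linspace(-2 * np.pi, 2 * np.pi, 9)
y = snake(x)  # elementwise; preserves the input shape
```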
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is a singing voice synthesis (SVS) system aimed at high-fidelity singing voices.
It consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder.
Experimental results show that HiFiSinger synthesizes singing voices of substantially higher quality.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
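The encoder-decoder-with-skip-connections layout can be caricatured in a few lines. The pooling "encoder" and repeat "decoder" below are toy stand-ins for the paper's learned convolutional layers; the point of the skip connection is that the model only has to predict a residual on top of the raw input:

```python
import numpy as np

def encode(x):
    # toy "encoder": average-pool by 2 (halves the time resolution)
    return x.reshape(-1, 2).mean(axis=1)

def decode(z):
    # toy "decoder": nearest-neighbour upsample by 2 (restores the length)
    return np.repeat(z, 2)

def enhance(x):
    # skip connection: the decoder output is added to the raw waveform,
    # so the encoder-decoder path models only the residual correction
    z = encode(x)
    return x + decode(z)
```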
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.