VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested
Adversarial Network
- URL: http://arxiv.org/abs/2007.15256v1
- Date: Thu, 30 Jul 2020 06:33:53 GMT
- Authors: Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoonyoung Cho, Injung Kim
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel high-fidelity real-time neural vocoder called VocGAN. A
recently developed GAN-based vocoder, MelGAN, produces speech waveforms in
real-time. However, it often produces a waveform that is insufficient in
quality or inconsistent with the acoustic characteristics of the input mel
spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves
the quality and consistency of the output waveform. VocGAN applies a
multi-scale waveform generator and a hierarchically-nested discriminator to
learn multiple levels of acoustic properties in a balanced way. It also applies
a joint conditional and unconditional objective, which has shown successful
results in high-resolution image synthesis. In experiments, VocGAN synthesizes
speech waveforms 416.7x faster than real-time on a GTX 1080Ti GPU and 3.24x
faster than real-time on a CPU. Compared with MelGAN, it also exhibits significantly improved
quality in multiple evaluation metrics including mean opinion score (MOS) with
minimal additional overhead. Additionally, compared with Parallel WaveGAN,
another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a
CPU and exhibits higher MOS.
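The joint conditional and unconditional objective applied over the hierarchically-nested discriminator can be sketched as follows. This is a minimal, framework-free illustration under assumptions, not the authors' implementation: the least-squares GAN formulation, the unweighted sum over scales, and all function names are illustrative choices; real systems compute these losses over waveform tensors, not scalar scores.

```python
# Sketch: joint conditional + unconditional least-squares GAN losses,
# summed over the multiple resolutions of a hierarchically-nested
# discriminator. Scalar scores stand in for per-sample tensor outputs.

def lsgan_d_loss(real_score: float, fake_score: float) -> float:
    """Least-squares discriminator loss: push real -> 1, fake -> 0."""
    return (real_score - 1.0) ** 2 + fake_score ** 2

def lsgan_g_loss(fake_score: float) -> float:
    """Least-squares generator loss: fool the discriminator (fake -> 1)."""
    return (fake_score - 1.0) ** 2

def joint_multiscale_d_loss(uncond_scores, cond_scores) -> float:
    """Sum unconditional and conditional discriminator losses over scales.

    Each element is a (real_score, fake_score) pair produced by one
    resolution level of the nested discriminator. The unconditional term
    judges raw waveform realism; the conditional term judges consistency
    with the input mel spectrogram.
    """
    loss = 0.0
    for (ru, fu), (rc, fc) in zip(uncond_scores, cond_scores):
        loss += lsgan_d_loss(ru, fu)  # unconditional term
        loss += lsgan_d_loss(rc, fc)  # conditional term
    return loss

# Toy example: two discriminator scales with fairly confident scores.
uncond = [(0.9, 0.1), (0.8, 0.2)]
cond = [(0.95, 0.05), (0.85, 0.15)]
print(joint_multiscale_d_loss(uncond, cond))
```

Training alternates this discriminator loss with the corresponding generator loss (`lsgan_g_loss` applied to the same conditional and unconditional fake scores), so every scale contributes gradient signal and no single level of acoustic detail dominates.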
Related papers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity
The Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet, at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
Its MelGAN-based structure is trained with a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
We propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
A subjective human evaluation on a single-speaker dataset indicates that the proposed method achieves quality comparable to human speech.
A small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.
arXiv Detail & Related papers (2020-10-12T12:33:43Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
HiFiSinger is a singing voice synthesis (SVS) system targeting high-fidelity singing voices.
It consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder.
Experiment results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality than previous systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- Real Time Speech Enhancement in the Waveform Domain
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.