VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested
Adversarial Network
- URL: http://arxiv.org/abs/2007.15256v1
- Date: Thu, 30 Jul 2020 06:33:53 GMT
- Authors: Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoonyoung Cho, Injung Kim
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel high-fidelity real-time neural vocoder called VocGAN. A
recently developed GAN-based vocoder, MelGAN, produces speech waveforms in
real-time. However, it often produces a waveform that is insufficient in
quality or inconsistent with the acoustic characteristics of the input mel
spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves
the quality and consistency of the output waveform. VocGAN applies a
multi-scale waveform generator and a hierarchically-nested discriminator to
learn multiple levels of acoustic properties in a balanced way. It also applies
a joint conditional and unconditional objective, which has shown successful
results in high-resolution image synthesis. In experiments, VocGAN synthesizes
speech waveforms 416.7x faster than real-time on a GTX 1080Ti GPU and 3.24x
faster than real-time on a CPU. Compared with MelGAN, it also exhibits significantly improved
quality in multiple evaluation metrics including mean opinion score (MOS) with
minimal additional overhead. Additionally, compared with Parallel WaveGAN,
another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a
CPU and exhibits higher MOS.
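The joint conditional and unconditional objective applied over the hierarchically-nested discriminator can be sketched as follows. This is a minimal, framework-free illustration under assumptions, not the authors' implementation: the least-squares GAN formulation, the unweighted sum over scales, and all function names are illustrative choices; real systems compute these losses over waveform tensors, not scalar scores.

```python
# Sketch: joint conditional + unconditional least-squares GAN losses,
# summed over the multiple resolutions of a hierarchically-nested
# discriminator. Scalar scores stand in for per-sample tensor outputs.

def lsgan_d_loss(real_score: float, fake_score: float) -> float:
    """Least-squares discriminator loss: push real -> 1, fake -> 0."""
    return (real_score - 1.0) ** 2 + fake_score ** 2

def lsgan_g_loss(fake_score: float) -> float:
    """Least-squares generator loss: fool the discriminator (fake -> 1)."""
    return (fake_score - 1.0) ** 2

def joint_multiscale_d_loss(uncond_scores, cond_scores) -> float:
    """Sum unconditional and conditional discriminator losses over scales.

    Each element is a (real_score, fake_score) pair produced by one
    resolution level of the nested discriminator. The unconditional term
    judges raw waveform realism; the conditional term judges consistency
    with the input mel spectrogram.
    """
    loss = 0.0
    for (ru, fu), (rc, fc) in zip(uncond_scores, cond_scores):
        loss += lsgan_d_loss(ru, fu)  # unconditional term
        loss += lsgan_d_loss(rc, fc)  # conditional term
    return loss

# Toy example: two discriminator scales with fairly confident scores.
uncond = [(0.9, 0.1), (0.8, 0.2)]
cond = [(0.95, 0.05), (0.85, 0.15)]
print(joint_multiscale_d_loss(uncond, cond))
```

Training alternates this discriminator loss with the corresponding generator loss (`lsgan_g_loss` applied to the same conditional and unconditional fake scores), so every scale contributes gradient signal and no single level of acoustic detail dominates.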
Related papers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity
The Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet, at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
Its MelGAN-based structure is trained with a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
We propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
A subjective human evaluation on a single-speaker dataset indicates that the proposed method achieves quality comparable to human speech.
A small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.
arXiv Detail & Related papers (2020-10-12T12:33:43Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
HiFiSinger is a singing voice synthesis (SVS) system targeting high-fidelity singing voices.
It consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder.
Experiment results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality than previous systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- Real Time Speech Enhancement in the Waveform Domain
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.