Avocodo: Generative Adversarial Network for Artifact-free Vocoder
- URL: http://arxiv.org/abs/2206.13404v2
- Date: Tue, 28 Jun 2022 04:33:51 GMT
- Title: Avocodo: Generative Adversarial Network for Artifact-free Vocoder
- Authors: Taejun Bak, Junmo Lee, Hanbin Bae, Jinhyeok Yang, Jae-Sung Bae,
Young-Sun Joo
- Abstract summary: We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
- Score: 5.956832212419584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural vocoders based on generative adversarial networks (GANs) have
been widely used because they generate high-quality speech waveforms with fast
inference and lightweight networks. Since the perceptually
important speech components are primarily concentrated in the low-frequency
band, most GAN-based neural vocoders perform multi-scale analysis that
evaluates downsampled speech waveforms. This multi-scale analysis helps the
generator improve speech intelligibility. However, in preliminary experiments,
we observed that multi-scale analysis focusing on the low-frequency band causes
unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the
quality of the synthesized speech waveform. Therefore, in
this paper, we investigate the relationship between these artifacts and
GAN-based neural vocoders and propose a GAN-based neural vocoder, called
Avocodo, that allows the synthesis of high-fidelity speech with reduced
artifacts. We introduce two kinds of discriminators that evaluate waveforms
from multiple perspectives: a collaborative multi-band discriminator and a
sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank
(PQMF) to obtain
downsampled multi-band waveforms while avoiding aliasing. The experimental
results show that Avocodo outperforms conventional GAN-based neural vocoders in
both speech and singing voice synthesis tasks and can synthesize artifact-free
speech. In particular, Avocodo can even reproduce high-quality waveforms of
unseen speakers.
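The PQMF analysis mentioned in the abstract splits a waveform into critically sampled subbands with a cosine-modulated filter bank, so that, unlike naive decimation, little aliased energy folds back into each band. Below is a minimal numpy/scipy sketch of such an analysis bank; the tap count, cutoff, and Kaiser beta are illustrative defaults, not necessarily Avocodo's configuration.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def pqmf_analysis(x, subbands=4, taps=62, cutoff=0.15, beta=9.0):
    """Split waveform x into `subbands` critically sampled subband signals
    using a cosine-modulated pseudo-QMF bank (near-perfect reconstruction).
    Filter parameters here are illustrative, not the paper's exact values."""
    # Prototype low-pass filter (Kaiser-window design).
    proto = firwin(taps + 1, cutoff, window=("kaiser", beta))
    n = np.arange(taps + 1)
    # Cosine modulation shifts the prototype to each subband's centre frequency.
    filters = np.stack([
        2 * proto * np.cos(
            (2 * k + 1) * np.pi / (2 * subbands) * (n - taps / 2)
            + (-1) ** k * np.pi / 4
        )
        for k in range(subbands)
    ])
    # Filter first, then decimate by the number of subbands (critical sampling);
    # each filter's narrow passband is what keeps aliasing low after decimation.
    return np.stack([lfilter(h, [1.0], x)[::subbands] for h in filters])

# Example: 1 s of audio at 22.05 kHz -> 4 subbands of ~5513 samples each.
x = np.random.randn(22050)
print(pqmf_analysis(x).shape)  # (4, 5513)
```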
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
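The "periodic nonlinearities" refer to the Snake activation, f(x) = x + sin²(αx)/α, which biases the generator toward oscillatory, waveform-like outputs. A minimal sketch follows; in the actual model α is a learnable per-channel parameter, and the activation is wrapped in low-pass filtered up/downsampling for anti-aliasing, which is omitted here.

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake periodic activation: f(x) = x + sin^2(alpha * x) / alpha.
    The sin^2 term adds a periodic component while keeping the identity path.
    In BigVGAN alpha is learnable per channel; fixed here for illustration."""
    return x + np.sin(alpha * x) ** 2 / alpha

# The anti-aliased variant applies this at a higher sample rate
# (upsample -> snake -> low-pass -> downsample); that step is omitted.
print(snake(np.linspace(-2.0, 2.0, 5)))
```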
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The shaping is performed in the time-frequency domain, keeping the computational cost almost the same as conventional DDPM-based neural vocoders.
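A minimal sketch of such time-frequency noise shaping: white noise is filtered per STFT bin so that its envelope follows a target. The `envelope` argument, of shape (nperseg // 2 + 1, frames), stands in for the filter SpecGrad derives from the conditioning log-mel spectrogram; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def shape_noise(noise, envelope, nperseg=512, noverlap=384):
    """Filter white noise in the time-frequency domain so its spectral
    envelope follows `envelope` (freq_bins x frames)."""
    _, _, z = stft(noise, nperseg=nperseg, noverlap=noverlap)
    frames = min(z.shape[1], envelope.shape[1])
    z[:, :frames] *= envelope[:, :frames]   # per-bin, per-frame gain
    _, shaped = istft(z, nperseg=nperseg, noverlap=noverlap)
    return shaped[: len(noise)]

noise = np.random.randn(22050)
env = np.ones((257, 200))             # flat envelope: noise stays white
print(shape_noise(noise, env).shape)  # (22050,)
```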
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating the features defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
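For context, the hand-crafted analysis that DeepA's learnable analyzer replaces can be sketched with a plain autocorrelation F0 estimator; this is a generic conventional-vocoder technique, not DeepA's method.

```python
import numpy as np

def f0_autocorr(frame, sr=22050, fmin=60.0, fmax=500.0):
    """Autocorrelation F0 estimate for one analysis frame -- the kind of
    fixed feature extraction a conventional vocoder performs and that a
    learned analyzer is expected to do more accurately."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# Example: a noisy 100 Hz square wave should estimate close to 100 Hz.
t = np.arange(2048) / 22050
frame = np.sign(np.sin(2 * np.pi * 100 * t)) + 0.01 * np.random.randn(2048)
print(round(f0_autocorr(frame)))  # ~100
```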
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Fre-GAN: Adversarial Frequency-consistent Audio Synthesis [39.69759686729388]
Fre-GAN achieves frequency-consistent audio synthesis with highly improved generation quality.
Fre-GAN achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared to ground-truth audio.
arXiv Detail & Related papers (2021-06-04T07:12:39Z)
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains [1.8047694351309207]
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
The MelGAN-based structure is trained on a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
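A minimal sketch of what a set of multi-resolution spectrogram discriminators would each look at: magnitude spectrograms of the same waveform at several FFT/hop sizes. Small windows resolve time, large windows resolve frequency; the resolutions below are illustrative, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import stft

def multi_resolution_specs(x, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Magnitude spectrograms at several (n_fft, hop) pairs; each would be
    scored by its own spectrogram discriminator."""
    specs = []
    for nfft, hop in resolutions:
        _, _, z = stft(x, nperseg=nfft, noverlap=nfft - hop)
        specs.append(np.abs(z))
    return specs

x = np.random.randn(22050)
print([s.shape[0] for s in multi_resolution_specs(x)])  # [257, 513, 1025]
```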
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
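A minimal sketch of temporal adaptive normalization: the hidden activation is normalized, then modulated by time-varying, feature-wise gamma/beta. In StyleMelGAN those come from convolutions over the acoustic features; here they are taken as given, and the function name is illustrative.

```python
import numpy as np

def tade_like(h, gamma, beta, eps=1e-5):
    """Normalize activation h (channels x time), then rescale and shift it
    with time-varying gamma/beta derived from the conditioning features."""
    mean = h.mean(axis=1, keepdims=True)
    std = h.std(axis=1, keepdims=True)
    return gamma * (h - mean) / (std + eps) + beta

h = np.random.randn(64, 1000)        # activation styled from a noise vector
gamma = np.random.randn(64, 1000)    # hypothetical conditioning-network outputs
beta = np.random.randn(64, 1000)
print(tade_like(h, gamma, beta).shape)  # (64, 1000)
```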
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
- WaveTransform: Crafting Adversarial Examples via Input Decomposition [69.01794414018603]
We introduce WaveTransform, which creates adversarial noise corresponding to low-frequency and high-frequency subbands, separately or in combination.
Experiments show that the proposed attack is effective against the defense algorithm and is also transferable across CNNs.
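WaveTransform attacks images with a 2-D discrete wavelet transform; the 1-D Haar sketch below shows the same decompose-perturb-reconstruct idea, with random noise standing in for the attack's optimized perturbation.

```python
import numpy as np

def haar_split(x):
    """One-level Haar wavelet split into low- and high-frequency subbands."""
    pairs = x[: len(x) // 2 * 2].reshape(-1, 2)
    return ((pairs[:, 0] + pairs[:, 1]) / np.sqrt(2),
            (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))

def haar_merge(lo, hi):
    """Inverse of haar_split."""
    out = np.empty(2 * len(lo))
    out[0::2] = (lo + hi) / np.sqrt(2)
    out[1::2] = (lo - hi) / np.sqrt(2)
    return out

# Perturb only the high-frequency subband; the real attack optimizes this
# noise against a target CNN rather than sampling it randomly.
x = np.random.randn(1024)
lo, hi = haar_split(x)
x_adv = haar_merge(lo, hi + 0.01 * np.random.randn(len(hi)))
print(np.max(np.abs(x_adv - x)))  # small, subband-localized perturbation
```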
arXiv Detail & Related papers (2020-10-29T17:16:59Z)