Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
- URL: http://arxiv.org/abs/2106.02297v1
- Date: Fri, 4 Jun 2021 07:12:39 GMT
- Authors: Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee
- Abstract summary: Fre-GAN achieves frequency-consistent audio synthesis with highly improved generation quality, attaining high-fidelity waveform generation with a gap of only 0.03 MOS from ground-truth audio.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although recent work on neural vocoders has improved the quality of
synthesized audio, there still exists a gap between generated and ground-truth
audio in frequency space. This difference leads to spectral artifacts such as
hissing noise or robotic sound, and thus degrades the sample quality. In this
paper, we propose Fre-GAN which achieves frequency-consistent audio synthesis
with highly improved generation quality. Specifically, we first present
a resolution-connected generator and resolution-wise discriminators, which help
learn various scales of spectral distributions over multiple frequency bands.
Additionally, to reproduce high-frequency components accurately, we leverage
the discrete wavelet transform in the discriminators. In our experiments, Fre-GAN
achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared
to ground-truth audio while outperforming standard models in quality.
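As context for the discrete wavelet transform (DWT) used in the discriminators: unlike average pooling, a DWT splits a waveform into low- and high-frequency subbands without discarding information, so discriminators operating on the subbands can still see high-frequency content. Below is a minimal NumPy sketch of a one-level Haar DWT used this way; the function and the three-level loop are illustrative, not code from the paper.

```python
import numpy as np

def haar_dwt_1d(x: np.ndarray):
    """One-level Haar DWT: splits a waveform into low- and high-frequency
    subbands, each at half the original rate. Unlike average pooling,
    the high band is kept, so no high-frequency information is lost."""
    x = x[: len(x) // 2 * 2]          # truncate to even length
    even, odd = x[0::2], x[1::2]
    lo = (even + odd) / np.sqrt(2)    # approximation (low-pass) coefficients
    hi = (even - odd) / np.sqrt(2)    # detail (high-pass) coefficients
    return lo, hi

# Recursively split the low band, keeping every high band, the way a
# multi-resolution discriminator stack could consume the subbands.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)   # dummy 1 s of audio at 16 kHz
subbands, lo = [], signal
for _ in range(3):                    # 3 levels -> 4 subbands in total
    lo, hi = haar_dwt_1d(lo)
    subbands.append(hi)
subbands.append(lo)
print([len(b) for b in subbands])     # [8000, 4000, 2000, 2000]
```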
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
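The multi-band design above rests on decomposing audio into complementary frequency bands that sum back to the original signal. A toy NumPy sketch of such a decomposition via FFT masking follows; the band edges and the masking approach are illustrative stand-ins, and the actual system uses a proper filter bank.

```python
import numpy as np

def split_bands(x: np.ndarray, sr: int, edges_hz=(1000, 4000, 8000)):
    """Split a waveform into complementary frequency bands by masking the
    FFT. Toy stand-in for a real filter bank; the bands sum exactly back
    to the input, so the decomposition loses nothing."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    cuts = (0.0, *edges_hz, sr / 2 + 1)
    bands = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spec * mask, n=len(x)))
    return bands

rng = np.random.default_rng(0)
x = rng.standard_normal(24000)        # dummy 1 s of audio at 24 kHz
bands = split_bands(x, sr=24000)
print(np.allclose(sum(bands), x))     # True: bands reconstruct the input
```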
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representations into the generator, which bring the desired inductive bias for waveform synthesis (sketched below).
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
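The periodic nonlinearity BigVGAN refers to is the Snake activation, x + (1/a)·sin²(ax), which injects a periodic inductive bias suited to waveforms. A minimal PyTorch sketch follows; the anti-aliased up/down-sampling that BigVGAN pairs with it is omitted here.

```python
import torch

class Snake(torch.nn.Module):
    """Snake activation: x + (1/a) * sin^2(a * x), with `a` learned per
    channel. The sin^2 term gives the layer a periodic inductive bias,
    which suits quasi-periodic signals such as speech waveforms."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.ones(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.alpha.view(1, -1, 1)          # broadcast over (B, C, T)
        return x + torch.sin(a * x) ** 2 / (a + 1e-9)

x = torch.randn(2, 8, 100)                     # dummy feature map
print(Snake(8)(x).shape)                       # torch.Size([2, 8, 100])
```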
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The shaping is performed in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders (sketched below).
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
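One way to picture the adaptive noise shaping above: filter white noise in the STFT domain so its time-varying spectral envelope matches an envelope derived from the conditioning spectrogram. The NumPy/SciPy sketch below follows that reading; the envelope construction and function name are placeholders, not the paper's exact filter design.

```python
import numpy as np
from scipy.signal import stft, istft

def shape_noise(envelope: np.ndarray, n_samples: int, nperseg: int = 256):
    """Filter white noise in the STFT domain so its time-varying spectral
    envelope follows `envelope` (shape: nperseg//2+1 freq bins x frames).
    Illustrative only; the envelope is a placeholder for one derived from
    the conditioning log-mel spectrogram."""
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(n_samples)
    _, _, spec = stft(noise, nperseg=nperseg)      # complex (129, frames)
    frames = min(envelope.shape[1], spec.shape[1])
    spec[:, :frames] *= envelope[:, :frames]       # bin-wise shaping
    _, shaped = istft(spec, nperseg=nperseg)
    return shaped[:n_samples]

# Dummy envelope: boosts low frequencies and decays over time.
env = np.outer(np.linspace(2.0, 0.1, 129), np.linspace(1.0, 0.5, 200))
y = shape_noise(env, n_samples=16000)
print(y.shape)                                     # (16000,)
```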
- On the Frequency Bias of Generative Models
We analyze proposed measures against high-frequency artifacts in state-of-the-art GAN training.
We find that none of the existing approaches can fully resolve spectral artifacts yet.
Our results suggest that there is great potential in improving the discriminator.
arXiv Detail & Related papers (2021-11-03T18:12:11Z)
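A standard diagnostic for the spectral artifacts analyzed above is to compare the average log power spectrum of real and generated batches; a persistent gap near the Nyquist frequency indicates frequency bias. A small NumPy sketch follows; the batch shapes and the low-pass stand-in for a "generator" are illustrative, not the paper's setup.

```python
import numpy as np

def mean_log_spectrum(batch: np.ndarray) -> np.ndarray:
    """Average log power spectrum over a batch of 1-D signals. Plotting
    this curve for real vs. generated batches exposes systematic
    high-frequency artifacts as a gap near the Nyquist frequency."""
    power = np.abs(np.fft.rfft(batch, axis=-1)) ** 2
    return np.log(power.mean(axis=0) + 1e-12)

rng = np.random.default_rng(0)
real = rng.standard_normal((32, 4096))
# Stand-in "generator": low-passed noise, i.e. missing high frequencies.
kernel = np.ones(8) / 8
fake = np.stack([np.convolve(s, kernel, mode="same") for s in real])
gap = mean_log_spectrum(real) - mean_log_spectrum(fake)
print(gap[-5:])   # large positive values: fake lacks high-frequency energy
```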
- RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses
We propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability.
We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process (sketched below).
We show that fidelity even improves during waveform reconstruction, as defects produced by the speaker are eliminated.
arXiv Detail & Related papers (2021-11-01T14:12:54Z)
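A multi-scale spectrogram loss of the kind mentioned above typically sums spectrogram distances computed at several STFT resolutions, so errors at different time-frequency trade-offs are all penalized. The PyTorch sketch below is a generic form; the FFT sizes and the L1 distance are common choices, not necessarily RefineGAN's exact settings.

```python
import torch

def multi_scale_spec_loss(pred: torch.Tensor, target: torch.Tensor,
                          fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    """Sum of L1 distances between magnitude spectrograms computed at
    several STFT resolutions, so errors at different time-frequency
    trade-offs are all penalized."""
    loss = pred.new_zeros(())
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)
        s_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        s_tgt = torch.stft(target, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        loss = loss + (s_pred - s_tgt).abs().mean()
    return loss

pred = torch.randn(2, 16000)      # dummy generated waveforms
target = torch.randn(2, 16000)    # dummy reference waveforms
print(multi_scale_spec_loss(pred, target))
```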
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
HiFiSinger is a singing voice synthesis (SVS) system targeting high-fidelity singing voices.
It consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder.
Experiment results show that HiFiSinger synthesizes singing voices with much higher quality than baseline systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)