RefineGAN: Universally Generating Waveform Better than Ground Truth with
Highly Accurate Pitch and Intensity Responses
- URL: http://arxiv.org/abs/2111.00962v2
- Date: Tue, 2 Nov 2021 09:30:28 GMT
- Authors: Shengyuan Xu, Wenxiao Zhao, Jing Guo
- Abstract summary: We propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability.
We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process.
We show that fidelity is even improved during waveform reconstruction by eliminating defects introduced by the speaker and the recording procedure.
- Score: 15.599745604729842
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most GAN (Generative Adversarial Network)-based approaches to
high-fidelity waveform generation rely heavily on discriminators to improve
their performance. However, over-reliance on this GAN method introduces much
uncertainty into the generation process and often results in mismatches of
pitch and intensity, which is fatal in sensitive use cases such as singing
voice synthesis (SVS). To address this problem, we propose RefineGAN, a
high-fidelity neural vocoder with faster-than-real-time generation capability,
focused on robustness, pitch and intensity accuracy, and full-band audio
generation. We employ a pitch-guided refine architecture with a multi-scale
spectrogram-based loss function to help stabilize the training process and
maintain the robustness of the neural vocoder while using the GAN-based
training method. Audio generated with this method performs better in
subjective tests than the ground-truth audio. This result shows that fidelity
is even improved during waveform reconstruction by eliminating defects
introduced by the speaker and the recording procedure. Moreover, a further
study shows that models trained on one specific type of data can perform
equally well on a totally unseen language and unseen speakers. Generated
sample pairs are provided at https://timedomain-tech.github.io/refinegan/.
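For concreteness, the "multi-scale spectrogram-based loss" named above belongs to the multi-resolution STFT loss family. The sketch below is a generic PyTorch version of that family, not RefineGAN's exact formulation; the FFT sizes, hop lengths, and the spectral-convergence plus log-magnitude combination are illustrative assumptions.

```python
# Minimal sketch of a multi-resolution STFT loss, the general family the
# abstract's "multi-scale spectrogram-based loss" belongs to. Resolutions
# and loss terms are illustrative assumptions, not RefineGAN's exact loss.
import torch
import torch.nn.functional as F


def stft_magnitude(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)  # avoid log(0) below


def multi_resolution_stft_loss(pred, target,
                               resolutions=((512, 128), (1024, 256),
                                            (2048, 512))):
    """pred, target: (batch, samples) waveforms; returns a scalar loss."""
    loss = 0.0
    for n_fft, hop in resolutions:
        p = stft_magnitude(pred, n_fft, hop)
        t = stft_magnitude(target, n_fft, hop)
        # Spectral convergence: relative magnitude error per resolution.
        loss = loss + torch.norm(t - p) / torch.norm(t)
        # Log-magnitude L1: weights quiet regions more evenly.
        loss = loss + F.l1_loss(p.log(), t.log())
    return loss / len(resolutions)
```

Comparing magnitudes at several resolutions trades off time and frequency localization, so the generator cannot hide error from one analysis window inside another.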
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and an anti-aliased representation into the generator, which brings the desired inductive bias for waveform generation (see the snake-activation sketch after this list).
We train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise shaping is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip connections (a minimal sketch follows this list).
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
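As referenced in the BigVGAN entry above, its "periodic nonlinearities" are the snake activation, f(x) = x + sin^2(alpha * x) / alpha. Below is a minimal sketch with a learnable per-channel alpha; the log-space parameterization is one common choice and an assumption here, and BigVGAN's anti-aliased (low-pass filtered) resampling around the activation is omitted.

```python
# Minimal sketch of the "snake" periodic activation BigVGAN builds on:
# f(x) = x + sin^2(alpha * x) / alpha, with a learnable per-channel alpha.
import torch
import torch.nn as nn


class Snake(nn.Module):
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        # Stored in log space so alpha stays positive during training
        # (one common parameterization, assumed here).
        self.log_alpha = nn.Parameter(
            torch.full((1, channels, 1), float(alpha_init)).log())

    def forward(self, x):  # x: (batch, channels, time)
        alpha = self.log_alpha.exp()
        return x + torch.sin(alpha * x) ** 2 / (alpha + 1e-9)
```

The sin^2 term gives each channel a learned periodic component, biasing the generator toward oscillatory, waveform-like outputs, while the identity term preserves gradient flow.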
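The real-time enhancement entry describes a causal encoder-decoder over raw waveforms with skip connections; a minimal sketch of that architecture family is below. Depth, channel widths, kernel/stride, and the left-only padding used for causality are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a causal waveform encoder-decoder with skip
# connections. Assumes kernel > stride and input length divisible by
# stride**depth; all sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalEncDec(nn.Module):
    def __init__(self, depth=3, hidden=32, kernel=8, stride=4):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        ch_in = 1
        for i in range(depth):
            ch_out = hidden * (2 ** i)
            self.encoder.append(nn.Sequential(
                nn.Conv1d(ch_in, ch_out, kernel, stride), nn.ReLU()))
            # Mirror layer, inserted so the decoder runs deepest-first;
            # the final layer outputs the waveform without a ReLU.
            self.decoder.insert(0, nn.Sequential(
                nn.ConvTranspose1d(ch_out, ch_in, kernel, stride),
                nn.ReLU() if i > 0 else nn.Identity()))
            ch_in = ch_out
        self.bottleneck = nn.Conv1d(ch_in, ch_in, 1)

    def forward(self, x):  # x: (batch, 1, time)
        skips = []
        for enc in self.encoder:
            # Left-only padding: the model never sees future samples.
            x = F.pad(x, (self.kernel - self.stride, 0))
            x = enc(x)
            skips.append(x)
        x = self.bottleneck(x)
        for dec in self.decoder:
            x = x + skips.pop()  # skip connection from matching encoder
            x = dec(x)[..., : -(self.kernel - self.stride)]  # trim overhang
        return x
```

For example, `CausalEncDec()(torch.randn(1, 1, 4096))` returns a waveform of the same length; the left-only padding means each output sample depends only on current and past input samples, which is what makes streaming, real-time use possible.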
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.