HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
- URL: http://arxiv.org/abs/2009.01776v1
- Date: Thu, 3 Sep 2020 16:31:02 GMT
- Title: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
- Authors: Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, Tie-Yan Liu
- Abstract summary: HiFiSinger is a singing voice synthesis (SVS) system for high-fidelity singing voices.
It consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder.
Experiment results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality than prior systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-fidelity singing voices usually require a higher sampling rate
(e.g., 48kHz) to convey expression and emotion. However, a higher sampling rate
widens the frequency band and lengthens the waveform sequences, posing
challenges for singing voice synthesis (SVS) in both the frequency and time
domains. Conventional SVS systems, which adopt a lower sampling rate, cannot
adequately address these challenges. In this paper, we develop HiFiSinger, an
SVS system for high-fidelity singing voices. HiFiSinger consists of a
FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder to ensure
fast training and inference as well as high voice quality. To tackle the
difficulty of singing modeling caused by the high sampling rate (a wider
frequency band and longer waveforms), we introduce multi-scale adversarial
training in both the acoustic model and the vocoder. Specifically: 1) To handle
the larger range of frequencies caused by the higher sampling rate, we propose
a novel sub-frequency GAN (SF-GAN) for mel-spectrogram generation, which splits
the full 80-dimensional mel-frequency range into multiple sub-bands and models
each sub-band with a separate discriminator. 2) To model the longer waveform
sequences caused by the higher sampling rate, we propose a multi-length GAN
(ML-GAN) for waveform generation that models waveform sequences of different
lengths with separate discriminators. 3) We also introduce several additional
designs and findings in HiFiSinger that are crucial for high-fidelity voices,
such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features,
choosing an appropriate window/hop size for the mel-spectrogram, and increasing
the receptive field of the vocoder for long-vowel modeling. Experiment results
show that HiFiSinger synthesizes high-fidelity singing voices with much higher
quality: a 0.32/0.44 MOS gain over the 48kHz/24kHz baselines and a 0.83 MOS
gain over previous SVS systems.
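The two multi-scale ideas in the abstract — splitting the 80-dimensional mel-spectrogram into frequency sub-bands (SF-GAN) and showing waveform crops of several lengths to separate discriminators (ML-GAN) — can be sketched as simple data-preparation steps. The band edges and crop lengths below are illustrative assumptions, not the paper's exact values, and the real discriminators are neural networks rather than these array slices.

```python
import numpy as np

def split_mel_subbands(mel, bands=((0, 40), (20, 60), (40, 80))):
    """SF-GAN-style split: slice an 80-dim mel-spectrogram of shape
    (n_mels, n_frames) into (possibly overlapping) frequency sub-bands,
    one per discriminator. Band edges here are hypothetical."""
    return [mel[lo:hi, :] for lo, hi in bands]

def crop_waveform_lengths(wav, lengths=(2400, 4800, 9600), rng=None):
    """ML-GAN-style crops: sample one random window per length from the
    waveform so each discriminator sees a different time scale. Lengths
    are in samples (roughly 50/100/200 ms at 48 kHz) and illustrative."""
    rng = rng or np.random.default_rng(0)
    crops = []
    for n in lengths:
        start = rng.integers(0, len(wav) - n + 1)
        crops.append(wav[start:start + n])
    return crops

# Toy arrays standing in for a generated mel-spectrogram and waveform.
mel = np.zeros((80, 200))      # 80 mel bins x 200 frames
wav = np.zeros(48000)          # 1 second at 48 kHz
subbands = split_mel_subbands(mel)
crops = crop_waveform_lengths(wav)
```

In training, each sub-band and each crop length would feed its own discriminator, so adversarial feedback covers the full widened frequency range and several temporal scales at once.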
Related papers
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representations into the generator, which bring the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
Fre-GAN achieves frequency-consistent audio synthesis with highly improved generation quality.
Fre-GAN achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared to ground-truth audio.
arXiv Detail & Related papers (2021-06-04T07:12:39Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
The MelGAN-based structure is trained on a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
We propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
A subjective human evaluation of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality.
A small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.
arXiv Detail & Related papers (2020-10-12T12:33:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.