Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural
Vocoder
- URL: http://arxiv.org/abs/2210.15533v2
- Date: Mon, 31 Oct 2022 02:58:35 GMT
- Title: Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural
Vocoder
- Authors: Reo Yoneyama, Yi-Chiao Wu, and Tomoki Toda
- Abstract summary: We introduce the source-filter theory into HiFi-GAN to achieve high voice quality and pitch controllability.
Our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in both voice quality and synthesis speed on a single CPU.
Unlike the uSFGAN vocoder, the proposed method can be easily adopted in real-time applications and integrated into end-to-end systems.
- Score: 29.219277429553788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced
a novel architecture based on the source-filter theory into the parallel
waveform generative adversarial network to achieve high voice quality and pitch
controllability. However, the high temporal resolution inputs result in high
computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity
voice generation thanks to the efficient upsampling-based generator
architecture, the pitch controllability is severely limited. To realize a fast
and pitch-controllable high-fidelity neural vocoder, we introduce the
source-filter theory into HiFi-GAN by hierarchically conditioning the resonance
filtering network on well-estimated source excitation information. According
to the experimental results, our proposed method outperforms HiFi-GAN and
uSFGAN on singing voice generation in both voice quality and synthesis speed on a
single CPU. Furthermore, unlike the uSFGAN vocoder, the proposed method can be
easily adopted in real-time applications and integrated into end-to-end systems.
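The source-filter factorization described above is what makes pitch directly controllable: the source excitation carries the fundamental frequency (f0), while the filter shapes timbre independently, so changing the excitation's f0 shifts the output pitch without retraining. A minimal illustrative sketch in plain Python (not the paper's networks; the impulse-train source and two-pole resonator below are stand-ins for the learned excitation and resonance filtering modules):

```python
import math

def pulse_train(f0, sr, n):
    """Impulse-train source excitation at fundamental frequency f0 (Hz)."""
    period = int(sr / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, freq, sr, r=0.97):
    """Two-pole resonance filter centred on `freq` (Hz); a toy stand-in
    for the resonance filtering network conditioned on the excitation."""
    theta = 2 * math.pi * freq / sr
    b1, b2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + b1 * y1 + b2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

# Shifting the excitation's f0 shifts the output pitch directly,
# which is the controllability the source-filter factorization buys.
src = pulse_train(220.0, 16000, 1600)
wav = resonator(src, 880.0, 16000)
```

In the actual model the excitation is estimated by a network and the filter is a learned upsampling generator; the factorization itself, not these toy components, is the point.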
Related papers
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise
Filter and Inverse Short Time Fourier Transform [21.896817015593122]
We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
arXiv Detail & Related papers (2023-09-18T05:30:15Z)
- Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with
Very Low Computational Complexity [23.49462995118466]
Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z)
- WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis [4.689359813220365]
We propose an effective and lightweight neural vocoder called WOLONet.
In this paper, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights.
The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than two state-of-the-art neural vocoders, HiFi-GAN and UnivNet.
arXiv Detail & Related papers (2022-06-20T17:58:52Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation
Generation [32.839539624717546]
This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism.
The modified uSFGAN significantly improves the sound quality of the basic uSFGAN while maintaining the voice controllability.
arXiv Detail & Related papers (2022-05-12T12:41:15Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Frequency-bin entanglement from domain-engineered down-conversion [101.18253437732933]
We present a single-pass source of discrete frequency-bin entanglement which does not use filtering or a resonant cavity.
We use a domain-engineered nonlinear crystal to generate an eight-mode frequency-bin entangled source at telecommunication wavelengths.
arXiv Detail & Related papers (2022-01-18T19:00:29Z)
- Unified Source-Filter GAN: Unified Source-filter Network Based On
Factorization of Quasi-Periodic Parallel WaveGAN [36.12470085926042]
We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder.
Our proposed network called unified source-filter generative adversarial networks (uSFGAN) is developed by factorizing quasi-periodic parallel WaveGAN.
Experiments demonstrate that uSFGAN outperforms conventional neural vocoders, such as QPPWG and NSF in both speech quality and pitch controllability.
arXiv Detail & Related papers (2021-04-10T02:38:26Z)
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is a singing voice synthesis (SVS) system aimed at high-fidelity singing voices.
It consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder.
Experimental results show that HiFiSinger synthesizes singing voices of much higher quality and fidelity.
arXiv Detail & Related papers (2020-09-03T16:31:02Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
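The "periodic nonlinearities" mentioned in the BigVGAN entry above refer to the Snake activation, f(x) = x + sin²(ax)/a; BigVGAN applies it channel-wise with learned a, but the scalar form is enough to see the idea. A minimal sketch:

```python
import math

def snake(x, a=1.0):
    """Snake activation: x + sin^2(a*x)/a. The identity term preserves
    gradient flow while the bounded periodic term biases the network
    toward periodic (waveform-like) outputs."""
    return x + math.sin(a * x) ** 2 / a

# Its derivative is 1 + sin(2*a*x), which lies in [0, 2], so the
# function is non-decreasing, and sin^2 >= 0 gives snake(x) >= x.
samples = [snake(k / 4.0) for k in range(-8, 9)]
```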
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.