PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a
Diffusion Probabilistic Model
- URL: http://arxiv.org/abs/2402.14692v1
- Date: Thu, 22 Feb 2024 16:47:15 GMT
- Title: PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a
Diffusion Probabilistic Model
- Authors: Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
- Abstract summary: This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
- Score: 12.292092677396349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a neural vocoder based on a denoising diffusion
probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary
conditioning signals. Recently, DDPM-based neural vocoders have gained
prominence as non-autoregressive models that can generate high-quality
waveforms. These vocoders have the advantage of being trainable with a
simple time-domain loss. In practical applications, such as singing
voice synthesis, there is a demand for neural vocoders to generate
high-fidelity speech waveforms with flexible pitch control. However,
conventional DDPM-based neural vocoders struggle to generate speech waveforms
under such conditions. Our proposed model aims to accurately capture the
periodic structure of speech waveforms by incorporating explicit periodic
signals. Experimental results show that our model improves sound quality and
provides better pitch control than conventional DDPM-based neural vocoders.
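The abstract does not specify how the explicit periodic signals are constructed, so the following is a minimal sketch in the spirit of sine-excitation vocoders: a frame-level F0 contour is upsampled to the sample level and turned into a sine wave that could be stacked with the mel-spectrogram as conditioning for the DDPM denoiser. The helper name `periodic_signal` and all parameter values are illustrative assumptions, not PeriodGrad's actual implementation.

```python
import numpy as np

def periodic_signal(f0_frames, hop_length=256, sample_rate=24000):
    """Build a sample-level sine wave from a frame-level F0 contour.

    Hypothetical sketch of an explicit periodic conditioning signal;
    unvoiced frames (f0 == 0) produce silence.
    """
    # Hold each frame's F0 value for hop_length samples.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_length)
    voiced = f0 > 0.0
    # Integrate instantaneous frequency to obtain phase, then take the sine.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    return np.where(voiced, np.sin(phase), 0.0)

# Example: 100 frames at a constant 220 Hz pitch.
cond = periodic_signal(np.full(100, 220.0))  # shape: (25600,)
```

Shifting the input F0 contour before building such a signal is what would give the vocoder explicit pitch control.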
Related papers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
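As a rough illustration of the band decomposition behind a multi-band diffusion framework, the sketch below splits a waveform into frequency bands with Butterworth filters; the band edges, filter order, and helper name `split_bands` are assumptions for illustration and do not reproduce the paper's actual filter design.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_bands(x, sample_rate=24000, edges=(2000, 6000, 10000)):
    """Split a waveform into frequency bands (hypothetical sketch)."""
    bands, lo = [], 0.0
    for hi in list(edges) + [sample_rate / 2 - 1]:
        if lo == 0.0:
            sos = butter(8, hi, btype="lowpass", fs=sample_rate, output="sos")
        else:
            sos = butter(8, [lo, hi], btype="bandpass", fs=sample_rate, output="sos")
        bands.append(sosfilt(sos, x))
        lo = hi
    return bands  # summing the bands roughly reconstructs x

bands = split_bands(np.random.randn(24000))  # stand-in for 1 s of audio
```

In such a framework, each band would be generated by its own diffusion process and the bands summed to form the final waveform.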
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration.
Subjective listening tests showed no statistically significant differences in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z)
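The following is a loose sketch of the fixed-point-style iteration described for WaveFit: the same denoising network is applied repeatedly to refine the current waveform estimate. The `denoiser` interface and the crude gain normalization are assumptions; WaveFit's actual update rule and its GAN-based training are not reproduced here.

```python
import torch

def wavefit_style_generate(denoiser, mel, y0, num_iterations=5):
    """Refine a waveform by repeatedly applying one denoiser (sketch)."""
    y = y0  # initial estimate, e.g., shaped noise
    for _ in range(num_iterations):
        y = y - denoiser(y, mel)  # subtract the estimated residual noise
        # Crude stand-in for WaveFit's gain adjustment step.
        y = y / (y.norm() + 1e-8) * y0.norm()
    return y

# Stand-in network and conditioning for demonstration.
denoiser = lambda y, mel: 0.1 * y
audio = wavefit_style_generate(denoiser, torch.zeros(80, 100), torch.randn(24000))
```

Unlike DDPM sampling, no fresh noise is injected between iterations, which is consistent with the small, fixed number of steps (five above) reported in the listening tests.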
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
This adaptation is performed in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
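A hedged sketch of the time-frequency noise shaping SpecGrad describes: white noise is filtered so its spectral envelope follows a target magnitude envelope, here assumed to be derived from the conditioning log-mel spectrogram. The `shape_noise` helper and the plain per-bin magnitude scaling are illustrative; SpecGrad's actual filter is designed more carefully.

```python
import torch

def shape_noise(noise, envelope, n_fft=1024, hop=256):
    """Scale the STFT of white noise by a target spectral envelope (sketch)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noise, n_fft, hop, window=window, return_complex=True)
    frames = min(spec.shape[-1], envelope.shape[-1])
    shaped = spec[..., :frames] * envelope[..., :frames]
    return torch.istft(shaped, n_fft, hop, window=window)

noise = torch.randn(24000)
# Stand-in for an envelope derived from the log-mel spectrogram
# (linear-frequency bins x frames).
envelope = torch.rand(1024 // 2 + 1, 24000 // 256 + 1)
shaped = shape_noise(noise, envelope)
```

Because the shaping is a per-bin multiplication on the STFT, its cost stays small relative to the denoising network, which matches the claim above about computational cost.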
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
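For context on the conventional-vocoder features DeepA emulates, the snippet below extracts F0, a spectral envelope (timbre), and aperiodicity with the WORLD analyzer via the pyworld package. Using WORLD here is an assumption, chosen only as a representative conventional vocoder whose fixed DSP analysis a learnable analyzer like DeepA would replace.

```python
import numpy as np
import pyworld as pw  # conventional WORLD vocoder analysis/synthesis

fs = 16000
x = np.random.randn(fs).astype(np.float64)  # stand-in for 1 s of speech

f0, t = pw.dio(x, fs)              # coarse F0 contour
f0 = pw.stonemask(x, f0, t, fs)    # refined F0
sp = pw.cheaptrick(x, f0, t, fs)   # spectral envelope (timbre)
ap = pw.d4c(x, f0, t, fs)          # aperiodicity
y = pw.synthesize(f0, sp, ap, fs)  # resynthesis from the features
```

DeepA's point is that a trainable analyzer producing analogous encodings can be more accurate for reconstruction and manipulation than this fixed pipeline.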
- FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis [2.4975981795360847]
Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals.
NanoFlow is a state-of-the-art autoregressive neural vocoder with a very small number of parameters.
We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
arXiv Detail & Related papers (2021-09-27T06:52:55Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
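A minimal sketch of the iterative refinement WaveGrad describes: sampling starts from Gaussian white noise, and a noise-prediction network conditioned on the mel-spectrogram is applied over a short schedule (six steps below, matching the iteration count reported above). The `eps_model` interface and the linear beta schedule are assumptions, not WaveGrad's actual configuration.

```python
import torch

def wavegrad_style_sample(eps_model, mel, betas, length=24000):
    """Generic DDPM ancestral sampling loop (sketch)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    y = torch.randn(length)  # start from Gaussian white noise
    for t in reversed(range(len(betas))):
        eps = eps_model(y, mel, t)  # predicted noise component
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        y = (y - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:  # inject fresh noise except at the final step
            y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
    return y

# Six-step schedule with a stand-in network.
betas = torch.linspace(1e-4, 0.5, 6)
eps_model = lambda y, mel, t: torch.zeros_like(y)
audio = wavegrad_style_sample(eps_model, torch.zeros(80, 100), betas)
```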
This list is automatically generated from the titles and abstracts of the papers on this site.