NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation
- URL: http://arxiv.org/abs/2203.02678v1
- Date: Sat, 5 Mar 2022 08:15:29 GMT
- Title: NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation
- Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen
- Abstract summary: We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single CPU core.
- Score: 67.96138567288197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional vocoders offer high synthesis efficiency, strong
interpretability, and speech editability, while neural vocoders offer high
synthesis quality. To combine the advantages of both, and inspired by the
traditional deterministic plus stochastic model, this paper proposes a novel
neural vocoder named NeuralDPS that retains high speech quality while
achieving high synthesis efficiency and noise controllability. First, the
framework contains four modules: a deterministic source module, a stochastic
source module, a neural V/UV decision module, and a neural filter module. The
only input the vocoder requires is the spectral parameter, which avoids errors
caused by estimating additional parameters such as F0. Second, because
different frequency bands may contain different proportions of deterministic
and stochastic components, a multiband excitation strategy is used to generate
a more accurate excitation signal and to reduce the neural filter's burden (a
minimal sketch of this idea follows the abstract). Third, a method for
controlling the noise components of speech is proposed, so that the
signal-to-noise ratio (SNR) of the speech can be adjusted easily (see the
second sketch below). Objective and subjective experimental results show that
the proposed NeuralDPS vocoder achieves performance comparable to WaveNet
while generating waveforms at least 280 times faster than the WaveNet vocoder;
its synthesis is also 28% faster than WaveGAN's on a single CPU core.
Experiments further verify that the method can effectively control the noise
components in the predicted speech and adjust its SNR. Examples of generated
speech can be found at https://hairuo55.github.io/NeuralDPS.
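
The multiband excitation idea can be made concrete with a short sketch. The
following mixes a deterministic pulse train with white noise band by band; the
band edges, per-band voicing weights, constant F0, and the function name are
illustrative assumptions, not the paper's learned modules.

```python
# Minimal sketch of multiband deterministic-plus-stochastic excitation.
# ASSUMPTIONS: fixed F0, hand-picked band edges and voicing weights; in
# NeuralDPS the sources and V/UV decisions are learned, not hard-coded.
import numpy as np

def multiband_excitation(f0_hz, voicing, sr=16000, n_samples=16000,
                         band_edges=(0, 1000, 2000, 4000, 8000)):
    """Mix a pulse train (deterministic) and white noise (stochastic)
    per frequency band; voicing[i] in [0, 1] weights band i toward the
    deterministic source."""
    # Deterministic source: impulse train at the fundamental period.
    deterministic = np.zeros(n_samples)
    deterministic[::int(round(sr / f0_hz))] = 1.0
    # Stochastic source: white Gaussian noise.
    stochastic = np.random.randn(n_samples)

    # Blend the two sources in the frequency domain, band by band.
    spec_d = np.fft.rfft(deterministic)
    spec_s = np.fft.rfft(stochastic)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sr)
    mixed = np.zeros_like(spec_d)
    for lo, hi, w in zip(band_edges[:-1], band_edges[1:], voicing):
        band = (freqs >= lo) & (freqs < hi)
        mixed[band] = w * spec_d[band] + (1.0 - w) * spec_s[band]
    return np.fft.irfft(mixed, n=n_samples)

# Lower bands mostly deterministic (voiced), upper bands mostly noise.
excitation = multiband_excitation(120.0, voicing=[0.9, 0.7, 0.3, 0.1])
```

In the paper this excitation then drives the neural filter module; the sketch
only illustrates why per-band weights give a more accurate excitation than a
single global V/UV switch.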
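The noise-control claim likewise reduces to simple arithmetic once speech is
split into a deterministic part d and a stochastic part s: scaling s by a gain
g sets SNR = 10*log10(P_d / (g^2 * P_s)). A hedged sketch, assuming the
decomposition is already available (producing it is the vocoder's job):

```python
# Sketch of SNR control over a deterministic-plus-stochastic decomposition.
# ASSUMPTION: d and s are already separated; NeuralDPS produces them with
# its source modules, whereas here they are toy signals.
import numpy as np

def remix_at_snr(deterministic, stochastic, target_snr_db):
    """Scale the stochastic part so 10*log10(P_d / (g**2 * P_s)) equals
    target_snr_db, then remix the two components."""
    p_d = np.mean(deterministic ** 2)
    p_s = np.mean(stochastic ** 2)
    # Solve for the linear gain g from the target SNR in dB.
    gain = np.sqrt(p_d / (p_s * 10.0 ** (target_snr_db / 10.0)))
    return deterministic + gain * stochastic

# Toy decomposition remixed at two different SNRs.
t = np.arange(16000) / 16000
d = np.sin(2 * np.pi * 120.0 * t)
s = np.random.randn(16000)
noisier = remix_at_snr(d, s, target_snr_db=10.0)
cleaner = remix_at_snr(d, s, target_snr_db=30.0)
```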
Related papers
- PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a
Diffusion Probabilistic Model [12.292092677396349]
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2024-02-22T16:47:15Z)
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration.
Subjective listening tests showed no statistically significant differences in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z)
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder [5.956832212419584]
We propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts.
Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech.
arXiv Detail & Related papers (2022-06-27T15:54:41Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is the task of increasing the speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis [2.4975981795360847]
Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals.
NanoFlow is a state-of-the-art autoregressive neural vocoder with a very small number of parameters.
We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
arXiv Detail & Related papers (2021-09-27T06:52:55Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)