SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping
- URL: http://arxiv.org/abs/2203.16749v1
- Date: Thu, 31 Mar 2022 02:08:27 GMT
- Title: SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping
- Authors: Yuma Koizumi and Heiga Zen and Kohei Yatabe and Nanxin Chen and
Michiel Bacchiani
- Abstract summary: SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The filtering is performed in the time-frequency domain, which keeps the computational cost almost the same as that of conventional DDPM-based neural vocoders.
- Score: 51.698273019061645
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Neural vocoders using denoising diffusion probabilistic models (DDPMs) have been
improved by adapting the diffusion noise distribution to given acoustic
features. In this study, we propose SpecGrad, which adapts the diffusion noise so
that its time-varying spectral envelope becomes close to the conditioning
log-mel spectrogram. This adaptation by time-varying filtering improves the
sound quality especially in the high-frequency bands. It is processed in the
time-frequency domain to keep the computational cost almost the same as the
conventional DDPM-based neural vocoders. Experimental results showed that
SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based
neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Audio demos are available at wavegrad.github.io/specgrad/.
Related papers
- PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a
Diffusion Probabilistic Model [12.292092677396349]
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2024-02-22T16:47:15Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and
Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- RefineGAN: Universally Generating Waveform Better than Ground Truth with
Highly Accurate Pitch and Intensity Responses [15.599745604729842]
We propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability.
We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process.
We show that the fidelity is even improved during the waveform reconstruction by eliminating defects produced by the speaker.
arXiv Detail & Related papers (2021-11-01T14:12:54Z)
- PriorGrad: Improving Conditional Denoising Diffusion Models with
Data-Driven Adaptive Prior [103.00403682863427]
We propose PriorGrad to improve the efficiency of the conditional diffusion model.
We show that PriorGrad achieves a faster convergence leading to data and parameter efficiency and improved quality.
arXiv Detail & Related papers (2021-06-11T14:04:03Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, a singing voice conversion (SVC) system based on a denoising diffusion probabilistic model.
DiffSVC trains a denoising module that takes the destroyed mel spectrogram and its corresponding step information as input and predicts the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- Audio Dequantization for High Fidelity Audio Generation in Flow-based
Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in real-time speech generation tasks.
We propose audio dequantization methods for flow-based neural vocoders to enable high-fidelity audio generation.
arXiv Detail & Related papers (2020-08-16T09:37:18Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.