WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on
Fixed-Point Iteration
- URL: http://arxiv.org/abs/2210.01029v1
- Date: Mon, 3 Oct 2022 15:45:05 GMT
- Title: WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on
Fixed-Point Iteration
- Authors: Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani
- Abstract summary: This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration.
Subjective listening tests showed no statistically significant differences in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Denoising diffusion probabilistic models (DDPMs) and generative adversarial
networks (GANs) are popular generative models for neural vocoders. The DDPMs
and GANs can be characterized by the iterative denoising framework and
adversarial training, respectively. This study proposes a fast and high-quality
neural vocoder called \textit{WaveFit}, which integrates the essence of GANs
into a DDPM-like iterative framework based on fixed-point iteration. WaveFit
iteratively denoises an input signal, and trains a deep neural network (DNN)
for minimizing an adversarial loss calculated from intermediate outputs at all
iterations. Subjective (side-by-side) listening tests showed no statistically
significant differences in naturalness between natural human speech and speech
synthesized by WaveFit with five iterations. Furthermore, the inference speed
of WaveFit was more than 240 times faster than WaveRNN. Audio demos are
available at \url{google.github.io/df-conformer/wavefit/}.
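The abstract describes the core loop: the same DNN denoises the signal repeatedly, treating each pass as one step of a fixed-point iteration, and training penalizes a loss at every intermediate output. A minimal sketch of that structure, assuming a generic denoiser `f` and a simple RMS-based gain normalization — placeholders, not the paper's exact update rule or loss:

```python
import numpy as np

def wavefit_generate(f, y_init, cond, n_iter=5):
    """Sketch of WaveFit-style inference: starting from an initial noise
    signal, apply the denoiser f repeatedly as a fixed-point iteration.
    The gain step below is a stand-in for the paper's gain adjustment."""
    y = np.asarray(y_init, dtype=float)
    intermediates = []
    for _ in range(n_iter):
        z = y - f(y, cond)  # denoise: subtract the estimated noise
        # Keep signal power comparable across iterations (assumed form).
        gain = np.sqrt(np.mean(y ** 2)) / (np.sqrt(np.mean(z ** 2)) + 1e-8)
        z = z * gain
        intermediates.append(z)
        y = z
    return y, intermediates

def training_loss(per_step_loss, intermediates, target):
    """Training sums a loss over ALL intermediate outputs, as the
    abstract describes; per_step_loss is a hypothetical placeholder
    for the adversarial loss."""
    return sum(per_step_loss(z, target) for z in intermediates)
```

With a well-trained denoiser, each pass moves the signal closer to a fixed point, so later iterations change it less and less.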
Related papers
- PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model [12.292092677396349]
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2024-02-22T16:47:15Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder, NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It also synthesizes 28% faster than WaveGAN on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis [2.4975981795360847]
Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals.
NanoFlow is a state-of-the-art autoregressive neural vocoder with a very small number of parameters.
We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
arXiv Detail & Related papers (2021-09-27T06:52:55Z)
- Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain [37.722450363816144]
We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN.
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
arXiv Detail & Related papers (2020-11-08T22:04:00Z)
- DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
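The causal encoder-decoder with skip connections described in the last entry can be sketched in miniature. The single-tap filter weights and plain ReLU layers below are illustrative placeholders, not the model's actual learned architecture:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: left-pad so the output at time t depends
    only on inputs up to time t, as required for real-time operation."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.convolve(xp, w, mode="valid")

def enhance(x, enc_weights, dec_weights):
    """Toy encoder-decoder on the raw waveform: each encoder layer's
    output is stored and added back (skip connection) before the
    matching decoder layer. Real models learn multi-channel filters;
    these 1-D weights are hypothetical stand-ins."""
    skips = []
    h = x
    for w in enc_weights:
        h = np.maximum(causal_conv1d(h, w), 0.0)  # conv + ReLU
        skips.append(h)
    for w, s in zip(dec_weights, reversed(skips)):
        h = causal_conv1d(h + s, w)               # add skip, then decode
    return h
```

The left-padding is what makes the model causal (and hence usable in real time): no output sample ever looks at future input samples.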
This list is automatically generated from the titles and abstracts of the papers in this site.