FlowVocoder: A Small-Footprint Neural Vocoder Based on Normalizing Flow for Speech Synthesis
- URL: http://arxiv.org/abs/2109.13675v1
- Date: Mon, 27 Sep 2021 06:52:55 GMT
- Title: FlowVocoder: A Small-Footprint Neural Vocoder Based on Normalizing Flow for Speech Synthesis
- Authors: Manh Luong and Viet Anh Tran
- Abstract summary: Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals.
NanoFlow is a state-of-the-art autoregressive neural vocoder with a very small number of parameters.
We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
- Score: 2.4975981795360847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, non-autoregressive neural vocoders have provided remarkable
performance in generating high-fidelity speech and have been able to produce
synthetic speech in real-time. However, non-autoregressive neural vocoders such
as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in
terms of modeling audio signals due to their limited expressiveness. In
addition, though NanoFlow is a state-of-the-art autoregressive neural vocoder
with a very small number of parameters, its performance is marginally lower
than that of WaveFlow. Therefore, in this paper, we propose a new type of
autoregressive neural vocoder called FlowVocoder, which has a small memory
footprint and is able to generate high-fidelity audio in real-time. Our
proposed model improves the expressiveness of its flow blocks by employing a
mixture of Cumulative Distribution Functions (CDFs) for the bipartite
transformation. Hence, the proposed model can model waveform signals as well
as WaveFlow can, while its memory footprint is much smaller than WaveFlow's.
Experiments show that FlowVocoder achieves results competitive with baseline
methods on both subjective and objective evaluations and that it is better
suited to real-time text-to-speech applications.
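
To make the mixture-CDF bipartite (coupling) transformation concrete, the PyTorch sketch below shows how a mixture-of-logistics CDF can serve as the elementwise monotonic map inside a coupling layer, in the spirit the abstract describes. This is a minimal illustration, not FlowVocoder's actual layer: the conditioning network param_net, the mixture size K, and the clamping constant eps are all assumptions, and a real flow layer would also return the log-determinant needed for maximum-likelihood training.

    import torch

    def mixture_logistic_cdf(x, logit_weights, means, log_scales):
        # Elementwise CDF of a K-component logistic mixture:
        #   F(x) = sum_k w_k * sigmoid((x - mu_k) / s_k),  w = softmax(logits)
        # x: (..., D); each parameter tensor: (..., D, K)
        w = torch.softmax(logit_weights, dim=-1)
        z = (x.unsqueeze(-1) - means) * torch.exp(-log_scales)
        return (w * torch.sigmoid(z)).sum(dim=-1)          # values in (0, 1)

    def coupling_forward(x_a, x_b, param_net, eps=1e-6):
        # Bipartite transform: x_a passes through unchanged and conditions the
        # monotonic mixture-CDF map applied elementwise to x_b. Monotonicity
        # keeps the layer invertible, like an affine coupling, but the mixture
        # is a far more flexible elementwise map.
        logit_w, mu, log_s = param_net(x_a)                # each (..., D, K)
        u = mixture_logistic_cdf(x_b, logit_w, mu, log_s)
        u = u.clamp(eps, 1.0 - eps)                        # keep logit finite
        y_b = torch.logit(u)                               # map (0, 1) back to R
        # Training would also accumulate log|det J| (the log mixture density
        # of x_b plus the derivative of the logit); omitted for brevity.
        return x_a, y_b

The design point is that any strictly increasing CDF keeps the coupling invertible, so replacing WaveGlow's single affine map with a learned mixture buys expressiveness without giving up tractable inversion.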
Related papers
- PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model [12.292092677396349]
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders; a generic DDPM sampling loop of the kind such vocoders use is sketched after this list.
arXiv Detail & Related papers (2024-02-22T16:47:15Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration; a schematic sketch of such a loop appears after this list.
Subjective listening tests showed no statistically significant differences in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech; a minimal sketch of this conditioning mechanism appears after this list.
The highly parallelizable speech generation is several times faster than real-time on both CPU and GPU.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
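
For the diffusion-based vocoders above (PeriodGrad, Multi-Band Diffusion, WaveGrad 2), the common sampling skeleton is the DDPM reverse process. The sketch below is a generic textbook loop, assuming a hypothetical denoiser(y, mel, t) network and a precomputed beta schedule; each paper's actual conditioning, schedule, and parameterization differ.

    import torch

    def ddpm_vocode(denoiser, mel, betas, wav_len):
        # Generic DDPM reverse process: start from Gaussian noise and denoise
        # step by step, conditioned on the mel spectrogram.
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        y = torch.randn(1, wav_len)
        for t in reversed(range(len(betas))):
            eps_hat = denoiser(y, mel, t)        # predicted noise at step t
            coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
            y = (y - coef * eps_hat) / torch.sqrt(alphas[t])
            if t > 0:                            # re-inject noise except at t = 0
                y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
        return y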
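
WaveFit's fixed-point iteration differs from that loop in one essential way: no fresh noise is injected between steps, so the same module can be applied a handful of times (five in the paper's listening test). The sketch below is schematic, not WaveFit's exact update; denoiser, the gain rule, and target_power are assumptions, and WaveFit itself derives the target power from the conditioning mel spectrogram.

    import torch

    def wavefit_refine(denoiser, mel, wav_len, num_iters=5, target_power=1.0):
        # Fixed-point refinement: repeatedly apply the same learned update to
        # the current waveform estimate. Without noise re-injection, each
        # iteration moves the estimate toward a fixed point (the clean signal).
        y = torch.randn(1, wav_len)
        for _ in range(num_iters):
            y = y - denoiser(y, mel)             # subtract the estimated error
            gain = torch.sqrt(target_power / y.pow(2).mean().clamp_min(1e-8))
            y = y * gain                         # gain-normalization step
        return y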
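
StyleMelGAN's temporal adaptive normalization can likewise be sketched as a SPADE-style conditional normalization in 1-D. The layer below is a minimal stand-in: the instance normalization and 3-tap convolutions are assumptions, and the actual TADE block differs in details such as how the acoustic features are upsampled and how activations are gated.

    import torch

    class TADELayer(torch.nn.Module):
        # Minimal temporal adaptive normalization: normalize the hidden signal,
        # then re-style it with a per-time-step scale and shift predicted from
        # the acoustic features.
        def __init__(self, channels, cond_channels):
            super().__init__()
            self.norm = torch.nn.InstanceNorm1d(channels, affine=False)
            self.to_scale = torch.nn.Conv1d(cond_channels, channels, 3, padding=1)
            self.to_shift = torch.nn.Conv1d(cond_channels, channels, 3, padding=1)

        def forward(self, x, cond):
            # x: (B, channels, T); cond: (B, cond_channels, T), acoustic
            # features already upsampled to the hidden signal's time axis.
            return self.norm(x) * self.to_scale(cond) + self.to_shift(cond)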