FBWave: Efficient and Scalable Neural Vocoders for Streaming
Text-To-Speech on the Edge
- URL: http://arxiv.org/abs/2011.12985v1
- Date: Wed, 25 Nov 2020 19:09:49 GMT
- Title: FBWave: Efficient and Scalable Neural Vocoders for Streaming
Text-To-Speech on the Edge
- Authors: Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter
Vajda
- Abstract summary: We propose FBWave, a family of efficient and scalable neural vocoders.
FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models.
Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x.
- Score: 49.85380252780985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays more and more applications can benefit from edge-based
text-to-speech (TTS). However, most existing TTS models are too computationally
expensive and are not flexible enough to be deployed on the diverse variety of
edge devices with their equally diverse computational capacities. To address
this, we propose FBWave, a family of efficient and scalable neural vocoders
that can achieve optimal performance-efficiency trade-offs for different edge
devices. FBWave is a hybrid flow-based generative model that combines the
advantages of autoregressive and non-autoregressive models. It produces high
quality audio and supports streaming during inference while remaining highly
computationally efficient. Our experiments show that FBWave can achieve similar
audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of
FBWave can achieve up to 109x fewer MACs while still delivering acceptable
audio quality. Audio demos are available at
https://bichenwu09.github.io/vocoder_demos.
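As a rough illustration of the hybrid design the abstract describes, the sketch below pairs a per-frame autoregressive conditioning step with a parallel, invertible (flow-style) transform that turns noise into audio samples, which is what keeps inference streamable. Everything here (ToyConditioner, the frame size, the single affine transform) is an assumption for illustration, not FBWave's actual architecture.

```python
# Toy sketch of streaming, flow-style vocoder inference (illustrative only).
# Assumed design: a small autoregressive conditioning network advances once
# per mel frame, while the audio samples inside each frame are produced in
# parallel by an invertible transform of noise.
import numpy as np

FRAME = 256     # audio samples generated per streaming step (assumed)
MEL_DIM = 80    # mel-spectrogram channels (typical value, assumed)

class ToyConditioner:
    """Recurrent state carried across frames (stand-in for a GRU)."""
    def __init__(self, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_mel = rng.standard_normal((MEL_DIM, dim)) * 0.01
        self.w_h = rng.standard_normal((dim, dim)) * 0.01
        self.h = np.zeros(dim)

    def step(self, mel_frame):
        # Update the state from the new mel frame only: no future context
        # is needed, which is what makes the model streamable.
        self.h = np.tanh(mel_frame @ self.w_mel + self.h @ self.w_h)
        return self.h

def toy_flow(noise, cond):
    """One invertible affine transform of noise, driven by the condition.
    A real model would stack many coupling layers; one suffices here."""
    scale = np.exp(np.tanh(cond.mean()))      # strictly positive scale
    shift = float(np.tanh(cond.sum())) * 0.1
    return noise * scale + shift              # all FRAME samples in parallel

def stream_vocode(mel_frames, seed=1):
    rng = np.random.default_rng(seed)
    cond_net = ToyConditioner()
    for mel in mel_frames:                    # one frame per step -> streaming
        cond = cond_net.step(mel)
        z = rng.standard_normal(FRAME)        # latent noise for this frame
        yield toy_flow(z, cond)               # FRAME audio samples

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    mels = [rng.standard_normal(MEL_DIM) for _ in range(4)]
    audio = np.concatenate(list(stream_vocode(mels)))
    print(audio.shape)  # (1024,)
```

A real vocoder would stack many trained coupling layers behind a learned conditioning network; the sketch only shows the per-frame control flow that allows streaming.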
Related papers
- StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms.
It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness.
It even achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real-time is proposed.
arXiv Detail & Related papers (2023-05-23T10:28:41Z) - WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on
Fixed-Point Iteration [47.07494621683752]
This study proposes a fast and high-quality neural vocoder called WaveFit.
WaveFit integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration.
Subjective listening tests showed no statistically significant differences in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.
arXiv Detail & Related papers (2022-10-03T15:45:05Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech
Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method for turning non-causal models into streaming models.
This makes any convolutional model compatible with real-time buffer-based processing.
We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for
Speech Synthesis [2.4975981795360847]
Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals.
NanoFlow is a state-of-the-art autoregressive neural vocoder with an extremely small parameter count.
We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
arXiv Detail & Related papers (2021-09-27T06:52:55Z) - DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z) - WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines it via a gradient-based sampler conditioned on the mel-spectrogram (see the sketch after this list).
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
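As referenced in the WaveGrad entry above, the following is a toy sketch of that style of iterative refinement: start from Gaussian white noise and repeatedly apply a conditioned update for a small, fixed number of steps. The toy_denoiser, the six-step linear schedule, and the update rule are placeholder assumptions, not the paper's trained network or published sampler.

```python
# Toy sketch of WaveGrad-style iterative refinement (illustrative only).
import numpy as np

def toy_denoiser(y, mel, noise_level):
    # Placeholder for a learned network eps(y, mel, noise_level): it treats an
    # upsampled copy of the mel condition as the "clean" target and returns
    # the scaled residual, which stands in for the predicted noise.
    target = np.interp(np.linspace(0, 1, len(y)),
                       np.linspace(0, 1, len(mel)), mel)
    return (y - target) / max(float(noise_level), 1e-5)

def refine_from_noise(mel, length=1024, n_iters=6, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(length)                  # start from white noise
    noise_levels = np.linspace(1.0, 0.05, n_iters)   # assumed coarse schedule
    for i, s in enumerate(noise_levels):
        eps = toy_denoiser(y, mel, s)                # predicted noise / score
        y = y - s * eps                              # refine toward the data
        if i < n_iters - 1:
            y = y + 0.1 * s * rng.standard_normal(length)  # re-inject noise
    return y

if __name__ == "__main__":
    mel = np.sin(np.linspace(0, 8 * np.pi, 80))      # fake 80-bin condition
    audio = refine_from_noise(mel)
    print(audio.shape, round(float(audio.std()), 3))
```

The loop mirrors the described procedure only in shape (noise in, a few conditioned refinement steps, audio out); the actual sampler in the paper is derived from a trained score network and a tuned noise schedule.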
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.