Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with
Very Low Computational Complexity
- URL: http://arxiv.org/abs/2212.04532v1
- Date: Thu, 8 Dec 2022 19:38:34 GMT
- Title: Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with
Very Low Computational Complexity
- Authors: Ahmed Mustafa, Jean-Marc Valin, Jan B\"uthe, Paris Smaragdis, Mike
Goodwin
- Abstract summary: Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
- Score: 23.49462995118466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GAN vocoders are currently one of the state-of-the-art methods for building
high-quality neural waveform generative models. However, most of their
architectures require dozens of billion floating-point operations per second
(GFLOPS) to generate speech waveforms in samplewise manner. This makes GAN
vocoders still challenging to run on normal CPUs without accelerators or
parallel computers. In this work, we propose a new architecture for GAN
vocoders that mainly depends on recurrent and fully-connected networks to
directly generate the time domain signal in framewise manner. This results in
considerable reduction of the computational cost and enables very fast
generation on both GPUs and low-complexity CPUs. Experimental results show that
our Framewise WaveGAN vocoder achieves significantly higher quality than
auto-regressive maximum-likelihood vocoders such as LPCNet at a very low
complexity of 1.2 GFLOPS. This makes GAN vocoders more practical on edge and
low-power devices.
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Real-Time Target Sound Extraction [13.526450617545537]
We present the first neural network model to achieve real-time and streaming target sound extraction.
We propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder.
arXiv Detail & Related papers (2022-11-04T03:51:23Z) - WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis [4.689359813220365]
We propose an effective and lightweight neural vocoder called WOLONet.
In this paper, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights.
The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two neural SOTA vocoders, HiFiGAN and UnivNet.
arXiv Detail & Related papers (2022-06-20T17:58:52Z) - BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN's synthesis efficiency on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid
Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than that of the na"ive MD model on GPU and CPU with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z) - A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z) - Multi-rate attention architecture for fast streamable Text-to-speech
spectrum modeling [5.080331097831114]
High quality text-to-speech (TTS) systems use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio.
While these models can produce high quality speech, they often incur OL$ in both latency and real-time factor (RTF) with respect to input length.
We propose a multi-rate architecture that breaks the latency bottlenecks by encoding a compact representation during streaming.
arXiv Detail & Related papers (2021-04-01T18:15:30Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - FastWave: Accelerating Autoregressive Convolutional Neural Networks on
FPGA [27.50143717931293]
WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution.
We develop the first accelerator platformtextitFastWave for autoregressive convolutional neural networks.
arXiv Detail & Related papers (2020-02-09T06:15:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.