WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis
- URL: http://arxiv.org/abs/2206.09920v1
- Date: Mon, 20 Jun 2022 17:58:52 GMT
- Title: WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis
- Authors: Yi Wang, Yi Si
- Abstract summary: We propose an effective and lightweight neural vocoder called WOLONet.
In this paper, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights.
The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two SOTA neural vocoders, HiFiGAN and UnivNet.
- Score: 4.689359813220365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, GAN-based neural vocoders such as Parallel WaveGAN, MelGAN,
HiFiGAN, and UnivNet have become popular due to their lightweight and parallel
structure, resulting in a real-time synthesized waveform with high fidelity,
even on a CPU. HiFiGAN and UnivNet are two SOTA vocoders. Despite their high
quality, there is still room for improvement. In this paper, motivated by the
structure of Vision Outlooker from computer vision, we adopt a similar idea and
propose an effective and lightweight neural vocoder called WOLONet. In this
network, we develop a novel lightweight block that uses a location-variable,
channel-independent, and depthwise dynamic convolutional kernel with
sinusoidally activated dynamic kernel weights. To demonstrate the effectiveness
and generalizability of our method, we perform an ablation study to verify our
novel design and make a subjective and objective comparison with typical
GAN-based vocoders. The results show that our WOLONet achieves the best
generation quality while requiring fewer parameters than the two SOTA neural
vocoders, HiFiGAN and UnivNet.
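The block described in the abstract can be illustrated with a minimal sketch. The function below is an illustrative NumPy toy, not the paper's implementation: it applies a different depthwise kernel at every output location (location-variable), shares that kernel across all channels (channel-independent), and passes the dynamic kernel weights through a sinusoidal activation before use. The network that would predict `kernel_weights` from conditioning features is omitted and assumed given.

```python
import numpy as np

def dynamic_depthwise_conv(x, kernel_weights):
    """Location-variable, channel-independent depthwise convolution sketch.

    x:              (channels, time) input feature map
    kernel_weights: (time, kernel_size) dynamic kernels, one per output
                    location, shared across channels (channel-independent);
                    they are sinusoidally activated before use, following
                    the WOLONet block description.
    """
    channels, time = x.shape
    _, k = kernel_weights.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))       # same-length output
    w = np.sin(kernel_weights)                 # sinusoidal kernel activation
    out = np.empty_like(x, dtype=float)
    for t in range(time):
        # each location t uses its own kernel w[t], applied depthwise
        out[:, t] = xp[:, t:t + k] @ w[t]
    return out
```

Because the kernel varies with `t`, this is not a standard convolution; a real implementation would vectorize the loop (e.g. with an unfold/im2col step) rather than iterate per location.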
Related papers
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform [21.896817015593122]
We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
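iSTFTNet-style vocoders such as HiFTNet replace the final upsampling layers with an inverse STFT over a predicted magnitude and phase spectrogram. The following is a minimal overlap-add sketch of that output stage, not HiFTNet's actual code: window normalization and the network that predicts the spectrogram are omitted.

```python
import numpy as np

def istft_overlap_add(magnitude, phase, hop_length, window):
    """Minimal inverse-STFT synthesis stage (iSTFTNet/HiFTNet idea).

    magnitude, phase: (n_freq_bins, n_frames) predicted spectrograms
    The complex spectrum is rebuilt from magnitude and phase, each frame
    is inverse-FFT'd, windowed, and overlap-added into the waveform.
    """
    spec = magnitude * np.exp(1j * phase)
    n_fft = (spec.shape[0] - 1) * 2            # one-sided spectrum length
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop_length * (n_frames - 1))
    for i in range(n_frames):
        frame = np.fft.irfft(spec[:, i], n=n_fft) * window
        out[i * hop_length : i * hop_length + n_fft] += frame
    return out
```

The appeal is efficiency: the expensive time-domain upsampling is replaced by a fixed, cheap transform, so the network only has to model the time-frequency representation.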
arXiv Detail & Related papers (2023-09-18T05:30:15Z)
- BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network [16.986061375767488]
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time.
Most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space.
We propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN.
arXiv Detail & Related papers (2023-09-06T08:48:03Z)
- Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity [23.49462995118466]
Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z)
- Spiking Neural Network Decision Feedback Equalization [70.3497683558609]
We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE).
We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels.
The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.
arXiv Detail & Related papers (2022-11-09T09:19:15Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
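The periodic nonlinearity BigVGAN introduces is the Snake activation, f(x) = x + sin²(αx)/α, which gives the generator a bias toward periodic, waveform-like functions. A one-line sketch:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake periodic nonlinearity: x + sin^2(alpha * x) / alpha.

    Monotonic like a standard activation for small inputs, but with a
    built-in periodic component whose frequency is set by alpha.
    """
    return x + np.sin(alpha * x) ** 2 / alpha
```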
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate on designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF)
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Variational Autoencoders: A Harmonic Perspective [79.49579654743341]
We study Variational Autoencoders (VAEs) from the perspective of harmonic analysis.
We show that the encoder variance of a VAE controls the frequency content of the functions parameterised by the VAE encoder and decoder neural networks.
arXiv Detail & Related papers (2021-05-31T10:39:25Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech.
The highly parallelizable speech generation is several times faster than real-time on CPUs and GPU.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
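Temporal adaptive normalization, as described for StyleMelGAN, amounts to feature-wise modulation: normalize the noise features, then scale and shift them with γ and β derived from the acoustic features. A simplified NumPy illustration follows; the conditioning network that would predict `gamma` and `beta` from a mel spectrogram is omitted and they are assumed given.

```python
import numpy as np

def temporal_adaptive_norm(noise, gamma, beta, eps=1e-5):
    """TADE-style modulation sketch.

    noise:        (channels, time) latent/noise feature map
    gamma, beta:  (channels, time) modulation parameters, in a real model
                  predicted from acoustic features of the target speech
    The noise is normalized per channel, then scaled and shifted so its
    statistics follow the conditioning at every time step.
    """
    mean = noise.mean(axis=1, keepdims=True)
    std = noise.std(axis=1, keepdims=True)
    normalized = (noise - mean) / (std + eps)
    return gamma * normalized + beta
```

Because γ and β vary over time, the same low-dimensional noise vector can be styled into different speech content, which is the mechanism the abstract refers to.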
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.