High-Fidelity and Low-Latency Universal Neural Vocoder based on
Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform
Modeling
- URL: http://arxiv.org/abs/2105.09856v1
- Date: Thu, 20 May 2021 16:02:45 GMT
- Title: High-Fidelity and Low-Latency Universal Neural Vocoder based on
Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform
Modeling
- Authors: Patrick Lumban Tobing, Tomoki Toda
- Abstract summary: This paper presents a novel universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP).
Experiments demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or languages when trained on data from 300 speakers, including clean and noisy/reverberant conditions.
- Score: 38.828260316517536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel high-fidelity and low-latency universal neural
vocoder framework based on multiband WaveRNN with data-driven linear prediction
for discrete waveform modeling (MWDLP). MWDLP employs a coarse-fine bit WaveRNN
architecture for 10-bit mu-law waveform modeling. A sparse gated recurrent unit
with a relatively large size of hidden units is utilized, while the multiband
modeling is deployed to achieve real-time low-latency usage. A novel technique
for data-driven linear prediction (LP) with discrete waveform modeling is
proposed, where the LP coefficients are estimated in a data-driven manner.
Moreover, a novel loss function using short-time Fourier transform (STFT) for
discrete waveform modeling with Gumbel approximation is also proposed. The
experimental results demonstrate that the proposed MWDLP framework generates
high-fidelity synthetic speech for seen and unseen speakers and/or languages
when trained on data from 300 speakers covering clean and noisy/reverberant
conditions, with the number of training utterances limited to 60 per speaker,
while allowing for real-time low-latency processing using a single core of a
$\sim$2.1--2.7 GHz CPU at a real-time factor of $\sim$0.57--0.64 including
input/output and feature extraction.
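To make the coarse-fine decomposition concrete, here is a minimal numpy sketch of 10-bit mu-law coding with an assumed 5/5 coarse/fine bit split; the split and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of 10-bit mu-law coding with a coarse/fine bit split;
# the 5/5 split and all names are assumptions, not the authors' code.
BITS = 10
LEVELS = 2 ** BITS        # 1024 discrete mu-law levels
MU = LEVELS - 1

def mulaw_encode(x: np.ndarray) -> np.ndarray:
    """Map a waveform in [-1, 1] to integer mu-law codes in [0, LEVELS)."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1.0) / 2.0 * MU + 0.5).astype(np.int64)

def mulaw_decode(code: np.ndarray) -> np.ndarray:
    """Invert mu-law codes back to waveform samples in [-1, 1]."""
    y = 2.0 * code.astype(np.float64) / MU - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def split_coarse_fine(code: np.ndarray):
    """Split each 10-bit code into 5 coarse (high) and 5 fine (low) bits,
    so each sub-symbol has 32 classes instead of 1024."""
    half = BITS // 2
    return code >> half, code & ((1 << half) - 1)

def merge_coarse_fine(coarse: np.ndarray, fine: np.ndarray) -> np.ndarray:
    return (coarse << (BITS // 2)) | fine

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
code = mulaw_encode(x)
coarse, fine = split_coarse_fine(code)
assert np.array_equal(merge_coarse_fine(coarse, fine), code)
```

The data-driven LP and the STFT loss with Gumbel approximation can be sketched in the same hedged spirit: the network is assumed to emit per-sample LP coefficients and categorical logits over the 1024 mu-law levels, and a Gumbel-softmax relaxation makes the discrete output differentiable enough for a spectral loss. The LP order, STFT settings, and level-to-amplitude table below are assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of (1) data-driven linear prediction, where LP coefficients
# are estimated by the network rather than by classic LP analysis, and (2) an
# STFT loss made differentiable through the discrete output with a
# Gumbel-softmax relaxation. LP order, STFT settings, and the
# level-to-amplitude table are assumptions (a real system would use the
# inverse mu-law map for `levels`).
K = 8                                      # assumed LP order
levels = torch.linspace(-1.0, 1.0, 1024)   # stand-in level -> amplitude table

def lp_prediction(past: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """past: (B, K) most recent samples; coeffs: (B, K) network-estimated LP
    coefficients. The discrete model then only has to capture the residual."""
    return (past * coeffs).sum(dim=-1)

def gumbel_stft_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, 1024) scores over mu-law levels; target: (B, T) waveform.
    Relax the categorical choice, map soft one-hots to amplitudes, and compare
    log-magnitude STFTs."""
    soft = F.gumbel_softmax(logits, tau=1.0, hard=False)      # (B, T, 1024)
    wav = soft @ levels                                       # expected amplitude
    win = torch.hann_window(512)
    S = torch.stft(wav, 512, hop_length=128, window=win, return_complex=True)
    R = torch.stft(target, 512, hop_length=128, window=win, return_complex=True)
    return (torch.log(S.abs() + 1e-7) - torch.log(R.abs() + 1e-7)).abs().mean()

pred = lp_prediction(torch.randn(2, K), torch.randn(2, K))            # (2,)
loss = gumbel_stft_loss(torch.randn(2, 4000, 1024), torch.rand(2, 4000) * 2 - 1)
```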
Related papers
- MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with 0.1K Parameters [6.733646592789575]
Long-term Time Series Forecasting (LTSF) involves predicting long-term values by analyzing a large amount of historical time-series data to identify patterns and trends.
Transformer-based models offer high forecasting accuracy, but they are often too compute-intensive to be deployed on devices with hardware constraints.
We propose MixLinear, an ultra-lightweight time series forecasting model specifically designed for resource-constrained devices.
arXiv Detail & Related papers (2024-10-02T23:04:57Z)
- Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization [37.35829410807451]
This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model trained via adversarial flow matching optimization.
It only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics.
By scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance.
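As a rough, hypothetical illustration of the flow-matching objective underlying this approach (the adversarial term and PeriodWave specifics are omitted; all names and sizes here are assumptions):

```python
import torch

# Hypothetical sketch of a (non-adversarial) flow-matching objective; the
# adversarial term and PeriodWave specifics are omitted.
class TinyVelocityNet(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, dim))

    def forward(self, x, t):               # t is concatenated as a feature
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Draw noise x0 and a random time t, form the straight-line interpolant
    x_t, and regress the model's velocity onto the constant target x1 - x0."""
    x0 = torch.randn_like(x1)              # noise endpoint
    t = torch.rand(x1.shape[0], 1)         # one time per batch element
    xt = (1.0 - t) * x0 + t * x1           # linear interpolant
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

loss = flow_matching_loss(TinyVelocityNet(), torch.randn(8, 64))
```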
arXiv Detail & Related papers (2024-08-15T08:34:00Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
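A minimal sketch of such patch partitioning, with an assumed patch size:

```python
import numpy as np

# Minimal sketch of patch-based partitioning: the waveform is cut into
# fixed-size patches so later layers operate on T/P tokens rather than T
# samples. The patch size P is an assumption.
P = 256

def partition(x: np.ndarray) -> np.ndarray:
    """(T,) waveform -> (ceil(T/P), P) patches; the tail is zero-padded."""
    pad = (-len(x)) % P
    return np.pad(x, (0, pad)).reshape(-1, P)

def merge(patches: np.ndarray, length: int) -> np.ndarray:
    """Inverse of partition: flatten and drop the padding."""
    return patches.reshape(-1)[:length]

x = np.random.randn(16000).astype(np.float32)
assert np.allclose(merge(partition(x), len(x)), x)
```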
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
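A toy version of the single-transform idea, assuming a pointwise learned spectral filter (not the actual T1 architecture):

```python
import torch

# Toy version of "transform once": one rFFT in, learned pointwise complex
# weights in the frequency domain, one inverse transform out.
class FreqLinear(torch.nn.Module):
    def __init__(self, n: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(n // 2 + 1, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, n)
        X = torch.fft.rfft(x)                            # transform once
        return torch.fft.irfft(X * self.weight, n=x.shape[-1])

y = FreqLinear(1024)(torch.randn(4, 1024))               # (4, 1024)
```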
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
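A hedged sketch of a hash-based coordinate encoding of this flavor (table size, hash primes, and feature width are assumptions; trilinear blending across resolutions is omitted):

```python
import numpy as np

# Hedged sketch of a hash-based coordinate encoding: grid-vertex coordinates
# are hashed into a learned feature table. All constants are assumptions.
TABLE_SIZE, FEAT = 2 ** 14, 2
MASK = np.uint64(TABLE_SIZE - 1)
table = (np.random.randn(TABLE_SIZE, FEAT) * 1e-2).astype(np.float32)
P1, P2, P3 = np.uint64(1), np.uint64(2654435761), np.uint64(805459861)

def hash_encode(xyz: np.ndarray, resolution: int) -> np.ndarray:
    """xyz: (N, 3) points in [0, 1]^3 -> (N, FEAT) features of the containing
    grid vertex at the given resolution."""
    g = np.floor(xyz * resolution).astype(np.uint64)      # vertex indices
    idx = ((g[:, 0] * P1) ^ (g[:, 1] * P2) ^ (g[:, 2] * P3)) & MASK
    return table[idx]

feats = hash_encode(np.random.rand(8, 3), resolution=64)  # (8, 2)
```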
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS [1.8927791081850118]
This paper introduces R-MelNet, a two-part autoregressive architecture with a backend WaveRNN-style audio decoder.
The model produces low-resolution mel-spectral features which are used by a WaveRNN decoder to produce an audio waveform.
arXiv Detail & Related papers (2022-06-30T13:29:31Z)
- Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction [38.828260316517536]
This paper presents a low-latency real-time (LLRT) non-parallel voice conversion framework based on a cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP).
The proposed framework achieves high-performance VC while allowing for LLRT usage with a single core of a $2.1$--$2.7$ GHz CPU at a real-time factor of $0.87$--$0.95$, including input/output and feature extraction, with a frame shift of $10$ ms and a window length of $27.5$ ms.
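For reference, the quoted real-time factor is simply wall-clock processing time over audio duration; a minimal illustration, where `process` is a stand-in for the full pipeline:

```python
import time

# Real-time factor (RTF): wall-clock processing time divided by the duration
# of the audio processed; values below 1.0 keep up with real time. `process`
# stands in for the full VC pipeline including I/O and feature extraction.
def real_time_factor(process, audio_seconds: float) -> float:
    start = time.perf_counter()
    process()
    return (time.perf_counter() - start) / audio_seconds

rtf = real_time_factor(lambda: time.sleep(0.9), audio_seconds=1.0)  # ~0.9
```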
arXiv Detail & Related papers (2021-05-20T16:06:11Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
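A toy affine-coupling step of the kind normalizing flows compose; Wave-Tacotron's actual flow (conditioning, number of stages) is not specified in this summary, so everything here is illustrative:

```python
import torch

# Toy affine-coupling step: transform half the variables conditioned on the
# other half, keeping the Jacobian determinant cheap to compute.
class AffineCoupling(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim // 2, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, dim))    # emits log-scale and shift

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * log_s.exp() + t        # transform one half
        logdet = log_s.sum(dim=-1)       # log|det Jacobian|
        return torch.cat([xa, yb], dim=-1), logdet

y, logdet = AffineCoupling(16)(torch.randn(4, 16))
```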
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
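A minimal sketch of the oracle ideal-ratio-mask idea, using the magnitude approximation that the mixture is the sum of source and interference magnitudes:

```python
import numpy as np

# Oracle ideal-ratio-mask (IRM) sketch: with ground-truth source and
# interference magnitudes in hand, the ceiling of a mask-based separator can
# be estimated without training anything.
def ideal_ratio_mask(S: np.ndarray, N: np.ndarray) -> np.ndarray:
    """S, N: magnitude spectrograms of the target source and interference."""
    return S / (S + N + 1e-8)

S = np.abs(np.random.randn(257, 100))      # stand-in source magnitudes
N = np.abs(np.random.randn(257, 100))      # stand-in interference magnitudes
S_hat = ideal_ratio_mask(S, N) * (S + N)   # oracle estimate from the mixture
```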
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
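A minimal causal encoder-decoder with skip connections in this spirit; layer counts and widths are assumptions, causality is enforced in the encoder via left-only padding, and the transposed-convolution decoder is kept deliberately simple:

```python
import torch

# Minimal causal encoder-decoder with skip connections; sizes are assumptions.
class CausalConv(torch.nn.Module):
    def __init__(self, cin, cout, k=4, stride=2):
        super().__init__()
        self.pad = k - stride                    # left-only padding: causal
        self.conv = torch.nn.Conv1d(cin, cout, k, stride)

    def forward(self, x):
        return self.conv(torch.nn.functional.pad(x, (self.pad, 0)))

class Enhancer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.ModuleList([CausalConv(1, 16), CausalConv(16, 32)])
        self.dec = torch.nn.ModuleList([
            torch.nn.ConvTranspose1d(32, 16, 4, 2, padding=1),
            torch.nn.ConvTranspose1d(16, 1, 4, 2, padding=1)])

    def forward(self, x):                        # x: (B, 1, T), T divisible by 4
        skips = []
        for layer in self.enc:
            x = torch.relu(layer(x))
            skips.append(x)
        for layer in self.dec:
            x = layer(x + skips.pop())           # skip connection from encoder
        return x

y = Enhancer()(torch.randn(2, 1, 1024))          # (2, 1, 1024)
```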
arXiv Detail & Related papers (2020-06-23T09:19:13Z)