Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic
Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear
Prediction
- URL: http://arxiv.org/abs/2105.09858v1
- Date: Thu, 20 May 2021 16:06:11 GMT
- Title: Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic
Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear
Prediction
- Authors: Patrick Lumban Tobing, Tomoki Toda
- Abstract summary: This paper presents a low-latency real-time (LLRT) non-parallel voice conversion framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP).
The proposed framework achieves high-performance VC while allowing LLRT usage on a single core of a $2.1$--$2.7$ GHz CPU, with a real-time factor of $0.87$--$0.95$ including input/output and feature extraction, at a frame shift of $10$ ms and a window length of $27.5$ ms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a low-latency real-time (LLRT) non-parallel voice
conversion (VC) framework based on cyclic variational autoencoder (CycleVAE)
and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a
robust non-parallel multispeaker spectral model, which utilizes a
speaker-independent latent space and a speaker-dependent code to generate
reconstructed/converted spectral features given the spectral features of an
input speaker. On the other hand, MWDLP is an efficient, high-quality neural
vocoder that can handle multispeaker data and generate speech waveforms for
LLRT applications on CPU. To accommodate the LLRT constraint on CPU, we
propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral
features and is built with a sparse network architecture. Further, to improve
the modeling performance, we also propose a novel fine-tuning procedure that
refines the frame-rate CycleVAE network by utilizing the waveform loss from the
MWDLP network. The experimental results demonstrate that the proposed framework
achieves high-performance VC while allowing LLRT usage on a single core of a
$2.1$--$2.7$ GHz CPU, with a real-time factor of $0.87$--$0.95$ including
input/output and feature extraction, at a frame shift of $10$ ms, a window
length of $27.5$ ms, and $2$ lookup frames.
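To make the conversion mechanism concrete, the sketch below shows a toy CycleVAE-style spectral model: an encoder maps input spectra to a speaker-independent latent, a decoder regenerates spectra from that latent plus a speaker-dependent code, and a cyclic pass converts to the target speaker and back for a reconstruction loss. All layer choices and sizes are illustrative assumptions, not the authors' implementation, which additionally uses a sparse network architecture and MWDLP waveform-loss fine-tuning.

```python
import torch
import torch.nn as nn

class ToyCycleVAE(nn.Module):
    """Minimal CycleVAE-style sketch (not the paper's code)."""

    def __init__(self, n_mels=80, n_latent=32, n_speakers=4, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, n_latent)
        self.to_logvar = nn.Linear(hidden, n_latent)
        # The decoder sees the (ideally speaker-independent) latent plus a
        # speaker-dependent one-hot code, so swapping the code converts.
        self.decoder = nn.GRU(n_latent + n_speakers, hidden, batch_first=True)
        self.to_mels = nn.Linear(hidden, n_mels)

    def forward(self, mels, spk_code):
        # mels: (B, T, n_mels); spk_code: (B, n_speakers) one-hot target code
        h, _ = self.encoder(mels)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        code = spk_code.unsqueeze(1).expand(-1, z.size(1), -1)
        out, _ = self.decoder(torch.cat([z, code], dim=-1))
        return self.to_mels(out), mu, logvar

# One "cycle": convert source -> target, then target -> source, and compare
# the cycled spectra with the input (plus the usual KL terms, omitted here).
model = ToyCycleVAE()
mels = torch.randn(1, 100, 80)                  # 100 frames = 1 s at a 10 ms shift
src, tgt = torch.eye(4)[0:1], torch.eye(4)[1:2]
converted, _, _ = model(mels, tgt)
cycled, mu, logvar = model(converted, src)
cycle_loss = torch.mean((cycled - mels) ** 2)
```

As a sanity check on the reported figures: with a $10$ ms frame shift, the $2$ lookup frames alone add about $20$ ms of buffering, and a real-time factor of $0.87$--$0.95$ (input/output and feature extraction included) means each $10$ ms frame is processed in roughly $8.7$--$9.5$ ms, which is what makes single-core CPU streaming feasible.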
Related papers
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS [1.8927791081850118]
This paper introduces R-MelNet, a two-part autoregressive architecture with a backend WaveRNN-style audio decoder.
The model produces low-resolution mel-spectral features, which a WaveRNN decoder then converts into an audio waveform.
arXiv Detail & Related papers (2022-06-30T13:29:31Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling [38.828260316517536]
This paper presents a novel universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP).
Experiments demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or languages when trained on data from 300 speakers covering clean and noisy/reverberant conditions (a toy sketch of the data-driven linear prediction idea follows below).
arXiv Detail & Related papers (2021-05-20T16:02:45Z)
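The "data-driven linear prediction" in the MWDLP title can be read as follows: instead of deriving short-term LP coefficients from the signal itself (e.g., via Levinson-Durbin), the network emits them, and the model only needs to capture the residual on top of the linear prediction. The snippet below is a toy sketch under that reading, not the MWDLP implementation (which operates on discrete multiband waveforms); the coefficient values and residual sampling are placeholders.

```python
import numpy as np

def lp_predict(history, coeffs):
    """One linear-prediction step: x_hat[t] = sum_k a[k] * x[t-1-k]."""
    order = len(coeffs)
    # Most recent sample first, so coeffs[0] weights x[t-1].
    return float(np.dot(coeffs, history[-order:][::-1]))

# Hypothetical per-sample flow: a network would supply the LP coefficients
# per frame plus a distribution over the residual (excitation); the output
# sample is the prediction plus the sampled residual.
rng = np.random.default_rng(0)
x = list(rng.standard_normal(16))            # toy waveform history
coeffs = np.array([1.2, -0.5, 0.1])          # stands in for network output
for _ in range(4):
    pred = lp_predict(np.array(x), coeffs)
    residual = 0.01 * rng.standard_normal()  # stands in for sampled excitation
    x.append(pred + residual)
```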
- Axial Residual Networks for CycleGAN-based Voice Conversion [0.0]
We propose a novel architecture and improved training objectives for non-parallel voice conversion.
Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram.
We demonstrate via experiments that our proposed model outperforms Scyclone and performs comparably to or better than CycleGAN-VC2, even without employing a neural vocoder.
arXiv Detail & Related papers (2021-02-16T10:55:35Z)
- StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization [9.866072912049031]
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity.
StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech (a minimal sketch of this idea follows below).
The highly parallelizable speech generation is several times faster than real time on CPUs and GPUs.
arXiv Detail & Related papers (2020-11-03T08:28:47Z)
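The temporal adaptive normalization mentioned above can be pictured as conditioning features predicting a per-timestep scale and shift that modulate normalized activations. The layer below is a minimal sketch in that spirit; the channel counts, kernel sizes, and the choice of instance normalization are assumptions, not StyleMelGAN's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalAdaptiveNorm(nn.Module):
    """Sketch: acoustic features modulate activations per timestep."""

    def __init__(self, channels, cond_channels):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # The conditioning signal predicts a per-channel, per-timestep
        # scale and shift for the normalized activations.
        self.to_scale = nn.Conv1d(cond_channels, channels, kernel_size=3, padding=1)
        self.to_shift = nn.Conv1d(cond_channels, channels, kernel_size=3, padding=1)

    def forward(self, x, cond):
        # x: (B, channels, T) noise-derived activations
        # cond: (B, cond_channels, T) acoustic features upsampled to length T
        return self.norm(x) * self.to_scale(cond) + self.to_shift(cond)

tade = TemporalAdaptiveNorm(channels=64, cond_channels=80)
styled = tade(torch.randn(1, 64, 200), torch.randn(1, 80, 200))  # -> (1, 64, 200)
```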
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module (a toy sketch of this two-stage structure follows at the end of this list).
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
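Finally, the two-stage structure in the last entry, a bottleneck feature extractor distilling speaker-independent content followed by a synthesis module that renders it in a target speaker's voice, can be sketched as below. This is a hypothetical simplification: the dimensions and plain recurrent layers are assumptions, and the paper's actual modules use seq2seq modeling with location-relative attention.

```python
import torch
import torch.nn as nn

class ToyBNEPipeline(nn.Module):
    """Sketch of a BNE + synthesis pipeline (not the paper's code)."""

    def __init__(self, n_mels=80, bottleneck=144, n_speakers=10, hidden=256):
        super().__init__()
        self.bne = nn.GRU(n_mels, bottleneck, batch_first=True)  # content extractor
        self.spk_emb = nn.Embedding(n_speakers, hidden)          # target voice
        self.synth = nn.GRU(bottleneck + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, mels, target_spk):
        content, _ = self.bne(mels)                    # (B, T, bottleneck)
        spk = self.spk_emb(target_spk).unsqueeze(1)    # (B, 1, hidden)
        spk = spk.expand(-1, content.size(1), -1)      # broadcast over time
        h, _ = self.synth(torch.cat([content, spk], dim=-1))
        return self.out(h)                             # target-speaker mels

model = ToyBNEPipeline()
converted = model(torch.randn(2, 120, 80), torch.tensor([3, 7]))  # (2, 120, 80)
```

Because only the speaker embedding changes between targets, the same content features can be rendered as any of the enrolled speakers, which is what "any-to-many" refers to.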