NVC-Net: End-to-End Adversarial Voice Conversion
- URL: http://arxiv.org/abs/2106.00992v1
- Date: Wed, 2 Jun 2021 07:19:58 GMT
- Title: NVC-Net: End-to-End Adversarial Voice Conversion
- Authors: Bac Nguyen and Fabien Cardinaux
- Abstract summary: NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
- Score: 7.14505983271756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion has gained increasing popularity in many applications of
speech synthesis. The idea is to change the voice identity from one speaker
into another while keeping the linguistic content unchanged. Many voice
conversion approaches rely on the use of a vocoder to reconstruct the speech
from acoustic features, and as a consequence, the speech quality heavily
depends on such a vocoder. In this paper, we propose NVC-Net, an end-to-end
adversarial network, which performs voice conversion directly on the raw audio
waveform of arbitrary length. By disentangling the speaker identity from the
speech content, NVC-Net is able to perform non-parallel traditional
many-to-many voice conversion as well as zero-shot voice conversion from a
short utterance of an unseen target speaker. Importantly, NVC-Net is
non-autoregressive and fully convolutional, achieving fast inference. Our model
is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA
V100 GPU, being orders of magnitude faster than state-of-the-art methods under
the same hardware configurations. Objective and subjective evaluations on
non-parallel many-to-many voice conversion tasks show that NVC-Net obtains
competitive results with significantly fewer parameters.
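To make the disentanglement concrete, the following is a minimal PyTorch sketch of the split the abstract describes: a fully convolutional content encoder on the raw waveform, an utterance-level speaker encoder, and a non-autoregressive upsampling decoder conditioned on the speaker embedding. All module names, layer sizes, and the mel-spectrogram input to the speaker encoder are illustrative assumptions, not the paper's exact architecture.

    # Minimal sketch of the encoder/decoder split described in the abstract.
    # Layer sizes and names are illustrative assumptions, not the paper's.
    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        """Fully convolutional encoder on the raw waveform."""
        def __init__(self, channels=32, content_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, channels, kernel_size=15, stride=4, padding=7),
                nn.GELU(),
                nn.Conv1d(channels, 2 * channels, kernel_size=15, stride=4, padding=7),
                nn.GELU(),
                nn.Conv1d(2 * channels, content_dim, kernel_size=7, padding=3),
            )

        def forward(self, wav):                # wav: (batch, 1, time)
            return self.net(wav)               # content: (batch, content_dim, time / 16)

    class SpeakerEncoder(nn.Module):
        """Time-averaged utterance embedding, here computed from a mel-spectrogram."""
        def __init__(self, n_mels=80, spk_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
                nn.GELU(),
                nn.Conv1d(256, spk_dim, kernel_size=5, padding=2),
            )

        def forward(self, mel):                # mel: (batch, n_mels, frames)
            return self.net(mel).mean(dim=-1)  # speaker embedding: (batch, spk_dim)

    class Decoder(nn.Module):
        """Non-autoregressive, fully convolutional upsampling decoder."""
        def __init__(self, content_dim=4, spk_dim=128, channels=64):
            super().__init__()
            self.cond = nn.Linear(spk_dim, channels)
            self.inp = nn.Conv1d(content_dim, channels, kernel_size=7, padding=3)
            self.up = nn.Sequential(
                nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
                nn.GELU(),
                nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
                nn.GELU(),
                nn.Conv1d(channels, 1, kernel_size=7, padding=3),
                nn.Tanh(),
            )

        def forward(self, content, spk):
            h = self.inp(content) + self.cond(spk).unsqueeze(-1)  # broadcast speaker code over time
            return self.up(h)                                     # waveform in [-1, 1]

    # Conversion: linguistic content from the source, identity from the target:
    # y = Decoder()(ContentEncoder()(wav_src), SpeakerEncoder()(mel_tgt))

Because the decoder is convolutional rather than autoregressive, an entire utterance is generated in a single forward pass, which is what makes the reported sample throughput possible.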
Related papers
- Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices [28.998590651956153]
We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
arXiv Detail & Related papers (2023-10-12T08:00:25Z)
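The kNN-VC model in the entry above converts voices by replacing each source frame's self-supervised feature with the average of its nearest neighbors in a pool of target-speaker features. Below is a minimal NumPy sketch of that matching step; extracting the features (kNN-VC uses WavLM) and vocoding the result back to audio are assumed to happen outside the snippet.

    # Minimal sketch of the kNN matching step behind kNN-VC: every source
    # frame feature is replaced by the mean of its k nearest neighbors in
    # a pool of target-speaker features. Shapes are illustrative.
    import numpy as np

    def knn_convert(src_feats, tgt_pool, k=4):
        """src_feats: (T, D) source frames; tgt_pool: (N, D) target-speaker frames."""
        # Cosine similarity between every source frame and every pool frame.
        src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
        tgt = tgt_pool / np.linalg.norm(tgt_pool, axis=1, keepdims=True)
        sims = src @ tgt.T                       # (T, N) similarity matrix
        idx = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar frames
        return tgt_pool[idx].mean(axis=1)        # (T, D) converted feature sequence

    # converted = knn_convert(wavlm(src_wav), wavlm(tgt_wavs))  # then vocode to audio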
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automatic speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
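Read as a pipeline, the HiFi-VC summary above combines three components: ASR features for content, a pitch track for prosody, and a conditioned waveform predictor. The sketch below shows one way they could fit together; every callable here is a hypothetical placeholder, not the paper's API.

    # Hedged sketch of an ASR-feature-based any-to-any VC pipeline:
    # speaker-independent ASR features carry the content, an F0 track
    # carries prosody, and a target-speaker embedding conditions a neural
    # waveform decoder. All callables are placeholders.
    def convert(src_wav, tgt_wav, asr_encoder, f0_tracker, spk_encoder, decoder):
        content = asr_encoder(src_wav)    # linguistic features, largely speaker-free
        f0 = f0_tracker(src_wav)          # pitch contour of the source utterance
        spk = spk_encoder(tgt_wav)        # identity embedding of the target speaker
        return decoder(content, f0, spk)  # predicted waveform in the target voice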
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
Its Lip2Ind network can substitute for the content encoder of the VC model, forming a multi-speaker VTS system that converts silent video into acoustic units from which accurate spoken content is reconstructed.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
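The quantization trick in the VQVC+ entry above can be stated compactly: the quantized codes retain the content, while the time-averaged residual between the encoder output and its quantized version serves as the speaker representation. A minimal PyTorch sketch follows, with the encoder, decoder, codebook training, and VQVC+'s U-Net refinement all elided.

    # Minimal sketch of vector-quantization disentanglement: nearest-code
    # quantization keeps the content, the pooled residual keeps the speaker.
    import torch

    def disentangle(z, codebook):
        """z: (T, D) encoder output; codebook: (K, D) learned code vectors."""
        dists = torch.cdist(z, codebook)         # (T, K) distances to every code
        content = codebook[dists.argmin(dim=1)]  # nearest-code quantization keeps content
        speaker = (z - content).mean(dim=0)      # time-averaged residual keeps speaker
        return content, speaker

    # Conversion: decode content_src combined with speaker_tgt, where
    # speaker_tgt is the residual computed from a target-speaker utterance.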
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
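The last entry conditions the decoder on an explicit F0 contour so that output pitch can be controlled or matched to the target speaker. One common way to build such a conditioning signal, assumed here rather than taken from the paper, is log-domain mean-variance normalization of the source F0 into the target speaker's range.

    # Hedged sketch: move the source F0 contour into the target speaker's
    # log-F0 range before feeding it to an F0-conditioned decoder.
    import numpy as np

    def transpose_f0(src_f0, tgt_log_f0_mean, tgt_log_f0_std):
        """src_f0: (T,) in Hz, 0 for unvoiced frames; target stats are over log-F0."""
        out = np.zeros_like(src_f0, dtype=float)
        voiced = src_f0 > 0
        log_f0 = np.log(src_f0[voiced])
        # Shift and rescale the source contour into the target's log-F0 range.
        normed = (log_f0 - log_f0.mean()) / log_f0.std()
        out[voiced] = np.exp(normed * tgt_log_f0_std + tgt_log_f0_mean)
        return out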
This list is automatically generated from the titles and abstracts of the papers on this site.