Voice Conversion With Just Nearest Neighbors
- URL: http://arxiv.org/abs/2305.18975v1
- Date: Tue, 30 May 2023 12:19:07 GMT
- Title: Voice Conversion With Just Nearest Neighbors
- Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper
- Abstract summary: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference.
We propose k-nearest neighbors voice conversion (kNN-VC), a straightforward yet effective method for any-to-any conversion.
- Score: 22.835346602837063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Any-to-any voice conversion aims to transform source speech into a target
voice with just a few examples of the target speaker as a reference. Recent
methods produce convincing conversions, but at the cost of increased complexity
-- making results difficult to reproduce and build on. Instead, we keep it
simple. We propose k-nearest neighbors voice conversion (kNN-VC): a
straightforward yet effective method for any-to-any conversion. First, we
extract self-supervised representations of the source and reference speech. To
convert to the target speaker, we replace each frame of the source
representation with its nearest neighbor in the reference. Finally, a
pretrained vocoder synthesizes audio from the converted representation.
Objective and subjective evaluations show that kNN-VC improves speaker
similarity with similar intelligibility scores to existing methods. Code,
samples, trained models: https://bshall.github.io/knn-vc
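Below is a minimal sketch of the matching step described in the abstract, assuming the source and reference features have already been extracted with a self-supervised encoder (feature extraction and vocoding are omitted). The function name knn_convert and the use of NumPy with cosine similarity are illustrative assumptions, not the authors' exact implementation:
```python
# Minimal sketch of the kNN-VC matching step. Inputs are frame-level
# features from a self-supervised encoder; output feeds a vocoder.
import numpy as np

def knn_convert(source, reference, k=1):
    """Replace each source frame with the mean of its k nearest
    reference frames under cosine similarity.

    source:    (T_src, D) features of the source speech
    reference: (T_ref, D) features of the target-speaker reference
    returns:   (T_src, D) converted feature sequence
    """
    # L2-normalize rows so a dot product equals cosine similarity
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = src @ ref.T                       # (T_src, T_ref) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # k best matches per frame
    return reference[topk].mean(axis=1)      # average the matched frames

# Toy usage: 100 source frames, 250 reference frames, 1024-dim features
rng = np.random.default_rng(0)
converted = knn_convert(rng.normal(size=(100, 1024)),
                        rng.normal(size=(250, 1024)))
print(converted.shape)  # (100, 1024)
```
With k=1 this is exactly the frame-by-frame nearest-neighbor replacement the abstract describes; averaging a few neighbors (k > 1) is a natural smoothing variant.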
Related papers
- Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS [52.89324095217975]
Previous approaches to accent conversion mainly aimed at making non-native speech sound more native.
We develop a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers.
arXiv Detail & Related papers (2024-10-19T06:12:31Z)
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native speech to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training [3.9306467064810438]
One-shot voice conversion aims to change the timbre of any source speech to match that of the target speaker with only one speech sample.
Existing style-transfer-based VC methods rely on speech representation disentanglement.
We propose Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder.
arXiv Detail & Related papers (2024-09-03T07:21:19Z)
- Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices [28.998590651956153]
We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
arXiv Detail & Related papers (2023-10-12T08:00:25Z)
- Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion [0.0]
We make the first attempt to reliably restore the source voiceprint from audio synthesized by voice conversion methods.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
arXiv Detail & Related papers (2023-02-24T03:33:13Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a minimal sketch of the VQ step appears after this list.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
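As noted in the VQMIVC entry above, vector quantization for content encoding snaps each frame embedding to its nearest codebook entry, discarding the fine-grained detail that tends to carry speaker identity. The following is a hypothetical, minimal sketch of that lookup; the function name and shapes are illustrative assumptions, not VQMIVC's actual implementation:
```python
# Hypothetical sketch of vector-quantized content encoding: each frame
# embedding is replaced by its nearest learned codebook entry.
import numpy as np

def vector_quantize(frames, codebook):
    """frames: (T, D) encoder outputs; codebook: (K, D) learned codes.
    Returns the quantized frames and their code indices."""
    # Squared Euclidean distance from every frame to every code
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)     # index of the nearest code per frame
    return codebook[idx], idx

# Toy usage: 50 frames of 64-dim features, a 128-entry codebook
rng = np.random.default_rng(0)
quantized, idx = vector_quantize(rng.normal(size=(50, 64)),
                                 rng.normal(size=(128, 64)))
print(quantized.shape, idx.shape)  # (50, 64) (50,)
```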
This list is automatically generated from the titles and abstracts of the papers on this site.