Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and
Textually Described Voices
- URL: http://arxiv.org/abs/2310.08104v1
- Date: Thu, 12 Oct 2023 08:00:25 GMT
- Title: Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and
Textually Described Voices
- Authors: Matthew Baas and Herman Kamper
- Abstract summary: We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
- Score: 28.998590651956153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion aims to convert source speech into a target voice using
recordings of the target speaker as a reference. Newer models are producing
increasingly realistic output. But what happens when models are fed with
non-standard data, such as speech from a user with a speech impairment? We
investigate how a recent voice conversion model performs on non-standard
downstream voice conversion tasks. We use a simple but robust approach called
k-nearest neighbors voice conversion (kNN-VC). We look at four non-standard
applications: stuttered voice conversion, cross-lingual voice conversion,
musical instrument conversion, and text-to-voice conversion. The latter
involves converting to a target voice specified through a text description,
e.g. "a young man with a high-pitched voice". Compared to an established
baseline, we find that kNN-VC retains high performance in stuttered and
cross-lingual voice conversion. Results are more mixed for the musical
instrument and text-to-voice conversion tasks. For example, kNN-VC works well on
some instruments, like drums, but not on others. Nevertheless, this shows that voice
conversion models - and kNN-VC in particular - are increasingly applicable in a
range of non-standard downstream tasks. But there are still limitations when
samples are very far from the training distribution. Code, samples, trained
models: https://rf5.github.io/sacair2023-knnvc-demo/.
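As a rough illustration of the idea behind kNN-VC, the sketch below implements its core matching step: every frame of source features is replaced by the average of its k nearest neighbours in a pool of target-speaker frames. This is a minimal sketch, not the authors' code: it assumes features have already been extracted with a self-supervised encoder (the paper builds on WavLM), uses cosine similarity to select neighbours, and omits the vocoder that turns the matched features back into a waveform. The function name knn_match and the random tensors are illustrative stand-ins only.

```python
# Minimal sketch of the kNN-VC matching step (assumed details noted above).
import torch

def knn_match(source_feats: torch.Tensor,
              target_feats: torch.Tensor,
              k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    target frames under cosine similarity.

    source_feats: (n_src, dim) frames of the utterance to convert.
    target_feats: (n_tgt, dim) pooled frames from target recordings.
    Returns: (n_src, dim) converted frames, ready for a vocoder.
    """
    # Normalise so that dot products equal cosine similarities.
    src = torch.nn.functional.normalize(source_feats, dim=-1)
    tgt = torch.nn.functional.normalize(target_feats, dim=-1)
    sims = src @ tgt.T                  # (n_src, n_tgt) similarity matrix
    idx = sims.topk(k, dim=-1).indices  # k most similar target frames per source frame
    # Average the selected (un-normalised) target frames.
    return target_feats[idx].mean(dim=1)

# Illustrative usage with random stand-ins for encoder features:
src = torch.randn(200, 1024)    # 200 source frames, 1024-dim features
tgt = torch.randn(1500, 1024)   # matching set from the target speaker
converted = knn_match(src, tgt, k=4)
print(converted.shape)          # torch.Size([200, 1024])
```

Because nothing in this recipe is specific to speech, the matching set can just as well be built from stuttered speech, an unseen language, or instrument recordings, which is what the non-standard applications above exploit.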
Related papers
- Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z)
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale [58.46845567087977]
Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (1.9% vs 5.9% word error rate) and audio similarity (0.681 vs 0.580), while being up to 20 times faster.
arXiv Detail & Related papers (2023-06-23T16:23:24Z)
- Voice Conversion With Just Nearest Neighbors [22.835346602837063]
Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference.
We propose k-nearest neighbors voice conversion (kNN-VC), a straightforward yet effective method for any-to-any conversion.
arXiv Detail & Related papers (2023-05-30T12:19:07Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automatic speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure [64.54208910952651]
We compare the audio transformer models Mockingjay and wav2vec 2.0.
We probe the audio models' understanding of textual surface, syntax, and semantic features.
We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.
arXiv Detail & Related papers (2021-01-02T06:29:12Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcription in the target voice with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
- Vocoder-free End-to-End Voice Conversion with Transformer Network [5.5792083698526405]
Mel-frequency filter bank (MFB) based approaches have the advantage of faster learning compared to raw-spectrum approaches, since MFB features are lower-dimensional.
However, it is possible to use only the raw spectrum, along with the phase, to generate voices in different styles with clear pronunciation.
In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network.
arXiv Detail & Related papers (2020-02-05T06:19:24Z)