Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and
Textually Described Voices
- URL: http://arxiv.org/abs/2310.08104v1
- Date: Thu, 12 Oct 2023 08:00:25 GMT
- Title: Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and
Textually Described Voices
- Authors: Matthew Baas and Herman Kamper
- Abstract summary: We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
- Score: 28.998590651956153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion aims to convert source speech into a target voice using
recordings of the target speaker as a reference. Newer models are producing
increasingly realistic output. But what happens when models are fed with
non-standard data, such as speech from a user with a speech impairment? We
investigate how a recent voice conversion model performs on non-standard
downstream voice conversion tasks. We use a simple but robust approach called
k-nearest neighbors voice conversion (kNN-VC). We look at four non-standard
applications: stuttered voice conversion, cross-lingual voice conversion,
musical instrument conversion, and text-to-voice conversion. The latter
involves converting to a target voice specified through a text description,
e.g. "a young man with a high-pitched voice". Compared to an established
baseline, we find that kNN-VC retains high performance in stuttered and
cross-lingual voice conversion. Results are more mixed for the musical
instrument and text-to-voice conversion tasks. For example, kNN-VC works well on
some instruments, like drums, but not on others. Nevertheless, this shows that voice
conversion models - and kNN-VC in particular - are increasingly applicable in a
range of non-standard downstream tasks. But there are still limitations when
samples are very far from the training distribution. Code, samples, trained
models: https://rf5.github.io/sacair2023-knnvc-demo/.
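As a rough illustration of the idea behind kNN-VC, the sketch below implements its core matching step: every frame of source features is replaced by the average of its k nearest neighbours in a pool of target-speaker frames. This is a minimal sketch, not the authors' code: it assumes features have already been extracted with a self-supervised encoder (the paper builds on WavLM), uses cosine similarity to select neighbours, and omits the vocoder that turns the matched features back into a waveform. The function name knn_match and the random tensors are illustrative stand-ins only.

```python
# Minimal sketch of the kNN-VC matching step (assumed details noted above).
import torch

def knn_match(source_feats: torch.Tensor,
              target_feats: torch.Tensor,
              k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    target frames under cosine similarity.

    source_feats: (n_src, dim) frames of the utterance to convert.
    target_feats: (n_tgt, dim) pooled frames from target recordings.
    Returns: (n_src, dim) converted frames, ready for a vocoder.
    """
    # Normalise so that dot products equal cosine similarities.
    src = torch.nn.functional.normalize(source_feats, dim=-1)
    tgt = torch.nn.functional.normalize(target_feats, dim=-1)
    sims = src @ tgt.T                  # (n_src, n_tgt) similarity matrix
    idx = sims.topk(k, dim=-1).indices  # k most similar target frames per source frame
    # Average the selected (un-normalised) target frames.
    return target_feats[idx].mean(dim=1)

# Illustrative usage with random stand-ins for encoder features:
src = torch.randn(200, 1024)    # 200 source frames, 1024-dim features
tgt = torch.randn(1500, 1024)   # matching set from the target speaker
converted = knn_match(src, tgt, k=4)
print(converted.shape)          # torch.Size([200, 1024])
```

Because nothing in this recipe is specific to speech, the matching set can just as well be built from stuttered speech, an unseen language, or instrument recordings, which is what the non-standard applications above exploit.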
Related papers
- Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z)
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale [58.46845567087977]
Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (1.9% vs 5.9% word error rate) and audio similarity (0.681 vs 0.580), while being up to 20 times faster.
arXiv Detail & Related papers (2023-06-23T16:23:24Z)
- Voice Conversion With Just Nearest Neighbors [22.835346602837063]
Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference.
We propose k-nearest neighbors voice conversion (kNN-VC), a straightforward yet effective method for any-to-any conversion.
arXiv Detail & Related papers (2023-05-30T12:19:07Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automatic speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure [64.54208910952651]
We compare the audio transformer models Mockingjay and wav2vec 2.0.
We probe the audio models' understanding of textual surface, syntax, and semantic features.
We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.
arXiv Detail & Related papers (2021-01-02T06:29:12Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcription in the target voice with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
- Vocoder-free End-to-End Voice Conversion with Transformer Network [5.5792083698526405]
Mel-frequency filter bank (MFB) based approaches have the advantage of faster learning compared to raw-spectrum approaches, since MFB features are lower-dimensional.
However, it is possible to use only the raw spectrum, along with the phase, to generate voices in different styles with clear pronunciation.
In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network.
arXiv Detail & Related papers (2020-02-05T06:19:24Z)