Toward Degradation-Robust Voice Conversion
- URL: http://arxiv.org/abs/2110.07537v1
- Date: Thu, 14 Oct 2021 17:00:34 GMT
- Title: Toward Degradation-Robust Voice Conversion
- Authors: Chien-yu Huang, Kai-Wei Chang, Hung-yi Lee
- Abstract summary: Any-to-any voice conversion technologies convert the vocal timbre of an utterance to any speaker, even one unseen during training.
It is difficult to collect clean utterances of a speaker; recordings are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion.
- Score: 94.60503904292916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Any-to-any voice conversion technologies convert the vocal timbre of an
utterance to any speaker, even one unseen during training. Although several
state-of-the-art any-to-any voice conversion models exist, they all rely on
clean utterances to convert successfully. In real-world scenarios, however, it
is difficult to collect clean utterances of a speaker; recordings are usually
degraded by noise or reverberation. It is thus highly desirable to understand
how these degradations affect voice conversion and to build a
degradation-robust model. We report in this paper the first comprehensive
study on the degradation robustness of any-to-any voice conversion. We show
that the performance of current state-of-the-art models is severely hampered
by degraded utterances. To improve robustness, we propose speech enhancement
concatenation and denoising training. In addition to common degradations, we
also consider adversarial noises, which alter the model output significantly
yet are imperceptible to humans. We show that both concatenation with
off-the-shelf speech enhancement models and denoising training of voice
conversion models improve robustness, while each has its own pros and cons.
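To make the two defenses concrete, here is a minimal PyTorch-style sketch under stated assumptions: `vc_model`, `enhance_model`, and the `degrade` helper are hypothetical stand-ins (the paper does not specify these interfaces), and the VC model is assumed to be autoencoder-style, so self-reconstruction of a clean target is a valid training signal.

```python
import torch
import torch.nn.functional as F

def degrade(wav: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix additive noise into a waveform at a target SNR (reverberation
    could be simulated similarly by convolving with a room impulse response)."""
    noise = noise[..., : wav.shape[-1]]
    sig_pow = wav.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wav + scale * noise

def denoising_step(vc_model, optimizer, clean_wav, noise):
    """Denoising training: the model sees a degraded utterance but is asked
    to reconstruct the clean one (self-reconstruction, as in many
    autoencoder-based any-to-any VC models)."""
    snr_db = float(torch.empty(1).uniform_(0.0, 20.0))  # random degradation level
    noisy = degrade(clean_wav, noise, snr_db)
    recon = vc_model(source=noisy, reference=clean_wav)  # hypothetical VC interface
    loss = F.l1_loss(recon, clean_wav)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def robust_convert(enhance_model, vc_model, degraded_src, ref_wav):
    """Speech enhancement concatenation: run an off-the-shelf enhancement
    model as a front end, then convert the cleaned utterance."""
    with torch.no_grad():
        cleaned = enhance_model(degraded_src)
        return vc_model(source=cleaned, reference=ref_wav)
```

Intuitively, denoising training bakes robustness into the converter itself at the cost of retraining, while concatenation reuses existing models off the shelf but can pass enhancement artifacts on to the converter.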
Related papers
- Zero-shot Voice Conversion with Diffusion Transformers [0.0]
Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker.
Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks.
We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training.
arXiv Detail & Related papers (2024-11-15T04:43:44Z)
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, degrading the naturalness of converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.
Enhancing the robustness of speech processing models is therefore essential for maintaining good performance on distorted speech.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained on only 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Defending Your Voice: Adversarial Attack on Voice Conversion [70.19396655909455]
We report the first known attempt to perform adversarial attack on voice conversion.
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
The speaker characteristics of the converted utterances are thereby made clearly different from those of the defended speaker; a rough sketch of this style of attack appears after the list.
arXiv Detail & Related papers (2020-05-18T14:51:54Z)
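As background on the adversarial noises discussed in the abstract, below is a projected-gradient sketch of such an attack. It assumes a differentiable `speaker_encoder` and is an illustrative approximation, not the exact objective used in "Defending Your Voice": the perturbation is kept inside a small L-infinity ball so it stays hard to hear, while the utterance's speaker embedding is pushed away from its original value.

```python
import torch
import torch.nn.functional as F

def imperceptible_perturbation(speaker_encoder, wav, eps=2e-3, alpha=5e-4, steps=20):
    """PGD-style attack sketch: minimize the cosine similarity between the
    perturbed utterance's speaker embedding and the original one, with the
    perturbation clamped to [-eps, eps] per sample to stay imperceptible."""
    with torch.no_grad():
        orig_emb = speaker_encoder(wav)               # original speaker identity
    delta = torch.zeros_like(wav, requires_grad=True)
    for _ in range(steps):
        emb = speaker_encoder(wav + delta)
        sim = F.cosine_similarity(emb, orig_emb, dim=-1).mean()
        sim.backward()                                # descend on similarity
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()        # embeddings drift apart
            delta.clamp_(-eps, eps)                   # enforce the L-inf budget
        delta.grad.zero_()
    return (wav + delta).detach()
```

A perturbation crafted this way changes the speaker identity that downstream VC models extract, which is why the degradation study above treats adversarial noise separately from ordinary noise and reverberation.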