Automatic Speech Disentanglement for Voice Conversion using Rank Module
and Speech Augmentation
- URL: http://arxiv.org/abs/2306.12259v1
- Date: Wed, 21 Jun 2023 13:28:06 GMT
- Authors: Zhonghua Liu, Shijun Wang, Ning Chen
- Abstract summary: Voice Conversion (VC) converts the voice of a source speech to that of a target while maintaining the source's content.
We propose a VC model that can automatically disentangle speech into four components using only two augmentation functions.
- Score: 4.961389445237138
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Voice Conversion (VC) converts the voice of a source speech to that of a
target while maintaining the source's content. Speech can be mainly decomposed
into four components: content, timbre, rhythm and pitch. Unfortunately, most
related works only take into account content and timbre, which results in less
natural speech. Some recent works are able to disentangle speech into several
components, but they require laborious bottleneck tuning or various
hand-crafted features, each assumed to contain disentangled speech information.
In this paper, we propose a VC model that can automatically disentangle speech
into four components using only two augmentation functions, without the
requirement of multiple hand-crafted features or laborious bottleneck tuning.
The proposed model is straightforward yet efficient, and the empirical results
demonstrate that our model can achieve a better performance than the baseline,
regarding disentanglement effectiveness and speech naturalness.
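The abstract does not say which two augmentation functions the model uses. As a rough illustration only, the sketch below assumes that one perturbs pitch and the other perturbs rhythm, and that both act on frame-level features rather than on the waveform. The names `shift_pitch` and `stretch_rhythm` are hypothetical and are not taken from the paper.

```python
import math


def shift_pitch(f0_hz, semitones):
    """Scale a frame-level F0 contour by a number of semitones.

    Assumed pitch augmentation: unvoiced frames encoded as 0 Hz stay 0.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor for f in f0_hz]


def stretch_rhythm(frames, rate):
    """Resample a sequence of feature frames along the time axis.

    Assumed rhythm augmentation: rate > 1 yields fewer frames (faster
    speech), rate < 1 yields more frames (slower speech). Frames are
    linearly interpolated, so per-frame feature values are preserved
    at the endpoints.
    """
    n_in = len(frames)
    n_out = max(1, round(n_in / rate))
    out = []
    for i in range(n_out):
        # Map output index i to a fractional position in the input.
        pos = i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(math.floor(pos))
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Toy inputs: a 3-frame F0 contour and a 10-frame, 2-dimensional feature sequence.
f0 = [100.0, 0.0, 200.0]
frames = [[float(t), 2.0 * t] for t in range(10)]

f0_up = shift_pitch(f0, 12)          # one octave up
fast = stretch_rhythm(frames, 2.0)   # half as many frames
slow = stretch_rhythm(frames, 0.5)   # twice as many frames
```

Each function changes one factor and leaves the other input untouched, which is the property an augmentation-based disentanglement scheme would rely on. How the paper's rank module consumes such augmented pairs is not described in the abstract and is not modelled here.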
Related papers
- SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition [67.08798754009153]
Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
arXiv Detail & Related papers (2024-01-31T18:06:29Z)
- Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks [61.3055230762097]
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning.
arXiv Detail & Related papers (2023-09-14T03:13:18Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks aiming at generating high quality speaking voice according to different input modality.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
arXiv Detail & Related papers (2021-01-30T08:29:55Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.