Many-to-Many Voice Conversion based Feature Disentanglement using
Variational Autoencoder
- URL: http://arxiv.org/abs/2107.06642v1
- Date: Sun, 11 Jul 2021 13:31:16 GMT
- Title: Many-to-Many Voice Conversion based Feature Disentanglement using
Variational Autoencoder
- Authors: Manh Luong and Viet Anh Tran
- Abstract summary: We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
- Score: 2.4975981795360847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice conversion is a challenging task which transforms the voice
characteristics of a source speaker to a target speaker without changing
linguistic content. Many recent works on many-to-many Voice
Conversion (VC) based on Variational Autoencoders (VAEs) achieve good results.
However, these methods lack the ability to disentangle speaker identity and
linguistic content in a way that yields good performance in unseen-speaker scenarios. In
this paper, we propose a new method based on feature disentanglement to tackle
many-to-many voice conversion. The method can disentangle
speaker identity and linguistic content from utterances, and it can convert from
many source speakers to many target speakers with a single autoencoder network.
Moreover, it naturally handles unseen target speakers. We
perform both objective and subjective evaluations to show the competitive
performance of our proposed method compared with other state-of-the-art models
in terms of naturalness and target speaker similarity.
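The conversion idea described in the abstract (keep the linguistic content of the source utterance, take the speaker identity from the target utterance, decode with a single network) can be sketched as a data-flow toy. This is not the authors' architecture: the linear "encoders" and "decoder" use random, untrained weights, and all names and sizes (`N_MELS`, `CONTENT_DIM`, `SPEAKER_DIM`, `encode_content`, `encode_speaker`, `decode`, `convert`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS, CONTENT_DIM, SPEAKER_DIM = 80, 16, 8  # hypothetical sizes

# Random, untrained weights: stand-ins for learned networks (assumption).
W_content = rng.normal(scale=0.1, size=(N_MELS, 2 * CONTENT_DIM))
W_speaker = rng.normal(scale=0.1, size=(N_MELS, SPEAKER_DIM))
W_decoder = rng.normal(scale=0.1, size=(CONTENT_DIM + SPEAKER_DIM, N_MELS))


def encode_content(mel, sample=True):
    """Per-frame Gaussian posterior over content; returns a (T, CONTENT_DIM) latent."""
    stats = mel @ W_content
    mean, log_var = np.split(stats, 2, axis=-1)
    if not sample:
        return mean  # deterministic latent, typical at conversion time
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps  # reparameterization trick


def encode_speaker(mel):
    """Utterance-level speaker embedding: time-averaged projection, shape (SPEAKER_DIM,)."""
    return (mel @ W_speaker).mean(axis=0)


def decode(content, speaker):
    """Repeat the speaker embedding over time, concatenate, and project back to mel bins."""
    tiled = np.broadcast_to(speaker, (content.shape[0], speaker.shape[0]))
    return np.concatenate([content, tiled], axis=-1) @ W_decoder


def convert(source_mel, target_mel):
    """Keep the source's content latent and swap in the target's speaker embedding."""
    return decode(encode_content(source_mel, sample=False), encode_speaker(target_mel))


source = rng.normal(size=(120, N_MELS))  # fake 120-frame source utterance
target = rng.normal(size=(95, N_MELS))   # fake 95-frame target utterance
converted = convert(source, target)
print(converted.shape)  # (120, 80)
```

Because the speaker embedding is computed from whatever target utterance is supplied, nothing in this sketch depends on the target speaker having been seen before, which mirrors the unseen-speaker use case mentioned in the abstract. The output keeps the source's frame count, since only the speaker code is replaced.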
Related papers
- SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations [12.423959479216895]
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance.
Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech.
We propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes.
arXiv Detail & Related papers (2024-11-25T07:14:26Z)
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Who is Authentic Speaker [4.822108779108675]
Voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a major challenge, because the acoustic characteristics of the source speakers are changed greatly.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
arXiv Detail & Related papers (2024-04-30T23:41:00Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for
Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention
VAE [8.144263449781967]
The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we found a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for
Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained with only 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net
architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.