Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE
- URL: http://arxiv.org/abs/2203.16037v1
- Date: Wed, 30 Mar 2022 03:52:42 GMT
- Title: Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE
- Authors: Ziang Long, Yunling Zheng, Meng Yu, Jack Xin
- Abstract summary: A variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we found a suitable location in the VAE's decoder to add a self-attention layer for incorporating non-local information when generating a converted utterance.
- Score: 8.144263449781967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings, then generating an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering a desired sentence. In this work, we found a suitable location in the VAE's decoder to add a self-attention layer for incorporating non-local information when generating a converted utterance and hiding the source speaker's identity. In experiments on the zero-shot many-to-many voice conversion task on the VCTK data set, the self-attention layer enhances speaker classification accuracy on unseen speakers by 27% while increasing the decoder parameter size by 12%. The voice quality of the converted utterances degrades by merely 3% as measured by MOSNet scores. To reduce over-fitting and generalization error, we further applied a relaxed group-wise splitting method in network training and achieved a 46% gain in speaker classification accuracy on unseen speakers while maintaining conversion voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating a greater variety of attention structures in the VAE framework for advancing zero-shot many-to-many voice conversion.
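The conversion mechanism described in the abstract, concatenating a target-speaker identity embedding with per-frame content embeddings and passing the result through a decoder augmented with a self-attention layer, can be sketched in a few lines. This is a minimal numpy illustration, not the authors' architecture: all dimensions, weight matrices, and the single-head residual attention are assumptions made for the example.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual
    connection, letting each frame attend to all other frames (the
    'non-local information' the abstract refers to)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) frame-to-frame
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over frames
    return x + w @ v                               # residual output, (T, d)

# Toy sizes: T frames, hypothetical embedding dimensions.
T, d_spk, d_cont = 5, 4, 8
d = d_spk + d_cont
rng = np.random.default_rng(0)

spk = rng.normal(size=d_spk)                  # target-speaker identity embedding
content = rng.normal(size=(T, d_cont))        # source-utterance content, per frame
z = np.concatenate([np.tile(spk, (T, 1)), content], axis=1)  # (T, d) decoder input

wq, wk, wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
out = self_attention(z, wq, wk, wv)
print(out.shape)  # (5, 12)
```

In the paper's setting this layer sits at a particular depth inside the VAE decoder; here it is applied directly to the concatenated embedding only to show the data flow.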
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z) - ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly
Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z) - DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion [0.0]
DeID-VC is a speaker de-identification system that converts a real speaker to pseudo speakers.
With the help of PSG, DeID-VC can assign unique pseudo speakers at speaker level or even at utterance level.
arXiv Detail & Related papers (2022-09-09T21:13:08Z) - Robust Disentangled Variational Speech Representation Learning for
Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z) - Training Robust Zero-Shot Voice Conversion Models with Self-supervised
Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z) - Many-to-Many Voice Conversion based Feature Disentanglement using
Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
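The VQMIVC summary above describes vector quantization (VQ) as the content-encoding step: each continuous encoder frame is snapped to its nearest entry in a learned codebook, discarding speaker detail while keeping discrete content codes. A minimal sketch of that lookup, with codebook size and dimensions chosen arbitrarily for illustration (the mutual-information training objective is omitted):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each frame of z with its nearest codebook entry (L2 distance).
    Returns the quantized frames and their code indices."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d2.argmin(axis=1)                                          # (T,)
    return codebook[idx], idx

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 8))   # K=16 discrete content codes, dim 8
z_cont = rng.normal(size=(10, 8))     # hypothetical encoder outputs, 10 frames
zq, idx = vector_quantize(z_cont, codebook)
print(zq.shape, idx.shape)  # (10, 8) (10,)
```

In training, the codebook itself is learned (e.g., via a commitment loss with a straight-through gradient); here it is fixed random values purely to show the quantization step.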
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and
Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z) - SpEx: Multi-Scale Time Domain Speaker Extraction Network [89.00319878262005]
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
It is common to perform the extraction in the frequency domain and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
arXiv Detail & Related papers (2020-04-17T16:13:06Z) - F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.