Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
- URL: http://arxiv.org/abs/2205.05227v1
- Date: Wed, 11 May 2022 01:19:42 GMT
- Title: Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
- Authors: Jiachen Lian and Chunlei Zhang and Gopala Krishna Anumanchipalli and
Dong Yu
- Abstract summary: Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion.
We propose conditional DSVAE, a new model that incorporates content bias as a condition for prior modeling.
We demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve much better phoneme classification accuracy.
- Score: 30.376259456529368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Disentangling content and speaking style information is essential for
zero-shot non-parallel voice conversion (VC). Our previous study investigated a
novel framework with a disentangled sequential variational autoencoder (DSVAE)
as the backbone for information decomposition. We demonstrated that
simultaneously disentangling the content embedding and the speaker embedding
from a single utterance is feasible for zero-shot VC. In this study, we continue
this direction by raising a concern about the prior distribution of the content
branch in the DSVAE baseline. We find that the randomly initialized prior
distribution forces the content embedding to discard phonetic-structure
information during learning, which is not a desired property. Here, we seek a
better content embedding with more phonetic information preserved. We propose
conditional DSVAE, a new model that incorporates content bias as a condition for
prior modeling and reshapes the content embedding sampled from the posterior
distribution. In our experiments on the VCTK dataset, we demonstrate that
content embeddings derived from the conditional DSVAE overcome the randomness
and achieve much better phoneme classification accuracy, stabilized
vocalization, and better zero-shot VC performance than the competitive DSVAE
baseline.
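The abstract describes the key modification only at a high level: replace the randomly initialized, unconditional prior of the content branch with a prior conditioned on a content bias. The PyTorch sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and the module names, dimensions, and the choice of content bias (frame-level phonetic features from some pretrained recognizer) are illustrative.

# Minimal sketch (not the paper's code): a conditional prior for the content
# branch of a DSVAE-style model. The "content bias" c is assumed to be a
# sequence of frame-level phonetic features; all dimensions are illustrative.
import torch
import torch.nn as nn

class ContentPosterior(nn.Module):
    """q(z_c | x): per-frame content posterior from an utterance."""
    def __init__(self, feat_dim=80, z_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True, bidirectional=True)
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)

    def forward(self, x):                      # x: (B, T, feat_dim)
        h, _ = self.rnn(x)
        return self.mu(h), self.logvar(h)      # (B, T, z_dim) each

class ConditionalContentPrior(nn.Module):
    """p(z_c | c): prior conditioned on a content bias c, instead of a
    randomly initialized prior with no phonetic anchor."""
    def __init__(self, bias_dim=256, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bias_dim, 128), nn.Tanh())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, c):                      # c: (B, T, bias_dim)
        h = self.net(c)
        return self.mu(h), self.logvar(h)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over z, averaged over B and T."""
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0)
    return kl.sum(-1).mean()

if __name__ == "__main__":
    B, T = 2, 100
    x = torch.randn(B, T, 80)        # mel-spectrogram frames
    c = torch.randn(B, T, 256)       # assumed content bias (phonetic features)

    posterior = ContentPosterior()
    prior = ConditionalContentPrior()

    mu_q, logvar_q = posterior(x)
    mu_p, logvar_p = prior(c)        # prior now tracks the phonetic structure of x
    kl_content = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    print(kl_content.item())

At conversion time nothing in this sketch changes: the content embedding sampled from the posterior of a source utterance is still combined with a speaker embedding from a target utterance and decoded; the conditional prior only changes how the content branch is regularized during training.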
Related papers
- CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching [7.144608815694702]
CTEFM-VC is a framework that disentangles utterances into linguistic content and timbre representations.
To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach.
arXiv Detail & Related papers (2024-11-04T12:23:17Z)
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning in which the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion [17.274784447811665]
We adopt the end-to-end framework of VITS for high-quality waveform reconstruction.
We disentangle content information by imposing an information bottleneck on WavLM features.
We propose a spectrogram-resize-based data augmentation to improve the purity of the extracted content information (a minimal sketch of this style of augmentation appears after this list).
arXiv Detail & Related papers (2022-10-27T13:32:38Z)
- Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and remain robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents [60.785317191131284]
We introduce a simple and effective method for learning VAEs with controllable biases by using an intermediary set of latent variables.
In particular, it allows us to impose desired properties like sparsity or clustering on learned representations.
We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.
arXiv Detail & Related papers (2021-06-25T16:34:05Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
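The FreeVC entry above refers to a spectrogram-resize data augmentation. The sketch below shows one plausible way to implement that kind of augmentation: rescale the mel-spectrogram along the frequency axis by a random ratio, then crop or pad it back to the original number of mel bins, which perturbs speaker-related spectral structure while largely preserving content. The function name and the exact resize/pad policy are assumptions, not FreeVC's actual code.

# Hedged sketch of a spectrogram-resize style augmentation (not the FreeVC code).
# Idea: rescale the mel-spectrogram along the frequency axis by a random ratio,
# then crop or pad back to the original number of mel bins.
import torch
import torch.nn.functional as F

def spectrogram_resize(mel: torch.Tensor, ratio: float) -> torch.Tensor:
    """mel: (n_mels, T). ratio > 1 stretches the frequency axis, < 1 compresses it."""
    n_mels, T = mel.shape
    new_bins = max(1, int(round(n_mels * ratio)))
    # Interpolate along the frequency dimension only.
    resized = F.interpolate(
        mel.unsqueeze(0).unsqueeze(0),            # (1, 1, n_mels, T)
        size=(new_bins, T),
        mode="bilinear",
        align_corners=False,
    ).squeeze(0).squeeze(0)                        # (new_bins, T)
    if new_bins >= n_mels:
        return resized[:n_mels]                    # crop back to n_mels bins
    # pad the missing high-frequency bins with the quietest value observed
    pad_val = resized.min().item()
    pad = torch.full((n_mels - new_bins, T), pad_val, dtype=mel.dtype)
    return torch.cat([resized, pad], dim=0)

if __name__ == "__main__":
    mel = torch.randn(80, 200)                     # fake 80-bin mel-spectrogram
    ratio = float(torch.empty(1).uniform_(0.85, 1.15))
    print(spectrogram_resize(mel, ratio).shape)    # torch.Size([80, 200])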