Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion
- URL: http://arxiv.org/abs/2208.08757v1
- Date: Thu, 18 Aug 2022 10:36:27 GMT
- Title: Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion
- Authors: SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan
Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang and
Helen Meng
- Abstract summary: One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One-shot voice conversion (VC) with only a single target speaker's speech
for reference has become a hot research topic. Existing works generally disentangle
timbre, while information about pitch, rhythm and content is still mixed together. To
perform one-shot VC effectively while further disentangling these speech components,
we employ random resampling for the pitch and content encoders, and use the
variational contrastive log-ratio upper bound (vCLUB) of mutual information together
with gradient reversal layer based adversarial mutual information learning to ensure
that each part of the latent space contains only the desired disentangled
representation during training. Experiments on the VCTK dataset show that the model
achieves state-of-the-art performance for one-shot VC in terms of naturalness and
intelligibility. In addition, we can transfer the timbre, pitch and rhythm
characteristics of one-shot VC separately by speech representation disentanglement.
Our code, pre-trained models and demo are available at
https://im1eon.github.io/IS2022-SRDVC/.
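The two mechanisms the abstract leans on, the gradient reversal layer (GRL) and the vCLUB mutual-information bound, are compact enough to sketch. Below is a minimal PyTorch sketch of both, assuming a diagonal-Gaussian variational approximation q(y|x); it is illustrative only, not the authors' released code (linked above), and the class names, layer sizes and scaling constants are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the
    backward pass, so the encoder upstream is trained to *fool* whatever
    classifier sits on top of it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class CLUB(nn.Module):
    """Sampled vCLUB upper bound on I(X; Y) (Cheng et al., 2020). Small MLPs
    parameterize a diagonal-Gaussian q(y | x); minimizing the bound w.r.t.
    the encoders pushes the two representations apart."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim), nn.Tanh())

    def loglik(self, x, y):
        """Maximize w.r.t. the estimator's own parameters so that
        q(y | x) tracks the true conditional."""
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(-1).mean()

    def mi_upper_bound(self, x, y):
        """Minimize w.r.t. the encoders that produce x and y."""
        mu, logvar = self.mu(x), self.logvar(x)
        positive = -(y - mu) ** 2 / logvar.exp()
        # Shuffle y within the batch to sample the marginal term.
        negative = -(y[torch.randperm(y.size(0))] - mu) ** 2 / logvar.exp()
        return (positive - negative).sum(-1).mean() / 2.0
```

In a training loop of this shape, pairs of latent branches (content/pitch, content/speaker, and so on) would each pass through `CLUB.mi_upper_bound` as a penalty on the VC loss while `loglik` is maximized in a separate estimator step, and a speaker classifier fed `grad_reverse(content_emb)` supplies the adversarial term that strips speaker identity from the content branch.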
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units [27.619740864818453]
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner.
The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train.
arXiv Detail & Related papers (2022-12-19T18:53:04Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a minimal VQ sketch appears after this list).
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
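The VQ content bottleneck referenced in the VQMIVC entry above can be sketched in a few lines. This is a minimal, hypothetical PyTorch sketch of a VQ-VAE style quantizer with straight-through gradients and a commitment loss; it is not the VQMIVC codebase, and the codebook size, dimensionality and beta weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient:
    discretizing the content branch discards much of the speaker
    information that a continuous bottleneck would leak."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):  # z: (batch, frames, dim) content features
        flat = z.reshape(-1, z.size(-1))
        # Euclidean distance to every codebook vector, then nearest index.
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)
        quantized = self.codebook(indices).view_as(z)
        # Codebook loss pulls codes toward the encoder output;
        # commitment loss keeps the encoder near its chosen codes.
        loss = (F.mse_loss(quantized, z.detach())
                + self.beta * F.mse_loss(z, quantized.detach()))
        # Straight-through: copy gradients from quantized back to z.
        quantized = z + (quantized - z).detach()
        return quantized, loss, indices.view(z.shape[:-1])
```

The quantized codes would feed the decoder, the returned loss is added to the reconstruction objective, and an MI penalty such as the CLUB estimator sketched earlier can then be applied between the content codes and the speaker embedding.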