Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion
- URL: http://arxiv.org/abs/2208.08757v1
- Date: Thu, 18 Aug 2022 10:36:27 GMT
- Title: Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion
- Authors: SiCheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu, Aolan
Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang and
Helen Meng
- Abstract summary: One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One-shot voice conversion (VC) with only a single target speaker's speech
for reference has become a hot research topic. Existing works generally disentangle
timbre, while information about pitch, rhythm and content is still mixed together. To
perform one-shot VC effectively while further disentangling these speech components,
we employ random resampling for the pitch and content encoders, and use the
variational contrastive log-ratio upper bound (vCLUB) of mutual information together
with gradient reversal layer based adversarial mutual information learning to ensure
that each part of the latent space contains only the desired disentangled
representation during training. Experiments on the VCTK dataset show that the model
achieves state-of-the-art performance for one-shot VC in terms of naturalness and
intelligibility. In addition, we can transfer the timbre, pitch and rhythm
characteristics of one-shot VC separately by speech representation disentanglement.
Our code, pre-trained models and demo are available at
https://im1eon.github.io/IS2022-SRDVC/.
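The two mechanisms the abstract leans on, the gradient reversal layer (GRL) and the vCLUB mutual-information bound, are compact enough to sketch. Below is a minimal PyTorch sketch of both, assuming a diagonal-Gaussian variational approximation q(y|x); it is illustrative only, not the authors' released code (linked above), and the class names, layer sizes and scaling constants are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the
    backward pass, so the encoder upstream is trained to *fool* whatever
    classifier sits on top of it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class CLUB(nn.Module):
    """Sampled vCLUB upper bound on I(X; Y) (Cheng et al., 2020). Small MLPs
    parameterize a diagonal-Gaussian q(y | x); minimizing the bound w.r.t.
    the encoders pushes the two representations apart."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim), nn.Tanh())

    def loglik(self, x, y):
        """Maximize w.r.t. the estimator's own parameters so that
        q(y | x) tracks the true conditional."""
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(-1).mean()

    def mi_upper_bound(self, x, y):
        """Minimize w.r.t. the encoders that produce x and y."""
        mu, logvar = self.mu(x), self.logvar(x)
        positive = -(y - mu) ** 2 / logvar.exp()
        # Shuffle y within the batch to sample the marginal term.
        negative = -(y[torch.randperm(y.size(0))] - mu) ** 2 / logvar.exp()
        return (positive - negative).sum(-1).mean() / 2.0
```

In a training loop of this shape, pairs of latent branches (content/pitch, content/speaker, and so on) would each pass through `CLUB.mi_upper_bound` as a penalty on the VC loss while `loglik` is maximized in a separate estimator step, and a speaker classifier fed `grad_reverse(content_emb)` supplies the adversarial term that strips speaker identity from the content branch.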
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units [27.619740864818453]
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner.
The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train.
arXiv Detail & Related papers (2022-12-19T18:53:04Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a minimal VQ sketch appears after this list).
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
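The VQ content bottleneck referenced in the VQMIVC entry above can be sketched in a few lines. This is a minimal, hypothetical PyTorch sketch of a VQ-VAE style quantizer with straight-through gradients and a commitment loss; it is not the VQMIVC codebase, and the codebook size, dimensionality and beta weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient:
    discretizing the content branch discards much of the speaker
    information that a continuous bottleneck would leak."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):  # z: (batch, frames, dim) content features
        flat = z.reshape(-1, z.size(-1))
        # Euclidean distance to every codebook vector, then nearest index.
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)
        quantized = self.codebook(indices).view_as(z)
        # Codebook loss pulls codes toward the encoder output;
        # commitment loss keeps the encoder near its chosen codes.
        loss = (F.mse_loss(quantized, z.detach())
                + self.beta * F.mse_loss(z, quantized.detach()))
        # Straight-through: copy gradients from quantized back to z.
        quantized = z + (quantized - z).detach()
        return quantized, loss, indices.view(z.shape[:-1])
```

The quantized codes would feed the decoder, the returned loss is added to the reconstruction objective, and an MI penalty such as the CLUB estimator sketched earlier can then be applied between the content codes and the speaker embedding.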