TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and
Adversarial Training
- URL: http://arxiv.org/abs/2208.04035v1
- Date: Mon, 8 Aug 2022 10:33:36 GMT
- Title: TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and
Adversarial Training
- Authors: Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng,
Edward Xiao, Jing Xiao
- Abstract summary: A novel voice conversion framework named $\boldsymbol T$ext $\boldsymbol G$uided $\boldsymbol A$utoVC (TGAVC) is proposed.
Adversarial training is applied to eliminate the speaker identity information in the estimated content embedding extracted from speech.
Experiments on the AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of naturalness and similarity of converted speech.
- Score: 32.35100329067037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-parallel many-to-many voice conversion remains an interesting but
challenging speech processing task. Recently, AutoVC, a conditional
autoencoder-based method, achieved excellent conversion results by disentangling the
speaker identity and the speech content using information-constraining
bottlenecks. However, because the model is trained purely as an autoencoder, it is
difficult to evaluate how well content and speaker identity are separated. In
this paper, a novel voice conversion framework, named $\boldsymbol T$ext
$\boldsymbol G$uided $\boldsymbol A$utoVC (TGAVC), is proposed to more
effectively separate content and timbre from speech, where an expected content
embedding, derived from the text transcriptions, is designed to guide the
extraction of the voice content. In addition, adversarial training is applied
to eliminate the speaker identity information in the estimated content
embedding extracted from speech. Under the guidance of the expected content
embedding and the adversarial training, the content encoder is trained to
extract a speaker-independent content embedding from speech. Experiments on the
AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of the
naturalness and similarity of the converted speech.
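The abstract names three training signals: an autoencoder reconstruction, a text-derived expected content embedding that guides the content encoder, and adversarial training that strips speaker identity from the estimated content embedding. The PyTorch-style sketch below shows one way these signals could be combined; the module interfaces, dimensions, gradient-reversal trick, and equal loss weights are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, negated gradient backward."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


def tgavc_style_losses(content_encoder, decoder, speaker_classifier,
                       mel, speaker_emb, expected_content, speaker_id):
    """Illustrative combination of the three training signals described above.

    mel:              (B, T, 80) input mel-spectrogram
    speaker_emb:      (B, D)     source-speaker embedding fed to the decoder
    expected_content: (B, T, C)  content embedding predicted from the transcription
    speaker_id:       (B,)       speaker labels for the adversarial classifier
    """
    content = content_encoder(mel)                        # estimated content embedding (B, T, C)
    recon = decoder(content, speaker_emb)                 # reconstruct speech from content + timbre

    recon_loss = nn.functional.l1_loss(recon, mel)                       # autoencoder term
    guide_loss = nn.functional.mse_loss(content, expected_content)       # text-guidance term
    logits = speaker_classifier(GradReverse.apply(content).mean(dim=1))  # adversarial branch
    adv_loss = nn.functional.cross_entropy(logits, speaker_id)           # pushes speaker info out of content

    return recon_loss + guide_loss + adv_loss             # equal weighting assumed for illustration
```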
Related papers
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
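The SelfVC summary centers on self-synthesized examples: the current model converts an utterance to another voice and is then trained on its own output. Under that reading, the sketch below shows one such training step; the `convert`/`reconstruct` interface and the sampling scheme are assumptions, and the prosody-prediction submodules mentioned above are omitted.

```python
import random
import torch
import torch.nn.functional as F


def self_transformation_step(vc_model, optimizer, dataset, speaker_ids):
    """One illustrative training step on a self-synthesized input (assumed interface).

    vc_model.convert(mel, target_speaker) -> converted mel (used as the new input)
    vc_model.reconstruct(mel, speaker)    -> model output trained to match the original
    """
    mel, speaker = random.choice(dataset)                      # (1, T, 80) mel and its speaker id
    with torch.no_grad():
        other = random.choice([s for s in speaker_ids if s != speaker])
        self_input = vc_model.convert(mel, other)              # self-synthesized example

    recon = vc_model.reconstruct(self_input, speaker)          # try to recover the original voice
    loss = F.l1_loss(recon, mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```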
arXiv Detail & Related papers (2023-10-14T19:51:17Z) - Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
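Here the converter is described as a neural language model over discrete speech codes, conditioned on a free-form text instruction. The toy module below illustrates that code-sequence-in, code-sequence-out formulation; the vocabulary sizes, the simple concatenation of instruction and code embeddings, and the omission of a causal mask are simplifying assumptions.

```python
import torch
import torch.nn as nn


class InstructionGuidedVC(nn.Module):
    """Toy LM over [instruction tokens ; source speech codes ; target speech codes]."""

    def __init__(self, text_vocab=32000, code_vocab=1024, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.code_emb = nn.Embedding(code_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)  # causal mask omitted for brevity
        self.head = nn.Linear(dim, code_vocab)

    def forward(self, instruction_ids, source_codes, target_codes):
        x = torch.cat([self.text_emb(instruction_ids),
                       self.code_emb(source_codes),
                       self.code_emb(target_codes)], dim=1)
        h = self.backbone(x)
        # logits over the next converted-speech code at each target position
        return self.head(h[:, -target_codes.size(1):])
```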
arXiv Detail & Related papers (2023-09-25T17:52:09Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
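CTAP is summarized as two encoders that pull paired phoneme and speech inputs into a joint multimodal space. A minimal symmetric contrastive (InfoNCE) objective over pooled embeddings is sketched below; pooling to one vector per utterance and the temperature value are simplifying assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_pretraining_loss(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE between paired phoneme and speech embeddings.

    phoneme_emb, speech_emb: (B, D) pooled utterance-level embeddings (assumed pooling);
    the pair at the same batch index is the positive, all others are negatives.
    """
    p = F.normalize(phoneme_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = p @ s.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```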
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
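UTUT treats discrete speech units as pseudo-text, so many-to-many translation reduces to mapping one unit sequence to another. The toy encoder-decoder below illustrates that framing; the unit vocabulary size and the use of language-ID tokens as a one-token prefix on each side are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn


class UnitToUnitTranslator(nn.Module):
    """Toy seq2seq over discrete speech units (pseudo-text)."""

    def __init__(self, n_units=1000, n_langs=16, dim=512):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)
        self.seq2seq = nn.Transformer(d_model=dim, nhead=8,
                                      num_encoder_layers=4, num_decoder_layers=4,
                                      batch_first=True)
        self.head = nn.Linear(dim, n_units)

    def forward(self, src_units, src_lang, tgt_units, tgt_lang):
        # prepend a language token to each side as one plausible conditioning scheme
        src = torch.cat([self.lang_emb(src_lang).unsqueeze(1), self.unit_emb(src_units)], dim=1)
        tgt = torch.cat([self.lang_emb(tgt_lang).unsqueeze(1), self.unit_emb(tgt_units)], dim=1)
        out = self.seq2seq(src, tgt)
        return self.head(out)            # logits over target units at each position
```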
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Adversarial Speaker Disentanglement Using Unannotated External Data for
Self-supervised Representation Based Voice Conversion [35.23123094710891]
We propose a high-similarity any-to-one voice conversion method with the input of SSL representations.
Experimental results show that our proposed method achieves comparable similarity to, and higher naturalness than, the supervised method.
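The method above converts from any source speaker to one fixed target voice, starting from SSL representations, with residual speaker information removed adversarially during training. The sketch below shows only the inference-time shape of such a pipeline; the layer sizes are arbitrary and the adversarial disentanglement itself (typically a speaker classifier with a reversed gradient) is not shown.

```python
import torch
import torch.nn as nn


class AnyToOneVC(nn.Module):
    """Toy any-to-one converter: SSL features from any speaker in, target-voice mel out."""

    def __init__(self, ssl_dim=768, hidden=256, n_mels=80):
        super().__init__()
        # content extractor on top of (frozen) SSL features; residual speaker cues are
        # assumed to be suppressed adversarially during training, which is not shown here
        self.content = nn.Sequential(nn.Linear(ssl_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.decoder = nn.GRU(hidden, n_mels, batch_first=True)  # decodes only the single target voice

    def forward(self, ssl_feats):                 # (B, T, ssl_dim)
        content = self.content(ssl_feats)         # speaker-reduced content representation
        mel, _ = self.decoder(content)            # (B, T, n_mels) in the target speaker's timbre
        return mel
```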
arXiv Detail & Related papers (2023-05-16T04:52:29Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
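SpeechLM's summary rests on discrete tokenizers that map speech and text into one shared representation consumed by a common model. The sketch below shows that shared-vocabulary interface; the placeholder tokenizer comments and the vocabulary size are assumptions and do not correspond to SpeechLM's actual tokenizers.

```python
import torch
import torch.nn as nn


class SharedTokenEncoder(nn.Module):
    """Toy encoder over a unified discrete vocabulary shared by speech and text."""

    def __init__(self, vocab_size=1024, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, token_ids):
        # token_ids may come from a speech tokenizer (discrete units from audio) or from a
        # phoneme/text tokenizer; both are assumed to map into the same shared vocabulary
        return self.encoder(self.emb(token_ids))
```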
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Robust Disentangled Variational Speech Representation Learning for
Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
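The conversion recipe stated above is explicit: infer content embeddings from the source utterance, a speaker embedding from the (unseen) target utterance, and feed both to the sequential VAE decoder. The short sketch below shows that recombination step; the `encode_content`/`encode_speaker`/`decode` interface is an assumed simplification of the model.

```python
import torch


@torch.no_grad()
def zero_shot_convert(vae, source_mel, target_mel):
    """Illustrative zero-shot conversion with a disentangled sequential VAE.

    Assumed interface:
      vae.encode_content(mel) -> (B, T, C) per-frame content latents
      vae.encode_speaker(mel) -> (B, D)    utterance-level speaker latent
      vae.decode(content, speaker) -> (B, T, n_mels)
    """
    content = vae.encode_content(source_mel)   # what is said (from the source utterance)
    speaker = vae.encode_speaker(target_mel)   # who says it (from an unseen target speaker)
    return vae.decode(content, speaker)        # converted mel-spectrogram
```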
arXiv Detail & Related papers (2022-03-30T23:03:19Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
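For text-based insertion, the summary points to a single-stage, context-aware transformer that generates the mel-spectrogram of the edited span so it blends with the surrounding narration. The toy module below illustrates conditioning on left/right acoustic context plus the inserted text; the architecture, the one-frame-per-text-token readout, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ContextAwareInserter(nn.Module):
    """Toy one-stage editor: surrounding mel context + inserted text -> inserted mel."""

    def __init__(self, n_mels=80, text_vocab=256, dim=256):
        super().__init__()
        self.mel_proj = nn.Linear(n_mels, dim)
        self.text_emb = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, left_mel, right_mel, new_text_ids):
        # condition on acoustic context on both sides plus the text to insert
        seq = torch.cat([self.mel_proj(left_mel),
                         self.text_emb(new_text_ids),
                         self.mel_proj(right_mel)], dim=1)
        h = self.backbone(seq)
        start, length = left_mel.size(1), new_text_ids.size(1)
        # read one generated mel frame per inserted text position (a simplification)
        return self.out(h[:, start:start + length])
```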
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
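VQMIVC is summarized by two ingredients: vector quantization of the content representation and a mutual-information-based correlation measure used during training to keep the disentangled factors apart. The sketch below shows a straight-through VQ step and a crude correlation penalty standing in for the MI term; the real MI estimation (e.g., a variational bound) is more involved than this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Minimal straight-through vector quantizer for content encoding."""

    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                                   # z: (B, T, dim) content features
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        idx = idx.reshape(z.shape[:-1])                     # nearest code index per frame
        q = self.codebook(idx)
        commit_loss = F.mse_loss(z, q.detach())             # commitment term
        q = z + (q - z).detach()                            # straight-through gradient estimator
        return q, idx, commit_loss


def correlation_penalty(content, speaker):
    """Crude stand-in for the MI term: penalize cross-correlation between the
    time-averaged content representation (B, T, C) and the speaker embedding (B, D)."""
    c = content.mean(dim=1)
    c = c - c.mean(dim=0)
    s = speaker - speaker.mean(dim=0)
    return (c.t() @ s).pow(2).mean()
```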
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.