Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training
- URL: http://arxiv.org/abs/2409.01668v2
- Date: Fri, 6 Sep 2024 08:24:19 GMT
- Title: Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training
- Authors: Wenhan Yao, Zedong Xing, Xiarun Chen, Jia Liu, Yongqiang He, Weiping Wen,
- Abstract summary: One-shot voice conversion aims to change the timbre of any source speech to match that of the target speaker with only one speech sample.
Existing style transfer-based VC methods relied on speech representation disentanglement.
We propose Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder.
- Score: 3.9306467064810438
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: One-shot voice conversion(VC) aims to change the timbre of any source speech to match that of the target speaker with only one speech sample. Existing style transfer-based VC methods relied on speech representation disentanglement and suffered from accurately and independently encoding each speech component and recomposing back to converted speech effectively. To tackle this, we proposed Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we used effective styleformer blocks to integrate speaker characteristics effectively into the generated speech. The models used the generative VAE loss for encoding components and triplet loss for unsupervised discriminative training. We applied the styleformer method to Zipformer's shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z) - Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z) - Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice"
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z) - Speaking Style Conversion in the Waveform Domain Using Discrete
Self-Supervised Units [27.619740864818453]
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner.
The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train.
arXiv Detail & Related papers (2022-12-19T18:53:04Z) - Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intellgibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant
Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer)
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.