FastVC: Fast Voice Conversion with non-parallel data
- URL: http://arxiv.org/abs/2010.04185v1
- Date: Thu, 8 Oct 2020 18:05:30 GMT
- Title: FastVC: Fast Voice Conversion with non-parallel data
- Authors: Oriol Barbany Mayor and Milos Cernak
- Abstract summary: This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC).
FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all.
Despite the simple structure of the proposed model, it outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
- Score: 13.12834490248018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces FastVC, an end-to-end model for fast Voice Conversion
(VC). The proposed model can convert speech of arbitrary length from multiple
source speakers to multiple target speakers. FastVC is based on a conditional
AutoEncoder (AE) trained on non-parallel data and requires no annotations at
all. This model's latent representation is shown to be speaker-independent and
similar to phonemes, which is a desirable property for VC systems. While
current VC systems primarily focus on achieving the highest overall speech
quality, this paper balances speech quality against the computational
resources needed to run the system. Despite the simple structure of the
proposed model,
it outperforms the VC Challenge 2020 baselines on the cross-lingual task in
terms of naturalness.
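The paper does not include code, but the conditional-AE idea the abstract describes can be illustrated with a minimal sketch. Everything below (module names, layer sizes, mel-spectrogram inputs) is an illustrative assumption, not the FastVC authors' architecture:

```python
# Minimal sketch of a conditional autoencoder for VC (PyTorch).
# All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64, n_speakers=10, spk_dim=32):
        super().__init__()
        # Content encoder: learns a (hopefully) speaker-independent latent.
        self.encoder = nn.GRU(n_mels, latent_dim, batch_first=True)
        # Target-speaker embedding conditions the decoder.
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.decoder = nn.GRU(latent_dim + spk_dim, n_mels, batch_first=True)

    def forward(self, mel, target_spk):
        # mel: (batch, frames, n_mels); target_spk: (batch,) speaker indices
        content, _ = self.encoder(mel)
        spk = self.spk_emb(target_spk).unsqueeze(1).expand(-1, mel.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out

# Training on non-parallel data: reconstruct each utterance conditioned on
# its own speaker; at conversion time, swap in the target speaker's index.
model = ConditionalAE()
converted = model(torch.randn(2, 100, 80), torch.tensor([3, 7]))
```

Because the objective is plain reconstruction conditioned on speaker identity, no parallel utterance pairs or annotations are needed, which matches the non-parallel training setup the abstract describes.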
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
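As a rough illustration of the kind of cross-modal contrastive objective described above (a sketch of the general technique, not the VQ-CTAP code; the embedding dimension and temperature are assumptions):

```python
# Generic InfoNCE-style contrastive alignment between paired text-token and
# speech-frame embeddings; an illustrative sketch, not the VQ-CTAP code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, speech_emb, temperature=0.1):
    # text_emb, speech_emb: (n_pairs, dim); row i of each modality is a
    # positive pair, all other rows serve as in-batch negatives.
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature  # cosine similarities
    targets = torch.arange(text_emb.size(0))
    # Symmetric cross-entropy: match text->speech and speech->text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```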
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation [19.807274303199755]
We propose a novel data augmentation method that combines pitch-shifting and VC techniques.
Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models.
Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
arXiv Detail & Related papers (2022-04-21T11:03:37Z)
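A minimal sketch of what pitch-shift data augmentation typically looks like in practice, using librosa; the semitone range and the file paths are illustrative assumptions, not the paper's recipe:

```python
# Pitch-shift augmentation for VC/TTS training data, sketched with librosa.
# The +/-4 semitone range and file names are illustrative assumptions.
import librosa
import soundfile as sf

def augment_with_pitch_shifts(wav_path, out_prefix, steps=(-4, -2, 2, 4)):
    y, sr = librosa.load(wav_path, sr=None)  # keep the original sample rate
    for n in steps:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
        sf.write(f"{out_prefix}_shift{n:+d}.wav", shifted, sr)

# Each source utterance yields several pitch-shifted copies, widening the
# pitch dynamics the VC and TTS models see during training.
augment_with_pitch_shifts("speaker1_utt001.wav", "aug/speaker1_utt001")
```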
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
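For context, a minimal sketch of vector-quantized content encoding with the straight-through gradient estimator (codebook size and dimensions are assumptions; this is not the VQMIVC implementation):

```python
# Minimal vector-quantization bottleneck (PyTorch) of the kind used for
# content encoding in VC. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, n_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):
        # z: (batch, frames, dim) continuous content features
        # Squared distance from every frame to every codebook entry.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(dim=-1)   # nearest code per frame
        q = self.codebook(idx)      # quantized (discrete) features
        # Straight-through estimator: forward uses q, gradients flow to z.
        q = z + (q - z).detach()
        return q, idx

vq = VQBottleneck()
quantized, codes = vq(torch.randn(2, 100, 64))  # phoneme-like discrete codes
```

The discrete bottleneck discards fine acoustic detail, which is what encourages the content codes to disentangle from speaker identity.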
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the denoising autoencoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [3.3946853660795893]
We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs.
arXiv Detail & Related papers (2021-04-02T08:18:05Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
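A minimal sketch of the bottleneck-extractor-plus-synthesizer recipe above (simplified to frame-synchronous decoding, whereas the paper uses attention-based seq2seq synthesis; all names and sizes are assumptions, not the paper's code):

```python
# Sketch of the BNE + synthesizer recipe: a narrow bottleneck distills
# speaker-independent features; a conditioned decoder produces target mels.
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    # Stands in for an ASR-derived encoder producing narrow "BN" features.
    def __init__(self, n_mels=80, bn_dim=32):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, bn_dim)  # the narrow bottleneck

    def forward(self, mel):
        h, _ = self.rnn(mel)
        return self.proj(h)

class Synthesizer(nn.Module):
    def __init__(self, bn_dim=32, spk_dim=32, n_mels=80, n_speakers=10):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.rnn = nn.LSTM(bn_dim + spk_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_mels)

    def forward(self, bn, target_spk):
        spk = self.spk_emb(target_spk).unsqueeze(1).expand(-1, bn.size(1), -1)
        h, _ = self.rnn(torch.cat([bn, spk], dim=-1))
        return self.out(h)

bne, synth = BottleneckExtractor(), Synthesizer()
mel = torch.randn(1, 200, 80)
converted = synth(bne(mel), torch.tensor([5]))  # any source -> chosen target
```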
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.