Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques
- URL: http://arxiv.org/abs/2104.00931v1
- Date: Fri, 2 Apr 2021 08:18:05 GMT
- Title: Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques
- Authors: Kang-wook Kim, Seung-won Park and Myun-chul Joe
- Abstract summary: We propose Assem-VC, a new state-of-the-art any-to-many non-parallel voice conversion system.
This paper also introduces GTA finetuning in VC, which significantly improves the quality and speaker similarity of the outputs.
- Score: 3.3946853660795893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we pose the current state-of-the-art voice conversion (VC)
systems as two-encoder-one-decoder models. After comparing these models, we
combine the best features and propose Assem-VC, a new state-of-the-art
any-to-many non-parallel VC system. This paper also introduces GTA finetuning
in VC, which significantly improves the quality and the speaker similarity of
the outputs. Assem-VC outperforms the previous state-of-the-art approaches in
both naturalness and speaker similarity on the VCTK dataset. As an objective
evaluation, we also explore the degree of speaker disentanglement of features
such as phonetic posteriorgrams (PPGs). Our investigation indicates that
many-to-many VC results are no longer distinct from human speech and that
similar quality can be achieved with any-to-many models. Audio samples are
available at https://mindslab-ai.github.io/assem-vc/
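To make the two-encoder-one-decoder framing concrete, here is a minimal PyTorch-style sketch of that structure. The module choices and names are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a two-encoder-one-decoder VC model (illustrative
# module choices, not the authors' implementation). A content encoder
# strips speaker identity from the source utterance, a speaker encoder
# embeds the target voice, and a decoder renders the converted features.
import torch
import torch.nn as nn

class TwoEncoderOneDecoderVC(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=128):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        self.speaker_encoder = nn.GRU(n_mels, speaker_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

    def forward(self, source_mel, target_speaker_mel):
        # Frame-level, ideally speaker-independent content features.
        content, _ = self.content_encoder(source_mel)          # (B, T, C)
        # One utterance-level embedding for the target speaker.
        _, spk = self.speaker_encoder(target_speaker_mel)      # (1, B, S)
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        # Decode content conditioned on the target speaker.
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out                                             # (B, T, n_mels)
```

Broadly, GTA (ground-truth-aligned) finetuning, as introduced in the abstract, trains the later stages of the pipeline on the acoustic model's teacher-forced outputs instead of ground-truth features, so they learn to compensate for upstream prediction errors.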
Related papers
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
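The data2vec-style objective behind the AV-data2vec entry above can be sketched as masked prediction of a momentum teacher's targets. This is a conceptual sketch with assumed interfaces, not the paper's code:

```python
# Conceptual data2vec-style target construction (illustrative, not the
# paper's code): a momentum "teacher" encodes the full input, and the
# student predicts those contextualized targets at masked positions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track an exponential moving average of the student.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

def av_data2vec_step(student, teacher, audio, video, mask):
    with torch.no_grad():
        targets = teacher(audio, video)          # contextualized targets
    preds = student(audio, video, mask=mask)     # predictions from masked view
    return F.mse_loss(preds[mask], targets[mask])
```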
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
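As a rough illustration of the HiFi-VC recipe above (ASR content features plus explicit pitch), the sketch below extracts both streams for one utterance. librosa's pYIN pitch tracker is a real API; `asr_model` and its `encode` method are hypothetical placeholders for the pretrained ASR feature extractor:

```python
# Hedged sketch of an ASR-features + pitch front end for VC.
# `asr_model.encode` is a hypothetical stand-in for a pretrained ASR
# encoder; only the librosa and numpy calls are real APIs.
import librosa
import numpy as np

def extract_features(wav_path, asr_model, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # Pitch tracking with pYIN; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = np.nan_to_num(f0)                 # zero out unvoiced frames
    # Speaker-independent linguistic content from the ASR encoder.
    content = asr_model.encode(y)          # assumed shape: (T, D)
    return content, f0, voiced_flag
```

A waveform generator conditioned on these streams plus a target-speaker embedding would then complete the pipeline.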
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can replace the content encoder of the VC system, forming a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
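The adversarial signal driving the StarGANv2-VC entry above can be sketched with a generic hinge-style GAN objective; this is a conceptual stand-in, since the paper's full loss also includes style, cycle, and other consistency terms:

```python
# Generic hinge-style losses for a speaker-conditional GAN (conceptual
# sketch; not StarGANv2-VC's exact objective). `disc(mel, spk)` returns
# a realness logit for the given speaker; both interfaces are assumed.
import torch.nn.functional as F

def discriminator_loss(disc, real_mel, fake_mel, target_spk):
    real_logit = disc(real_mel, target_spk)
    fake_logit = disc(fake_mel.detach(), target_spk)   # no gradient to G
    return F.relu(1.0 - real_logit).mean() + F.relu(1.0 + fake_logit).mean()

def generator_loss(disc, fake_mel, target_spk):
    # The generator is rewarded when the discriminator calls its output real.
    return -disc(fake_mel, target_spk).mean()
```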
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the de-noising auto-encoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
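The Voicy entry above rests on a de-noising objective: encode a degraded utterance, decode, and compare against the clean original. A minimal training-step sketch, with model internals and names assumed rather than taken from the paper:

```python
# Illustrative de-noising auto-encoder training step (names assumed).
# The paper's four encoder streams (speaker, content, phonetic,
# acoustic-ASR) are abstracted into a single `encode` call for brevity.
import torch.nn.functional as F

def denoising_step(model, clean_mel, degraded_mel, speaker_mel, optimizer):
    latents = model.encode(degraded_mel, speaker_mel)
    recon = model.decode(latents)
    # The target is the *clean* mel: this makes the objective de-noising.
    loss = F.l1_loss(recon, clean_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```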
- FastVC: Fast Voice Conversion with non-parallel data [13.12834490248018]
This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC).
FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all.
Despite the simple structure of the proposed model, it outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
arXiv Detail & Related papers (2020-10-08T18:05:30Z)
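The conditional auto-encoder trick that lets FastVC train on non-parallel data fits in a few lines: training only ever reconstructs an utterance under its own speaker condition, and conversion simply swaps that condition at inference. A hedged sketch (the real architecture differs in detail):

```python
# Conditional auto-encoder pattern for non-parallel VC (illustrative).
# `ae(mel, speaker_id)` is an assumed interface, not FastVC's API.
import torch.nn.functional as F

def train_step(ae, mel, own_speaker_id):
    # No parallel pairs needed: just reconstruct with the true speaker.
    recon = ae(mel, own_speaker_id)
    return F.l1_loss(recon, mel)

def convert(ae, source_mel, target_speaker_id):
    # At inference, the speaker condition is swapped to convert the voice.
    return ae(source_mel, target_speaker_id)
```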
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
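To illustrate the quantization idea underlying the VQVC+ entry above: a small learned codebook keeps frame-level content while the residual carries speaker information. The sketch below is a generic VQ bottleneck with a straight-through gradient; module names are assumptions, not the paper's code:

```python
# Generic vector-quantization bottleneck (illustrative of the VQVC line
# of work, not the paper's code). Quantizing against a small codebook
# discards speaker detail; the residual carries speaker information.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, codebook_size=64, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        # Nearest codebook entry per frame.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = self.codebook(torch.cdist(z, book).argmin(dim=-1))
        # Straight-through estimator: quantized forward, identity backward.
        quantized = z + (codes - z).detach()
        return quantized, z - codes         # content code, speaker residual
```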
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)