The NU Voice Conversion System for the Voice Conversion Challenge 2020:
On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural
Vocoders
- URL: http://arxiv.org/abs/2010.04446v1
- Date: Fri, 9 Oct 2020 09:19:37 GMT
- Title: The NU Voice Conversion System for the Voice Conversion Challenge 2020:
On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural
Vocoders
- Authors: Wen-Chin Huang, Patrick Lumban Tobing, Yi-Chiao Wu, Kazuhiro
Kobayashi, Tomoki Toda
- Abstract summary: We present the voice conversion systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC 2020).
We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders.
- Score: 42.636504426142906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present the voice conversion (VC) systems developed at
Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC2020). We
aim to determine the effectiveness of two recent significant technologies in
VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural
vocoders. Two respective systems were developed for the two tasks in the
challenge: for task 1, we adopted the Voice Transformer Network, a
Transformer-based seq2seq VC model, and extended it with synthetic parallel
data to tackle nonparallel data; for task 2, we used the frame-based cyclic
variational autoencoder (CycleVAE) to model the spectral features of a speech
waveform and the AR WaveNet vocoder with additional fine-tuning. By comparing
with the baseline systems, we confirmed that the seq2seq modeling can improve
the conversion similarity and that the use of AR vocoders can improve the
naturalness of the converted speech.
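Both systems share the same two-stage shape: a conversion model predicts target-speaker acoustic features, and an autoregressive (AR) neural vocoder renders the waveform sample by sample. The sketch below illustrates that structure only; the module sizes, the toy Transformer converter, and the GRU-based AR loop are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of the two-stage VC pipeline described in the abstract:
# a seq2seq model maps source mel-spectrogram frames to target-speaker
# frames, and an autoregressive vocoder renders a waveform. All names,
# dimensions, and the greedy sample loop are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqConverter(nn.Module):
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, src_mel, tgt_mel):
        # Teacher-forced training step: encode source frames, decode
        # target frames (right-shift of the decoder input omitted).
        h = self.transformer(self.in_proj(src_mel), self.in_proj(tgt_mel))
        return self.out_proj(h)

class ARVocoder(nn.Module):
    """Stand-in for an AR neural vocoder such as WaveNet: each sample
    is predicted from past samples plus local mel conditioning."""
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(1 + n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def generate(self, mel, samples_per_frame=200):
        cond = mel.repeat_interleave(samples_per_frame, dim=1)
        x = torch.zeros(mel.size(0), 1, 1)
        wav, state = [], None
        for t in range(cond.size(1)):
            step = torch.cat([x, cond[:, t:t + 1]], dim=-1)
            h, state = self.rnn(step, state)
            x = torch.tanh(self.out(h))
            wav.append(x)
        return torch.cat(wav, dim=1)

converter, vocoder = Seq2SeqConverter(), ARVocoder()
src = torch.randn(1, 50, 80)                  # source mel frames
tgt_mel = converter(src, src)                 # converted frames (toy call)
waveform = vocoder.generate(tgt_mel[:, :5])   # short AR synthesis demo
print(waveform.shape)                         # torch.Size([1, 1000, 1])
```
The sample-by-sample loop in generate() is what makes AR vocoders slow at inference but, as the paper's results suggest, strong on naturalness.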
Related papers
- Non-autoregressive sequence-to-sequence voice conversion [47.521186595305984]
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence.
arXiv Detail & Related papers (2021-04-14T11:53:51Z)
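The Conformer layer referenced above interleaves self-attention (global context) with a depthwise-convolution module (local context). Below is a minimal sketch under assumed sizes; the published block additionally uses two macaron-style half feed-forward modules and relative positional encoding, omitted here.
```python
# Sketch of a Conformer-style block: self-attention for global context
# plus a depthwise convolution module for local context. Sizes and the
# exact module ordering are simplified assumptions.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, nhead=4, kernel_size=15):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        # Convolution module: pointwise -> depthwise -> pointwise.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                        # x: (batch, time, d_model)
        a, _ = self.attn(*[self.attn_norm(x)] * 3)
        x = x + a                                # global context
        c = self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                                # local context
        return x + self.ff(self.ff_norm(x))

block = ConformerBlock()
print(block(torch.randn(2, 100, 256)).shape)     # torch.Size([2, 100, 256])
```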
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
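A common way to realize such an any-to-one system is to map any speaker's speech to discrete units with a frozen self-supervised model, then train a decoder on the single target speaker to invert those units. The sketch below fakes the self-supervised features with random tensors; the codebook lookup and GRU decoder are assumptions, not the paper's exact model.
```python
# Sketch of the any-to-one idea: speech from an arbitrary speaker is
# mapped to discrete units, and a decoder trained only on the fixed
# target speaker turns those units back into that speaker's acoustic
# features. Codebook size, dimensions, and the decoder are assumptions.
import torch
import torch.nn as nn

class UnitQuantizer(nn.Module):
    def __init__(self, feat_dim=512, n_units=100):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_units, feat_dim))

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        # Nearest-codebook lookup -> speaker-independent discrete units.
        dists = torch.cdist(feats, self.codebook.unsqueeze(0))
        return dists.argmin(dim=-1)              # (B, T) unit IDs

class TargetDecoder(nn.Module):
    def __init__(self, n_units=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, units):
        h, _ = self.rnn(self.embed(units))
        return self.out(h)                       # target-speaker mels

ssl_feats = torch.randn(1, 120, 512)             # stand-in for SSL features
units = UnitQuantizer()(ssl_feats)
mels = TargetDecoder()(units)
print(units.shape, mels.shape)                   # (1, 120) (1, 120, 80)
```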
- Baseline System of Voice Conversion Challenge 2020 with Cyclic Variational Autoencoder and Parallel WaveGAN [38.21087722927386]
We present a description of the baseline system of Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN (PWG).
The results of VCC 2020 have demonstrated that the CycleVAEPWG baseline achieves the following: 1) a mean opinion score (MOS) of 2.87 in naturalness and a speaker similarity percentage (Sim) of 75.37% for Task 1, and 2) a MOS of 2.56 and a Sim of 56.46% for Task 2.
arXiv Detail & Related papers (2020-10-09T08:25:38Z)
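The CycleVAE at the core of this baseline (and of the NU task 2 system above) trains without parallel data by cycling: encode the source, decode with a target speaker code, then re-encode the conversion and decode back to the source speaker for a reconstruction loss. A toy sketch of that objective follows; the tiny MLP encoder/decoder and the unit loss weighting are assumptions.
```python
# Sketch of the cyclic VAE objective: spectra are encoded to a
# speaker-independent latent, decoded with a speaker code, and the
# converted output is cycled back through the model so a reconstruction
# loss applies even without parallel data. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycleVAE(nn.Module):
    def __init__(self, n_feats=80, n_spk=4, z_dim=32):
        super().__init__()
        self.spk = nn.Embedding(n_spk, 16)
        self.enc = nn.Linear(n_feats, 2 * z_dim)     # -> mean, logvar
        self.dec = nn.Linear(z_dim + 16, n_feats)

    def encode(self, x):
        mean, logvar = self.enc(x).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
        return z, mean, logvar

    def decode(self, z, spk_id):
        code = self.spk(spk_id).expand(z.size(0), z.size(1), -1)
        return self.dec(torch.cat([z, code], dim=-1))

def cycle_loss(model, x_src, src_id, tgt_id):
    # Forward pass: source -> target-speaker conversion.
    z, mean, logvar = model.encode(x_src)
    x_cv = model.decode(z, tgt_id)
    # Cycle pass: converted features back to the source speaker.
    z2, _, _ = model.encode(x_cv)
    x_rec = model.decode(z2, src_id)
    kl = -0.5 * (1 + logvar - mean.pow(2) - logvar.exp()).mean()
    return F.mse_loss(x_rec, x_src) + kl

model = CycleVAE()
x = torch.randn(2, 50, 80)                        # source spectral frames
loss = cycle_loss(model, x, torch.tensor([0]), torch.tensor([1]))
print(loss.item())
```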
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcribed text with a text-to-speech (TTS) model trained on the target speaker.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
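In code, the cascade is just recognition followed by synthesis. The sketch below assumes ESPnet2's Speech2Text and Text2Speech inference interfaces; the model tags and file names are hypothetical placeholders, not the challenge's actual models.
```python
# Sketch of the naive cascade: recognize the source utterance with an
# ASR model, then synthesize the transcript with a TTS model trained on
# the target speaker. Model tags and file names are placeholders.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

asr = Speech2Text.from_pretrained("your/asr_model_tag")   # placeholder tag
tts = Text2Speech.from_pretrained("your/tts_model_tag")   # placeholder tag

speech, rate = sf.read("source_utterance.wav")            # placeholder file
# Best ASR hypothesis is a (text, tokens, token_ids, hyp) tuple.
text, *_ = asr(speech)[0]
# Re-synthesize the transcript in the target speaker's voice.
converted = tts(text)["wav"]
sf.write("converted.wav", converted.numpy(), tts.fs)
```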
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
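The recipe separates "what is said" from "who says it": a bottleneck extractor (typically borrowed from an ASR model) produces near speaker-independent features, and a synthesizer regenerates acoustic features conditioned on a selectable target-speaker embedding. A sketch with assumed module sizes follows; it is not the paper's exact architecture.
```python
# Sketch of the BNE + seq2seq synthesis recipe: a narrow bottleneck
# strips speaker identity, and a many-speaker synthesizer restores
# acoustics for a chosen target. All modules and sizes are assumptions.
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    def __init__(self, n_mels=80, bn_dim=144):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True, bidirectional=True)
        self.bn = nn.Linear(512, bn_dim)          # narrow bottleneck

    def forward(self, mel):
        h, _ = self.rnn(mel)
        return self.bn(h)                         # speaker-independent BNFs

class ManySpeakerSynthesizer(nn.Module):
    def __init__(self, bn_dim=144, n_mels=80, n_spk=10, spk_dim=64):
        super().__init__()
        self.spk = nn.Embedding(n_spk, spk_dim)
        self.rnn = nn.GRU(bn_dim + spk_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, bnf, spk_id):
        code = self.spk(spk_id)[:, None].expand(-1, bnf.size(1), -1)
        h, _ = self.rnn(torch.cat([bnf, code], dim=-1))
        return self.out(h)

bne, synth = BottleneckExtractor(), ManySpeakerSynthesizer()
mel_any_speaker = torch.randn(1, 60, 80)
bnf = bne(mel_any_speaker)
converted = synth(bnf, torch.tensor([3]))         # pick target speaker 3
print(converted.shape)                            # torch.Size([1, 60, 80])
```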
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
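Concretely, such transfer amounts to initializing the VC model's encoder and decoder from ASR/TTS checkpoints before fine-tuning on the small VC corpus. In the sketch below, freshly built modules stand in for checkpoints pretrained on large corpora; the shapes and the low fine-tuning learning rate are assumptions.
```python
# Sketch of the pretraining idea: a seq2seq VC model borrows encoder
# weights from an ASR-style model and decoder weights from a TTS-style
# model, then fine-tunes on the small VC set. Freshly built modules
# stand in for real pretrained checkpoints here.
import torch
import torch.nn as nn

def make_encoder(d_model=256):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, 4, batch_first=True),
        num_layers=3)

def make_decoder(d_model=256):
    return nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, 4, batch_first=True),
        num_layers=3)

# Stand-ins for modules pretrained on large ASR/TTS corpora.
pretrained_asr_encoder = make_encoder()
pretrained_tts_decoder = make_decoder()

# The VC model reuses both; transfer only works when the architectures
# match the pretrained ones.
vc_encoder, vc_decoder = make_encoder(), make_decoder()
vc_encoder.load_state_dict(pretrained_asr_encoder.state_dict())
vc_decoder.load_state_dict(pretrained_tts_decoder.state_dict())

# Fine-tune everything at a low learning rate so the pretrained
# representations are preserved.
params = list(vc_encoder.parameters()) + list(vc_decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

src = torch.randn(1, 40, 256)                     # encoded source frames
out = vc_decoder(torch.randn(1, 40, 256), vc_encoder(src))
print(out.shape)                                  # torch.Size([1, 40, 256])
```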