Baseline System of Voice Conversion Challenge 2020 with Cyclic
Variational Autoencoder and Parallel WaveGAN
- URL: http://arxiv.org/abs/2010.04429v1
- Date: Fri, 9 Oct 2020 08:25:38 GMT
- Title: Baseline System of Voice Conversion Challenge 2020 with Cyclic
Variational Autoencoder and Parallel WaveGAN
- Authors: Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Toda
- Abstract summary: We present a description of the baseline system of Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN (PWG).
The results of VCC 2020 have demonstrated that the CycleVAEPWG baseline achieves the following: 1) a mean opinion score (MOS) of 2.87 in naturalness and a speaker similarity percentage (Sim) of 75.37% for Task 1, and 2) a MOS of 2.56 and a Sim of 56.46% for Task 2.
- Score: 38.21087722927386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a description of the baseline system of Voice
Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder
(CycleVAE) and Parallel WaveGAN (PWG), i.e., CycleVAEPWG. CycleVAE is a
nonparallel VAE-based voice conversion model that feeds converted acoustic
features back into the network so that cyclically reconstructed spectra are
considered during optimization. PWG, on the other hand, is a non-autoregressive
neural vocoder based on a generative adversarial network that provides
high-quality and fast waveform generation.
In practice, the CycleVAEPWG system can be straightforwardly developed with the
VCC 2020 dataset using a unified model for both Task 1 (intralingual) and Task
2 (cross-lingual), where our open-source implementation is available at
https://github.com/bigpon/vcc20_baseline_cyclevae. The results of VCC 2020 have
demonstrated that the CycleVAEPWG baseline achieves the following: 1) a mean
opinion score (MOS) of 2.87 in naturalness and a speaker similarity percentage
(Sim) of 75.37% for Task 1, and 2) a MOS of 2.56 and a Sim of 56.46% for Task
2, showing roughly average scores for naturalness and above-average scores for
speaker similarity.
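To make the cyclic optimization concrete, below is a minimal, illustrative PyTorch sketch of the CycleVAE idea: a speaker-conditioned VAE whose converted spectra are fed back through the encoder so that a cyclically reconstructed spectrum is penalized alongside the usual reconstruction and KL terms. Module sizes, speaker-code handling, and loss weights are hypothetical simplifications for illustration, not the released baseline implementation (see the repository linked above for the actual system).

```python
# Minimal sketch of the CycleVAE training objective described above.
# All dimensions, architectures, and loss weights are illustrative
# placeholders, not the authors' released baseline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCycleVAE(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=32, spk_dim=8, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def encode(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z, mean, logvar

    def decode(self, z, spk_code):
        return self.decoder(torch.cat([z, spk_code], dim=-1))

    def cyclic_objective(self, x_src, spk_src, spk_trg, beta=1.0, gamma=1.0):
        # First pass: reconstruct with the source code, convert with the target code.
        z, mean, logvar = self.encode(x_src)
        x_rec = self.decode(z, spk_src)   # reconstructed source spectrum
        x_cv = self.decode(z, spk_trg)    # converted spectrum (no parallel target needed)
        # Cyclic pass: re-encode the converted spectrum and map back to the source.
        z_cv, mean_cv, logvar_cv = self.encode(x_cv)
        x_cyc = self.decode(z_cv, spk_src)  # cyclically reconstructed spectrum
        rec = F.l1_loss(x_rec, x_src)
        cyc = F.l1_loss(x_cyc, x_src)
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        kl_cv = -0.5 * torch.mean(1 + logvar_cv - mean_cv.pow(2) - logvar_cv.exp())
        return rec + gamma * cyc + beta * (kl + kl_cv)


# Usage with dummy data: 16 frames of 80-dim features and one-hot speaker codes.
model = ToyCycleVAE()
x = torch.randn(16, 80)
spk_a = F.one_hot(torch.zeros(16, dtype=torch.long), 8).float()
spk_b = F.one_hot(torch.ones(16, dtype=torch.long), 8).float()
loss = model.cyclic_objective(x, spk_a, spk_b)
loss.backward()
```

In the baseline described above, converted spectra produced by the CycleVAE stage are then passed to the PWG vocoder for fast, non-autoregressive waveform generation.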
Related papers
- CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
arXiv Detail & Related papers (2022-03-28T23:53:34Z) - Raw Waveform Encoder with Multi-Scale Globally Attentive Locally
Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the AISHELL-2 benchmark dataset and on two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z) - Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied on the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z) - The NU Voice Conversion System for the Voice Conversion Challenge 2020:
On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural
Vocoders [42.636504426142906]
We present the voice conversion systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC 2020).
We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders.
arXiv Detail & Related papers (2020-10-09T09:19:37Z) - The Sequence-to-Sequence Baseline for the Voice Conversion Challenge
2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach to voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcribed text in the target voice with a text-to-speech (TTS) model (a minimal sketch of this cascade appears after this list).
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with an improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z) - Acoustic Scene Classification Using Bilinear Pooling on Time-liked and
Frequency-liked Convolution Neural Network [4.131608702779222]
This paper explores the use of harmonic and percussive source separation (HPSS) to split the audio into harmonic and percussive streams, each of which is processed by its own CNN.
The deep features extracted from these two CNNs are then combined using bilinear pooling (see the sketch after this list).
The model is evaluated on the DCASE 2019 subtask 1a dataset and scored an average of 65% on the development dataset.
arXiv Detail & Related papers (2020-02-14T04:06:32Z) - End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)