The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
- URL: http://arxiv.org/abs/2010.02434v1
- Date: Tue, 6 Oct 2020 02:27:38 GMT
- Title: The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
- Authors: Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda
- Abstract summary: This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model and then use the transcription to generate the target speaker's voice with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
- Score: 66.06385966689965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the sequence-to-sequence (seq2seq) baseline system for
the voice conversion challenge (VCC) 2020. We consider a naive approach for
voice conversion (VC), which is to first transcribe the input speech with an
automatic speech recognition (ASR) model, followed by using the transcriptions to
generate the voice of the target with a text-to-speech (TTS) model. We revisit
this method under a sequence-to-sequence (seq2seq) framework by utilizing
ESPnet, an open-source end-to-end speech processing toolkit, and the many
well-configured pretrained models provided by the community. Official
evaluation results show that our system comes out top among the participating
systems in terms of conversion similarity, demonstrating the promising ability
of seq2seq models to convert speaker identity. The implementation is made
open-source at: https://github.com/espnet/espnet/tree/master/egs/vcc20.
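To make the cascade concrete, here is a minimal sketch of the ASR-then-TTS pipeline built from ESPnet2's Speech2Text and Text2Speech inference wrappers. It is not the released egs/vcc20 recipe; the model tags, file names, and the assumption that the TTS model outputs a waveform directly are placeholders chosen for illustration.

```python
# Illustrative ASR -> TTS cascade for voice conversion using ESPnet2 inference
# wrappers. The model tags and file names below are placeholders (assumptions),
# not the models used in the egs/vcc20 recipe.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

# Load pretrained inference wrappers; tags resolve via espnet_model_zoo.
asr = Speech2Text.from_pretrained("espnet/some-asr-model")           # placeholder tag
tts = Text2Speech.from_pretrained("espnet/some-target-speaker-tts")  # placeholder tag
# A separate neural vocoder can be attached with vocoder_tag= if the TTS
# model does not generate waveforms directly.

# 1) Transcribe the source utterance (sample rate must match the ASR model).
speech, rate = sf.read("source_utterance.wav")
text, *_ = asr(speech)[0]  # keep the 1-best hypothesis

# 2) Re-synthesize the text in the target speaker's voice.
wav = tts(text)["wav"]
sf.write("converted.wav", wav.numpy(), tts.fs)
```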
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- Non-autoregressive sequence-to-sequence voice conversion [47.521186595305984]
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence.
arXiv Detail & Related papers (2021-04-14T11:53:51Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- The NU Voice Conversion System for the Voice Conversion Challenge 2020: On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural Vocoders [42.636504426142906]
We present the voice conversion systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC 2020).
We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders.
arXiv Detail & Related papers (2020-10-09T09:19:37Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)