Non-autoregressive sequence-to-sequence voice conversion
- URL: http://arxiv.org/abs/2104.06793v1
- Date: Wed, 14 Apr 2021 11:53:51 GMT
- Title: Non-autoregressive sequence-to-sequence voice conversion
- Authors: Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
- Abstract summary: This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence.
- Score: 47.521186595305984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel voice conversion (VC) method based on
non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great
success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend
the FastSpeech2 model for the VC problem. We introduce the
convolution-augmented Transformer (Conformer) instead of the Transformer,
making it possible to capture both local and global context information from
the input sequence. Furthermore, we extend variance predictors to variance
converters to explicitly convert the source speaker's prosody components, such
as pitch and energy, into those of the target speaker. An experimental
evaluation on a Japanese speaker dataset of male and female speakers with
1,000 utterances demonstrates that the proposed model enables more stable,
faster, and better conversion than autoregressive S2S (AR-S2S) models such as
Tacotron2 and Transformer.
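The variance-converter idea is concrete enough to sketch. Below is a minimal
PyTorch sketch of such a module: it conditions on the source speaker's
frame-level contour (log-F0 or energy) rather than predicting prosody from
hidden states alone, as a FastSpeech2-style variance predictor would. Layer
sizes, names, and the exact conditioning are illustrative assumptions, not the
authors' implementation.

```python
# Minimal sketch of a FastSpeech2-style "variance converter" (assumption:
# frame-level conditioning on one prosody contour, e.g. log-F0 or energy).
import torch
import torch.nn as nn


class VarianceConverter(nn.Module):
    """Predicts the target speaker's contour from encoder states plus the
    source speaker's contour, then embeds it back into the hidden sequence."""

    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim + 1, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.proj = nn.Linear(hidden_dim, 1)   # hidden -> converted contour
        self.embed = nn.Conv1d(1, hidden_dim, kernel_size, padding=pad)

    def forward(self, hidden, src_contour):
        # hidden: (B, T, H) encoder outputs; src_contour: (B, T) source prosody
        x = torch.cat([hidden.transpose(1, 2), src_contour.unsqueeze(1)], dim=1)
        x = torch.relu(self.conv2(torch.relu(self.conv1(x))))
        tgt_contour = self.proj(x.transpose(1, 2)).squeeze(-1)  # (B, T)
        # add the converted prosody back into the hidden sequence
        hidden = hidden + self.embed(tgt_contour.unsqueeze(1)).transpose(1, 2)
        return hidden, tgt_contour


# toy usage: 2 utterances, 100 frames, 256 hidden units
h, f0 = torch.randn(2, 100, 256), torch.randn(2, 100)
h_out, f0_hat = VarianceConverter()(h, f0)
print(h_out.shape, f0_hat.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100])
```

During training the predicted contour would be supervised with the target
speaker's ground-truth pitch/energy; at inference, the source contour goes in
and the converted one comes out.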
Related papers
- DASpeech: Directed Acyclic Transformer for Fast and High-quality
Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z)
- Structured State Space Decoder for Speech Recognition and Synthesis [9.354721572095272]
The structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks.
In this study, we applied S4 as a decoder for ASR and text-to-speech tasks and compared it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25%.
arXiv Detail & Related papers (2022-10-31T06:54:23Z)
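For readers unfamiliar with S4, the layer is built on the classic linear
state-space recurrence. A minimal NumPy sketch of the discretized recurrence
follows; it omits the HiPPO initialization and the FFT-based convolution view
that make S4 practical at scale, and all sizes are illustrative.

```python
# Sketch of the discretized linear state-space recurrence behind S4:
#   x[t] = A x[t-1] + B u[t],   y[t] = C x[t] + D u[t]
import numpy as np


def ssm_scan(A, B, C, D, u):
    """u: (T,) input sequence -> y: (T,) output sequence."""
    x = np.zeros(A.shape[0])
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        x = A @ x + B * u_t      # state update
        y[t] = C @ x + D * u_t   # readout
    return y


rng = np.random.default_rng(0)
N, T = 16, 50                    # state size and sequence length (illustrative)
A = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N)) / np.sqrt(N)
B, C, D = rng.standard_normal(N), rng.standard_normal(N), 0.0
print(ssm_scan(A, B, C, D, rng.standard_normal(T)).shape)  # (50,)
```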
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, with a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
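The "repeatedly masks and predicts unit choices" decoding that the TranSpeech
summary describes is, in generic form, the mask-predict loop used in
non-autoregressive translation. A sketch under that assumption follows;
`model`, `mask_id`, and the masking schedule are hypothetical stand-ins, not
TranSpeech's actual code.

```python
# Generic mask-predict loop (an assumption about the decoding scheme, not
# TranSpeech's code): start fully masked, keep confident predictions,
# re-mask and re-predict the rest with a linearly decaying schedule.
import torch


def mask_predict(model, src, tgt_len, mask_id, n_iter=4):
    units = torch.full((tgt_len,), mask_id)        # all positions masked
    for it in range(n_iter):
        logits = model(src, units)                 # (tgt_len, vocab) scores
        conf, units = logits.softmax(-1).max(-1)   # greedy units + confidence
        n_mask = tgt_len * (n_iter - 1 - it) // n_iter
        if n_mask > 0:                             # re-mask least confident
            units[conf.argsort()[:n_mask]] = mask_id
    return units


# toy check with a random "model" over a 1000-unit vocabulary
fake_model = lambda src, units: torch.randn(units.shape[0], 1000)
print(mask_predict(fake_model, src=None, tgt_len=8, mask_id=0))
```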
- The NU Voice Conversion System for the Voice Conversion Challenge 2020:
On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural
Vocoders [42.636504426142906]
We present the voice conversion systems developed at Nagoya University (NU) for the Voice Conversion Challenge 2020 (VCC 2020).
We aim to determine the effectiveness of two recent significant technologies in VC: sequence-to-sequence (seq2seq) models and autoregressive (AR) neural vocoders.
arXiv Detail & Related papers (2020-10-09T09:19:37Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge
2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the converted speech from the recognized text with a text-to-speech (TTS) model trained on the target speaker.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
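The cascade reduces to two composed model calls, as the sketch below makes
explicit; `load_asr` and `load_tts` are hypothetical stand-ins for whatever
toolkit supplies the models (the paper's system is built with ESPnet).

```python
# The cascade as two composed model calls; loader names are hypothetical.
def cascade_vc(source_wav, load_asr, load_tts):
    asr = load_asr()                    # pretrained speech recognizer
    tts = load_tts(speaker="target")    # TTS trained on the target speaker
    text = asr(source_wav)              # step 1: source speech -> text
    return tts(text)                    # step 2: text -> converted speech
```

Note that the intermediate text discards the source prosody and timing, which
is what makes the approach naive.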
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
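The transfer recipe the summary describes can be sketched as partial parameter
initialization. The following is a minimal sketch, assuming plain PyTorch
state-dict checkpoints and a VC model with `encoder`/`decoder` submodules;
both are assumptions for illustration, not the paper's code.

```python
# Sketch: initialize a seq2seq VC model from ASR/TTS checkpoints, copying
# only tensors whose names and shapes match. Assumes plain state-dict
# checkpoints and `encoder`/`decoder` submodules (both are assumptions).
import torch


def init_from_pretrained(vc_model, asr_ckpt, tts_ckpt):
    for module, ckpt in ((vc_model.encoder, asr_ckpt),
                         (vc_model.decoder, tts_ckpt)):
        pretrained = torch.load(ckpt, map_location="cpu")
        own = module.state_dict()
        compatible = {k: v for k, v in pretrained.items()
                      if k in own and v.shape == own[k].shape}
        module.load_state_dict(compatible, strict=False)
    return vc_model  # then fine-tune on the (small) VC corpus
```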
- MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS, FastSpeech) have shown advantages in training and inference efficiency over RNN-based models.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
- Relative Positional Encoding for Speech Recognition and Direct
Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
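A minimal single-head sketch of relative position bias, the family of schemes
this paper adapts to the Speech Transformer: each attention logit receives a
learned offset that depends only on the query-key distance, so the model is
not tied to absolute positions in variable-length speech. Sizes and the
clipping distance are illustrative assumptions.

```python
# Single-head attention with a learned relative-position bias: each logit
# gets an offset depending only on query-key distance (clipped), so the
# model is not tied to absolute positions. Sizes are illustrative.
import torch
import torch.nn as nn


class RelPosAttention(nn.Module):
    def __init__(self, dim: int = 256, max_rel: int = 64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.rel_bias = nn.Embedding(2 * max_rel + 1, 1)  # one bias per distance
        self.max_rel = max_rel
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, T, dim)
        T = x.shape[1]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(1, 2) * self.scale        # (B, T, T)
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel)
        logits = logits + self.rel_bias(rel + self.max_rel).squeeze(-1)
        return logits.softmax(-1) @ v


print(RelPosAttention()(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```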