NAUTILUS: a Versatile Voice Cloning System
- URL: http://arxiv.org/abs/2005.11004v2
- Date: Wed, 7 Oct 2020 01:12:22 GMT
- Title: NAUTILUS: a Versatile Voice Cloning System
- Authors: Hieu-Thi Luong, Junichi Yamagishi
- Abstract summary: NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices from untranscribed speech of target speakers using the backpropagation algorithm.
It achieves quality comparable to state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices from untranscribed speech of target
speakers using the backpropagation algorithm. Moreover, depending on
the data circumstances of the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and to modify the behaviors of
the text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolution layers to model the encoders, the decoders, and a WaveNet vocoder.
Evaluations show that it achieves quality comparable to state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, the proposed framework is demonstrated to switch between TTS and VC
with high speaker consistency, which is useful for many applications.
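
To make the cloning mechanism concrete, here is a minimal PyTorch sketch of the idea described above: a text encoder and a speech encoder share one linguistic latent space, and an unseen voice is cloned by freezing the encoders and backpropagating a reconstruction loss through the decoder on untranscribed target speech. All module shapes, layer choices, and the training loop are illustrative assumptions, not the actual NAUTILUS configuration.

```python
import torch
import torch.nn as nn

LATENT, MEL = 128, 80  # illustrative sizes

class TextEncoder(nn.Module):
    """TTS path: maps phoneme IDs into the shared linguistic latent space."""
    def __init__(self, n_phones=70):
        super().__init__()
        self.embed = nn.Embedding(n_phones, LATENT)
        self.conv = nn.Conv1d(LATENT, LATENT, kernel_size=5, padding=2)

    def forward(self, phones):                  # (B, T)
        x = self.embed(phones).transpose(1, 2)  # (B, LATENT, T)
        return torch.tanh(self.conv(x))

class SpeechEncoder(nn.Module):
    """VC path: maps mel-spectrograms into the same latent space."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(MEL, LATENT, kernel_size=5, padding=2)

    def forward(self, mels):                    # (B, MEL, T)
        return torch.tanh(self.conv(mels))

class Decoder(nn.Module):
    """Generates mel-spectrograms from the shared latent; this is the
    part that gets adapted to a new voice."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(LATENT, MEL, kernel_size=5, padding=2)

    def forward(self, z):
        return self.conv(z)

# Cloning an unseen voice from untranscribed speech: freeze the (pre-trained)
# speech encoder and backpropagate a reconstruction loss through the decoder.
speech_enc, decoder = SpeechEncoder(), Decoder()
for p in speech_enc.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
target_mels = torch.randn(1, MEL, 200)  # stands in for ~5 min of target speech
for _ in range(100):
    z = speech_enc(target_mels)
    loss = nn.functional.l1_loss(decoder(z), target_mels)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After adaptation, TextEncoder -> Decoder gives TTS in the cloned voice and
# SpeechEncoder -> Decoder gives VC, with one consistent target speaker.
```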
Related papers
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility, as measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can replace the content encoder of the VC system, forming a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When trained on reduced amounts of data, standard TTS models suffer from degraded speech quality and intelligibility.
We propose a novel, extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker (see the sketch after this entry).
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
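
The "VC as a post-processing module" idea reduces to a two-stage pipeline. The sketch below uses hypothetical stand-in models; `TinyTTS`, `TinyVC`, and all tensor shapes are fabricated for illustration and are not the paper's implementation.

```python
import torch

class TinyTTS:
    """Stand-in for a pre-trained TTS in a well-resourced source voice
    (hypothetical; a real system returns mel-spectrograms from text)."""
    def synthesize(self, text: str) -> torch.Tensor:
        return torch.randn(80, 10 * len(text))   # fake (mel_bins, frames)

class TinyVC:
    """Stand-in for a one-shot voice conversion module trained on as
    little as one minute of target speech (hypothetical)."""
    def convert(self, mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        return mel + 0.0 * spk_emb.sum()          # placeholder conversion

def voice_filter(text, spk_emb, tts, vc):
    source_mel = tts.synthesize(text)        # 1) high-quality synthesis
    return vc.convert(source_mel, spk_emb)   # 2) convert to the target voice

out = voice_filter("hello world", torch.randn(256), TinyTTS(), TinyVC())
```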
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data
We propose using low-quality code-switched found data from non-target speakers to achieve cross-lingual voice cloning for target speakers.
Experiments show that the proposed method can generate code-switched speech in the target voices with high naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a minimal VQ sketch follows this entry).
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
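
The VQ content bottleneck is the most mechanical part of this recipe; below is a VQ-VAE-style sketch with a straight-through estimator. Codebook size, feature dimension, and loss weights are assumptions, and the paper's MI-based decorrelation term is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """VQ-VAE-style content bottleneck with a straight-through estimator.
    Codebook size, dimension, and commitment weight are assumptions."""
    def __init__(self, n_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                          # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))           # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=1)                  # nearest code per frame
        q = self.codebook(idx).view_as(z)
        # Codebook loss pulls codes toward the encoder output; the
        # commitment loss keeps the encoder output close to its code.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                   # straight-through gradient
        return q, idx.view(z.shape[:-1]), vq_loss

vq = VectorQuantizer()
content = torch.randn(2, 100, 64)                  # content-encoder output
quantized, codes, vq_loss = vq(content)
```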
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
We develop AdaSpeech 2, an adaptive TTS system that leverages only untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder into a well-trained TTS model to perform speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and fine-tune only the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module (sketched after this entry).
Objective and subjective evaluations show that the proposed any-to-many approach achieves superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
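
A minimal sketch of the BNE-plus-synthesizer pipeline: a pre-trained ASR encoder yields speaker-independent bottleneck features, and a synthesis module conditioned on a target-speaker embedding regenerates mels. GRU stand-ins replace both the ASR encoder and the seq2seq module (a real implementation would use attention), and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """Stand-in for an ASR encoder whose hidden layer yields
    speaker-independent 'bottleneck' features (dims are assumptions)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
        self.bottleneck = nn.Linear(256, 64)

    def forward(self, mels):                    # (B, T, 80)
        h, _ = self.rnn(mels)
        return self.bottleneck(h)               # (B, T, 64)

class ManySpeakerSynth(nn.Module):
    """Generates mels from bottleneck features plus a target-speaker
    embedding; a GRU keeps the sketch short."""
    def __init__(self, n_speakers=100):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, 64)
        self.rnn = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        self.out = nn.Linear(256, 80)

    def forward(self, bnf, speaker_id):         # (B, T, 64), (B,)
        s = self.spk(speaker_id)[:, None].expand(-1, bnf.size(1), -1)
        h, _ = self.rnn(torch.cat([bnf, s], dim=-1))
        return self.out(h)                      # (B, T, 80)

bne, synth = BottleneckExtractor(), ManySpeakerSynth()
src_mels = torch.randn(1, 200, 80)              # any source speaker
converted = synth(bne(src_mels), torch.tensor([7]))  # target speaker #7
```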
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We find that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
Cotatron is a transcription-guided speech encoder for speaker-independent linguistic representation (see the sketch after this entry).
We train a voice conversion system to reconstruct speech from Cotatron features, similar to previous methods.
Our system can also convert speech from speakers unseen during training, and it can use ASR to automate transcription with minimal performance degradation.
arXiv Detail & Related papers (2020-05-07T07:37:31Z)
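
The core trick here, reusing a trained TTS's attention alignment to time-align text encodings with speech frames, reduces to one batched matrix multiplication. The tensors below are fabricated placeholders; in the real system they come from a Tacotron-style model.

```python
import torch

T_text, T_mel, D = 30, 200, 128                  # fabricated sizes
text_encoding = torch.randn(1, T_text, D)        # from a trained TTS text encoder
alignment = torch.softmax(torch.randn(1, T_mel, T_text), dim=-1)  # TTS attention

# Each speech frame receives an attention-weighted sum of text encodings,
# yielding a speaker-independent linguistic feature per frame: (B, T_mel, D).
linguistic_features = torch.bmm(alignment, text_encoding)

# A VC decoder then reconstructs mels from these features plus a target
# speaker embedding; swapping the embedding at inference performs conversion.
```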
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.