Towards Natural and Controllable Cross-Lingual Voice Conversion Based on
Neural TTS Model and Phonetic Posteriorgram
- URL: http://arxiv.org/abs/2102.01991v1
- Date: Wed, 3 Feb 2021 10:28:07 GMT
- Title: Towards Natural and Controllable Cross-Lingual Voice Conversion Based on
Neural TTS Model and Phonetic Posteriorgram
- Authors: Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma
- Abstract summary: Cross-lingual voice conversion is a challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages.
We build upon the neural text-to-speech (TTS) model to design a new cross-lingual VC framework named FastSpeech-VC.
- Score: 21.652906261475533
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cross-lingual voice conversion (VC) is an important and challenging problem
due to significant mismatches of the phonetic set and the speech prosody of
different languages. In this paper, we build upon the neural text-to-speech
(TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new
cross-lingual VC framework named FastSpeech-VC. We address the mismatches of
the phonetic set and the speech prosody by applying Phonetic PosteriorGrams
(PPGs), which have been proved to bridge across speaker and language
boundaries. Moreover, we add normalized logarithm-scale fundamental frequency
(Log-F0) to further compensate for the prosodic mismatches and significantly
improve naturalness. Our experiments on English and Mandarin languages
demonstrate that with only mono-lingual corpus, the proposed FastSpeech-VC can
achieve high quality converted speech with mean opinion score (MOS) close to
the professional records while maintaining good speaker similarity. Compared to
the baselines using Tacotron2 and Transformer TTS models, the FastSpeech-VC can
achieve controllable converted speech rate and much faster inference speed.
More importantly, the FastSpeech-VC can easily be adapted to a speaker with
limited training utterances.
Related papers
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce textbfGenerative textbfPre-trained textbfSpeech textbfTransformer (GPST)
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec
Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z) - UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice
Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks aiming at generating high quality speaking voice according to different input modality.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Cross-lingual Text-To-Speech with Flow-based Voice Conversion for
Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - Transfer Learning from Monolingual ASR to Transcription-free
Cross-lingual Voice Conversion [0.0]
Cross-lingual voice conversion is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages.
In this paper, we focus on knowledge transfer from monolin-gual ASR to cross-lingual VC.
We successfully address cross-lingual VC without any transcription or language-specific knowledge for foreign speech.
arXiv Detail & Related papers (2020-09-30T13:44:35Z) - MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text to speech (TTS) model (e.g., Transformer TTSciteli 2019neural, FastSpeechciteren 2019fastspeech) has shown the advantages of training and inference efficiency over RNN-based model.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z) - NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.