UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice
Conversion
- URL: http://arxiv.org/abs/2301.03801v1
- Date: Tue, 10 Jan 2023 06:06:57 GMT
- Title: UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice
Conversion
- Authors: Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Jianhua Tao
- Abstract summary: Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
- Score: 63.346825713704625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-speech (TTS) and voice conversion (VC) are two different tasks
that both aim to generate high-quality speech, but from different input
modalities. Given this similarity, this paper proposes UnifySpeech, which
brings TTS and VC into a unified framework for the first time. The model is
based on the assumption that speech can be decoupled into three independent
components: content information, speaker information, and prosody information.
Both TTS and VC can then be regarded as extracting these three kinds of
information from the input and reconstructing the speech from them. In TTS the
content information is derived from the text, while in VC it is derived from
the source speech, so the two tasks share all modules except the speech
content extraction module. Vector quantization and a domain constraint are
applied to bridge the gap between the content domains of TTS and VC. Objective
and subjective evaluations show that by combining the two tasks, TTS obtains
better speaker modeling ability while VC acquires impressive speech content
decoupling capability.
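To make the shared-module idea concrete, here is a minimal PyTorch sketch of the decoupling. Everything in it (module choices, dimensions, GRU encoders, and the simplification of prosody to a single utterance-level vector) is an illustrative assumption rather than the authors' released architecture; the point it demonstrates is that only the content extractor differs between TTS and VC, while a shared vector-quantization codebook pushes both content streams into one domain.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour VQ bottleneck. Quantizing both content paths into
    the same codebook is one way to realize a shared content domain."""
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, T, dim)
        d = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        q = self.codebook(d.argmin(-1))                      # nearest codes
        return z + (q - z).detach()                          # straight-through

class UnifiedSynthesizer(nn.Module):
    """TTS and VC share every module except content extraction."""
    def __init__(self, dim=128, n_phones=100, n_mels=80):
        super().__init__()
        self.text_content = nn.Embedding(n_phones, dim)               # TTS-only
        self.speech_content = nn.GRU(n_mels, dim, batch_first=True)   # VC-only
        self.vq = VectorQuantizer(dim=dim)                            # shared
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)      # shared
        self.prosody_enc = nn.GRU(n_mels, dim, batch_first=True)      # shared
        self.decoder = nn.GRU(3 * dim, n_mels, batch_first=True)      # shared

    def forward(self, content_in, ref_mel, prosody_mel, task="tts"):
        if task == "tts":
            c = self.text_content(content_in)       # content from phoneme ids
        else:
            c, _ = self.speech_content(content_in)  # content from source mel
        c = self.vq(c)                              # map into the shared domain
        _, s = self.speaker_enc(ref_mel)            # utterance-level speaker
        _, p = self.prosody_enc(prosody_mel)        # utterance-level prosody
        s = s[-1].unsqueeze(1).expand(-1, c.size(1), -1)
        p = p[-1].unsqueeze(1).expand(-1, c.size(1), -1)
        mel, _ = self.decoder(torch.cat([c, s, p], dim=-1))
        return mel
```

In joint training, batches from both tasks would update the shared speaker, prosody, and decoder modules, which is what lets TTS inherit VC's speaker modeling and VC inherit TTS's cleaner content representation.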
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
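The frame-level joint space described here can be illustrated with a generic CLIP-style InfoNCE loss over paired phoneme and speech embeddings; this is a common contrastive formulation, offered as an illustration rather than CTAP's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(phone_emb, speech_emb, temperature=0.07):
    """phone_emb, speech_emb: (N, dim) embeddings of N paired frames."""
    p = F.normalize(phone_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = p @ s.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(len(p), device=p.device)
    # symmetric objective: phoneme->speech and speech->phoneme retrieval
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```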
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
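The pseudo-text step is commonly implemented by clustering self-supervised speech features into discrete units; the sketch below assumes precomputed k-means centroids and frame features, which the abstract does not specify.

```python
import torch

def speech_to_units(features, centroids):
    """features: (T, dim) self-supervised frame features;
    centroids: (K, dim) k-means codes. Returns a unit sequence
    usable as pseudo-text tokens for a seq2seq translation model."""
    ids = torch.cdist(features, centroids).argmin(dim=-1)  # (T,) nearest code
    return torch.unique_consecutive(ids)                   # collapse repeats
```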
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
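One simple way to picture a unified discrete representation is a single embedding table covering both text tokens and speech units, with speech ids offset into the shared space; the vocabulary sizes and offset scheme below are assumptions for illustration, not SpeechLM's actual tokenizers.

```python
import torch.nn as nn

class SharedTokenLM(nn.Module):
    def __init__(self, text_vocab=10000, unit_vocab=500, dim=256):
        super().__init__()
        # one embedding table spans both modalities
        self.embed = nn.Embedding(text_vocab + unit_vocab, dim)
        self.unit_offset = text_vocab          # speech units live past text ids
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, is_speech=False):
        if is_speech:
            token_ids = token_ids + self.unit_offset
        return self.encoder(self.embed(token_ids))
```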
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute for the content encoder of VC, forming a multi-speaker VTS system that converts silent video into acoustic units and reconstructs accurate spoken content.
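Because content is carried by discrete acoustic units, swapping the content extractor is structurally a one-line change; the function below uses hypothetical module names purely to show that plug-in structure.

```python
def synthesize(content_input, ref_mel, model, lip2ind=None):
    # hypothetical interfaces: model.speech_content, model.decode, lip2ind
    if lip2ind is not None:
        units = lip2ind(content_input)               # silent video -> acoustic units
    else:
        units = model.speech_content(content_input)  # ordinary VC content path
    return model.decode(units, ref_mel)              # shared decoder reconstructs speech
```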
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module [16.369219400819134]
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
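The post-processing idea amounts to a two-stage pipeline: a multi-speaker TTS generates speech in a generic voice, and a VC module then maps it to the target speaker from the one-minute reference; the component names below are hypothetical.

```python
def voice_filter_tts(text, target_refs, base_tts, vc_module):
    generic_wav = base_tts(text)                    # trained on many speakers
    spk_emb = vc_module.embed_speaker(target_refs)  # ~1 minute of reference audio
    return vc_module.convert(generic_wav, spk_emb)  # re-voice to the target speaker
```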
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
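The MI term is typically realized as a variational upper bound such as CLUB, minimized between content codes and the speaker embedding; the sample-based sketch below assumes a learned Gaussian q_net predicting the speaker embedding from content, and drops constants that cancel between the two terms.

```python
import torch

def club_penalty(z_c, z_s, q_net):
    """z_c: (B, d_c) content codes; z_s: (B, d_s) speaker embeddings.
    q_net(z_c) -> (mu, logvar), a Gaussian approximation of p(z_s | z_c)."""
    mu, logvar = q_net(z_c)
    pos = -((z_s - mu) ** 2 / logvar.exp()).sum(-1)        # matched pairs
    perm = torch.randperm(z_s.size(0))
    neg = -((z_s[perm] - mu) ** 2 / logvar.exp()).sum(-1)  # shuffled pairs
    return (pos - neg).mean()   # minimizing discourages shared information
```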
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram [21.652906261475533]
Cross-lingual voice conversion is a challenging problem due to significant mismatches between the phonetic sets and speech prosody of different languages.
We build upon a neural text-to-speech (TTS) model to design a new cross-lingual VC framework named FastSpeech-VC.
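The phonetic posteriorgram (PPG) bridge works by running a pretrained ASR acoustic model over the source speech to obtain per-frame phoneme posteriors, which are largely speaker-robust content features; the asr_model interface below is an assumed illustration.

```python
import torch

def extract_ppg(mel, asr_model):
    """mel: (B, T, n_mels) -> (B, T, n_phones) phonetic posteriorgram."""
    logits = asr_model(mel)               # frame-level phoneme logits
    return torch.softmax(logits, dim=-1)  # posteriors drive the TTS-style decoder
```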
arXiv Detail & Related papers (2021-02-03T10:28:07Z)