ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech
- URL: http://arxiv.org/abs/2211.03545v1
- Date: Mon, 7 Nov 2022 13:35:16 GMT
- Title: ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech
- Authors: Xiaoran Fan, Chao Pang, Tian Yuan, He Bai, Renjie Zheng, Pengfei Zhu,
Shuohuan Wang, Junkun Chen, Zeyu Chen, Liang Huang, Yu Sun, Hua Wu
- Abstract summary: We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows significant improvements over speaker-embedding-based multi-speaker TTS methods.
- Score: 58.93395189153713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech representation learning has improved both speech understanding and
speech synthesis tasks for a single language. However, its ability in
cross-lingual scenarios has not been explored. In this paper, we extend the
pretraining method for cross-lingual multi-speaker speech synthesis tasks,
including cross-lingual multi-speaker voice cloning and cross-lingual
multi-speaker speech editing. We propose a speech-text joint pretraining
framework, where we randomly mask the spectrogram and the phonemes given a
speech example and its transcription. By learning to reconstruct the masked
parts of the input in different languages, our model shows significant
improvements over speaker-embedding-based multi-speaker TTS methods. Moreover,
our framework is end-to-end for both training and inference, requiring no
finetuning. In cross-lingual multi-speaker voice cloning and cross-lingual
multi-speaker speech editing tasks, our experiments show that our model
outperforms speaker-embedding-based multi-speaker TTS methods. The code and
model are publicly available at PaddleSpeech.
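
To make the pretraining objective concrete, the following is a minimal sketch of the joint masking step described in the abstract. The masking ratios, the zero-fill for masked frames, and the reserved [MASK] phoneme id are illustrative assumptions, not values from the paper:

    import numpy as np

    MASK_PHONEME_ID = 0  # assumed id reserved for a [MASK] phoneme token

    def mask_speech_and_text(mel, phonemes, frame_ratio=0.3, phone_ratio=0.15, seed=None):
        """Randomly mask mel-spectrogram frames and phoneme ids for joint pretraining.

        mel:      (T, n_mels) float array of spectrogram frames
        phonemes: (N,) int array of phoneme ids from the transcription
        Returns masked copies plus boolean masks marking the reconstruction targets.
        """
        rng = np.random.default_rng(seed)
        frame_mask = rng.random(mel.shape[0]) < frame_ratio
        phone_mask = rng.random(phonemes.shape[0]) < phone_ratio

        mel_masked = mel.copy()
        mel_masked[frame_mask] = 0.0                # hide the selected frames
        phonemes_masked = phonemes.copy()
        phonemes_masked[phone_mask] = MASK_PHONEME_ID
        return mel_masked, phonemes_masked, frame_mask, phone_mask

    def masked_l1_loss(pred_mel, target_mel, frame_mask):
        """Reconstruction loss computed only on the masked frames."""
        return np.abs(pred_mel[frame_mask] - target_mel[frame_mask]).mean()

Under this reading, pretraining reconstructs the hidden frames and phonemes; at inference, masking the region to be cloned or edited yields the generation target directly, which is why no finetuning is needed.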
Related papers
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
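
The "speech units as pseudo-text" idea can be illustrated with a short sketch: quantize frame-level features against a codebook and collapse consecutive repeats, yielding a token sequence a standard translation model can consume. The codebook and the de-duplication step here are generic assumptions, not UTUT's exact pipeline:

    import numpy as np

    def features_to_units(features, codebook):
        """Map frame features (T, D) to nearest-codeword ids, i.e. pseudo-text.

        features: (T, D) array, e.g. self-supervised speech representations
        codebook: (K, D) array of unit centroids (assumed learned by k-means)
        """
        dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        units = dists.argmin(axis=1)                      # (T,) unit ids
        keep = np.concatenate(([True], units[1:] != units[:-1]))
        return units[keep]                                # repeats collapsed

The resulting unit sequences on the source and target sides can then be paired for many-to-many translation without any written text.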
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
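
One way to picture "fuses text-based and speech-based language models" is a single token space: the text vocabulary is extended with discrete audio tokens so one decoder can read and emit both modalities. The sizes below are invented for illustration:

    TEXT_VOCAB_SIZE = 32_000   # assumed size of the text tokenizer's vocabulary
    NUM_AUDIO_UNITS = 1_024    # assumed number of discrete audio units

    def audio_token_id(unit: int) -> int:
        """Audio units are mapped to ids just past the text vocabulary."""
        return TEXT_VOCAB_SIZE + unit

    def mixed_sequence(text_ids, audio_units):
        """One token stream carrying both modalities, as a shared-vocabulary LM sees it."""
        return list(text_ids) + [audio_token_id(u) for u in audio_units]

Speech recognition then becomes audio tokens in, text tokens out, all within the same sequence model.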
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
ParrotTTS is a modularized text-to-speech synthesis model.
It can train a multi-speaker variant effectively using transcripts from a single speaker.
It adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone.
arXiv Detail & Related papers (2023-03-01T17:23:12Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
Promising ASR training results can be obtained with this data augmentation method using only a single real speaker in the target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
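
The recipe reduces to a pipeline: synthesize target-language transcripts with TTS (trained on the single real speaker), widen speaker coverage with cross-lingual voice conversion, and add the pairs to the ASR training set. A schematic sketch in which tts and convert are hypothetical stand-ins for the actual models:

    def augment_asr_data(transcripts, tts, convert, extra_speakers):
        """Build synthetic (waveform, transcript) pairs for ASR training.

        tts(text)           -> waveform in the single real target-language voice
        convert(wave, spkr) -> the same utterance rendered in another voice
        Both callables are placeholders, not a specific library API.
        """
        pairs = []
        for text in transcripts:
            base = tts(text)
            pairs.append((base, text))
            for spk in extra_speakers:      # cross-lingual voice conversion
                pairs.append((convert(base, spk), text))
        return pairs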
- Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training [6.256271702518489]
In cross-lingual speech synthesis, speech in various languages can be synthesized for a monoglot speaker.
This paper studies a multi-task learning framework to improve the cross-lingual speaker similarity.
arXiv Detail & Related papers (2022-01-20T12:02:58Z)
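
Joint training of this kind typically optimizes a weighted sum of the synthesis loss and a speaker-classification loss, pushing the model toward higher cross-lingual speaker similarity. The weighting and the cross-entropy formulation below are generic assumptions, not the paper's exact objective:

    import numpy as np

    def speaker_ce(logits, speaker_id):
        """Numerically stable cross-entropy for the speaker classifier head."""
        z = logits - logits.max()
        log_probs = z - np.log(np.exp(z).sum())
        return -log_probs[speaker_id]

    def joint_loss(synthesis_loss, speaker_logits, speaker_id, alpha=0.1):
        """Multi-task objective: reconstruction plus weighted speaker term."""
        return synthesis_loss + alpha * speaker_ce(speaker_logits, speaker_id)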
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data [11.18504333789534]
We propose to use low-quality code-switched found data from non-target speakers to achieve cross-lingual voice cloning for target speakers.
Experiments show that our proposed method can generate high-quality code-switched speech in the target voices in terms of both naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
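
A common way to integrate the two kinds of speaker representation is to concatenate a frozen pretrained vector (e.g. a d-vector from a speaker-verification model) with a learnable per-speaker embedding before conditioning the acoustic model; whether this paper concatenates or combines them differently is not stated in the summary, so treat this as one plausible reading:

    import numpy as np

    def combined_speaker_embedding(pretrained_vec, learnable_table, speaker_idx):
        """Concatenate a frozen pretrained speaker vector with a trainable row.

        pretrained_vec:  (Dp,) vector from a speaker-verification model (frozen)
        learnable_table: (num_speakers, Dl) embedding matrix updated in training
        """
        return np.concatenate([pretrained_vec, learnable_table[speaker_idx]])

For a few-shot speaker, the pretrained half provides a reasonable starting point while the learnable half adapts as data arrives.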
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
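
The semi-supervised recipe can be read as two stages: learn a discrete representation of speech from untranscribed audio with an encoder-decoder, then train the text side to predict those codes. A schematic for stage one, with encode/quantize/decode as hypothetical learned components:

    import numpy as np

    def pretrain_on_untranscribed(waveforms, encode, quantize, decode):
        """Stage 1: reconstruct audio through a discrete bottleneck; no text needed.

        encode(wave)    -> (T, D) array of frame features
        quantize(feats) -> (T,) array of discrete code ids
        decode(codes)   -> reconstructed waveform (same length as input, assumed)
        All three are placeholders for learned modules.
        """
        losses = []
        for wave in waveforms:
            codes = quantize(encode(wave))
            recon = decode(codes)
            losses.append(float(np.mean((recon - wave) ** 2)))
        return sum(losses) / len(losses)

Stage two then trains text-to-codes on the paired subset, so untranscribed (and even noisy) audio can still improve the shared decoder.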
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.