SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
- URL: http://arxiv.org/abs/2206.12132v1
- Date: Fri, 24 Jun 2022 07:53:05 GMT
- Title: SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
- Authors: Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo
- Abstract summary: SANE-TTS is a stable and natural end-to-end multilingual TTS model.
We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis.
The model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis.
- Score: 0.3277163122167433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present SANE-TTS, a stable and natural end-to-end
multilingual TTS model. Given the difficulty of obtaining a multilingual corpus
for a single speaker, training a multilingual TTS model with monolingual corpora
is unavoidable. We introduce a speaker regularization loss that improves speech
naturalness during cross-lingual synthesis, in addition to domain adversarial
training, which is applied in other multilingual TTS models. Furthermore, with
the speaker regularization loss added, replacing the speaker embedding with a
zero vector in the duration predictor stabilizes cross-lingual inference. With
this replacement, our model generates speech with moderate rhythm regardless of
the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS
achieves naturalness scores above 3.80 in both cross-lingual and intralingual
synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker
similarity close to that of ground truth even in cross-lingual inference. Audio
samples are available on our web page.
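The zero-vector replacement described in the abstract can be sketched as a toy example. Everything below (the linear predictor, function names, and weights) is an illustrative assumption, not the SANE-TTS implementation; it only shows why zeroing the speaker input makes predicted durations, and hence rhythm, independent of the source speaker:

```python
# Toy sketch: a duration predictor that mixes text features with a
# speaker embedding. Zeroing the speaker input (as the abstract
# describes for cross-lingual inference) removes any dependence of
# the predicted durations on the source speaker.
# All names and weights are illustrative, not from SANE-TTS.

def predict_durations(text_hidden, speaker_emb, weights):
    # Hypothetical linear predictor: one non-negative duration per token.
    durations = []
    for h in text_hidden:
        score = sum(hi * wi for hi, wi in zip(h, weights["text"]))
        score += sum(si * wi for si, wi in zip(speaker_emb, weights["spk"]))
        durations.append(max(score, 0.0))
    return durations

def cross_lingual_durations(text_hidden, speaker_emb, weights):
    # The stabilization trick: feed a zero vector instead of the
    # source-speaker embedding.
    zero_vec = [0.0] * len(speaker_emb)
    return predict_durations(text_hidden, zero_vec, weights)

weights = {"text": [1.0, 0.5], "spk": [2.0, -1.0]}
text_hidden = [[1.0, 1.0], [0.5, 2.0]]
# Two different source speakers now yield identical rhythm:
d_a = cross_lingual_durations(text_hidden, [3.0, 4.0], weights)
d_b = cross_lingual_durations(text_hidden, [-1.0, 9.0], weights)
print(d_a == d_b)  # True
```

In a real model the speaker embedding would still condition the acoustic decoder, so timbre is preserved while duration prediction becomes speaker-agnostic.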
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain speaker timbre.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
ParrotTTS is a modularized text-to-speech synthesis model.
It can train a multi-speaker variant effectively using transcripts from a single speaker.
It adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone.
arXiv Detail & Related papers (2023-03-01T17:23:12Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
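The random masking step mentioned above can be sketched roughly as follows. The function name, placeholder mask values, and mask ratio are assumptions for illustration, not details taken from ERNIE-SAT:

```python
import random

# Toy sketch of joint masking: independently mask spectrogram frames
# and phoneme tokens before joint speech-text pretraining.
# Placeholder values and the mask ratio are illustrative only.
MASK_FRAME = None          # stand-in for a masked spectrogram frame
MASK_PHONEME = "<mask>"    # stand-in for a masked phoneme token

def mask_inputs(spectrogram, phonemes, ratio=0.15, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    masked_spec = [f if rng.random() > ratio else MASK_FRAME
                   for f in spectrogram]
    masked_ph = [p if rng.random() > ratio else MASK_PHONEME
                 for p in phonemes]
    return masked_spec, masked_ph
```

The pretraining objective would then train the model to reconstruct the masked frames and phonemes from the unmasked context.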
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity, because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- Improving Cross-lingual Speech Synthesis with Triplet Training Scheme [5.470211567548067]
A triplet training scheme is proposed to enhance cross-lingual pronunciation.
The proposed method brings significant improvement in both intelligibility and naturalness of the synthesized cross-lingual speech.
arXiv Detail & Related papers (2022-02-22T08:40:43Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.