Mix and Match: An Empirical Study on Training Corpus Composition for
Polyglot Text-To-Speech (TTS)
- URL: http://arxiv.org/abs/2207.01507v1
- Date: Mon, 4 Jul 2022 15:23:06 GMT
- Authors: Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko
Yanagisawa
- Abstract summary: Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way for building voice cloning based Polyglot NTTS systems.
It is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis.
- Score: 3.57486761615991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training multilingual Neural Text-To-Speech (NTTS) models using only
monolingual corpora has emerged as a popular way of building voice-cloning-based
Polyglot NTTS systems. In order to train these models, it is essential to
understand how the composition of the training corpora affects the quality of
multilingual speech synthesis. In this context, it is common to hear questions
such as "Would including more Spanish data help my Italian synthesis, given the
closeness of both languages?". Unfortunately, we found existing literature on
the topic lacking in completeness in this regard. In the present work, we
conduct an extensive ablation study aimed at understanding how various factors
of the training corpora, such as language family affiliation, gender
composition, and the number of speakers, contribute to the quality of Polyglot
synthesis. Our findings include the observation that female speaker data are
preferred in most scenarios, and that it is not always beneficial to have more
speakers from the target language variant in the training corpus. The findings
herein are informative for the process of data procurement and corpora
building.
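The ablation described in the abstract sweeps three corpus-composition factors: language-family affiliation of the supporting data, gender composition, and number of speakers. A minimal sketch of such an ablation grid is shown below; the factor names and values are illustrative placeholders, not the paper's actual experimental conditions:

```python
from itertools import product

# Hypothetical values for the three corpus-composition factors studied in the
# paper. These are placeholders for illustration, not the paper's conditions.
LANGUAGE_FAMILY = ["same_family", "different_family"]  # e.g. Romance vs. Germanic support data
GENDER_MIX = ["female_only", "male_only", "balanced"]
NUM_SPEAKERS = [5, 10, 20]

def build_ablation_grid():
    """Enumerate every corpus composition; one Polyglot TTS model is trained per cell."""
    return [
        {"family": f, "gender": g, "speakers": n}
        for f, g, n in product(LANGUAGE_FAMILY, GENDER_MIX, NUM_SPEAKERS)
    ]

grid = build_ablation_grid()
print(len(grid))  # 2 * 3 * 3 = 18 configurations
```

Each cell of the grid would correspond to one training corpus and one trained model, with synthesis quality compared across cells to isolate each factor's contribution.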
Related papers
- CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation [25.82932373649325]
CrossSpeech++ is a method to disentangle language and speaker information.
It significantly improves the quality of cross-lingual speech synthesis.
We conduct extensive experiments using various metrics and demonstrate that CrossSpeech++ achieves significant improvements.
arXiv Detail & Related papers (2024-12-28T06:32:49Z)
- A multilingual training strategy for low resource Text to Speech [5.109810774427171]
We investigate whether data from social media can be used to construct a small TTS dataset, and whether cross-lingual transfer learning can work with this type of data.
To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low-resource language.
Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
arXiv Detail & Related papers (2024-09-02T12:53:01Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the phonetic similarity between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Multilingual Multiaccented Multispeaker TTS with RADTTS [21.234787964238645]
We present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS.
We demonstrate the ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 accents.
arXiv Detail & Related papers (2023-01-24T22:39:04Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework in which we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target-speaker data, we obtain high speaker-similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron 2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.