Low-Resource Multilingual and Zero-Shot Multispeaker TTS
- URL: http://arxiv.org/abs/2210.12223v1
- Date: Fri, 21 Oct 2022 20:03:37 GMT
- Title: Low-Resource Multilingual and Zero-Shot Multispeaker TTS
- Authors: Florian Lux, Julia Koch, Ngoc Thang Vu
- Abstract summary: We show that it is possible for a system to learn to speak a new language using just 5 minutes of training data.
We demonstrate the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker.
- Score: 25.707717591185386
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, collecting the amount of data those approaches need is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language-agnostic meta learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We demonstrate the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker using objective metrics as well as human studies, and we release our code and trained models as open source.
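To make the adaptation recipe concrete, below is a minimal sketch of the fine-tuning idea, assuming a PyTorch-style setup: start from a multilingual checkpoint, freeze the speaker-conditioning path so zero-shot cloning is retained, and update only the text-facing modules on a few minutes of target-language data. The toy model and all names are hypothetical illustrations, not the authors' released implementation.

```python
import torch
from torch import nn

class ToyTTS(nn.Module):
    """Hypothetical stand-in for a multilingual multi-speaker TTS model."""
    def __init__(self, n_phones=128, d=256, n_mels=80):
        super().__init__()
        self.phone_embedding = nn.Embedding(n_phones, d)  # text-facing side
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.speaker_proj = nn.Linear(d, d)               # speaker-conditioning path
        self.decoder = nn.Linear(d, n_mels)               # predicts mel frames

    def forward(self, phones, speaker_emb):
        h, _ = self.encoder(self.phone_embedding(phones))
        return self.decoder(h + self.speaker_proj(speaker_emb).unsqueeze(1))

model = ToyTTS()
# model.load_state_dict(torch.load("multilingual_laml_checkpoint.pt"))  # hypothetical path

# Freeze everything, then unfreeze only the text-facing modules: the new
# language must be learned, but the speaker side should stay untouched.
for p in model.parameters():
    p.requires_grad = False
for module in (model.phone_embedding, model.encoder):
    for p in module.parameters():
        p.requires_grad = True

opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
for step in range(200):  # ~5 minutes of audio yields only a handful of batches
    phones = torch.randint(0, 128, (4, 50))   # stand-in target-language phoneme ids
    speaker_emb = torch.randn(4, 256)         # stand-in reference-speaker embeddings
    mels = torch.randn(4, 50, 80)             # stand-in ground-truth spectrograms
    loss = nn.functional.l1_loss(model(phones, speaker_emb), mels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Freezing everything but the encoder side mirrors the intuition that only the mapping from the new language's phonemes needs to adapt, while the speaker-modeling capacity learned during multilingual pretraining stays intact.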
Related papers
- A multilingual training strategy for low resource Text to Speech [5.109810774427171]
We investigate whether data from social media can be used to construct a small TTS dataset, and whether cross-lingual transfer learning can work with this type of data.
To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low-resource language.
Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
arXiv Detail & Related papers (2024-09-02T12:53:01Z)
- Meta Learning Text-to-Speech Synthesis in over 7000 Languages [29.17020696379219]
In this work, we take on the challenging task of building a single text-to-speech synthesis system capable of generating speech in over 7000 languages.
By leveraging a novel integration of massively multilingual pretraining and meta learning, our approach enables zero-shot speech synthesis in languages without any available data.
We aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
arXiv Detail & Related papers (2024-06-10T15:56:52Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
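Since the headline number here is a character error rate, a minimal sketch of the metric itself may help: transcribe the synthesized audio with an ASR system, then compute the character-level Levenshtein distance to the reference text, normalized by reference length. The function below is a generic implementation, not the paper's evaluation code.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: normalized character-level edit distance."""
    r, h = list(reference), list(hypothesis)
    dist = list(range(len(h) + 1))  # row 0 of the Levenshtein table
    for i, rc in enumerate(r, start=1):
        prev, dist[0] = dist[0], i
        for j, hc in enumerate(h, start=1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,         # deletion
                dist[j - 1] + 1,     # insertion
                prev + (rc != hc),   # substitution (free on a match)
            )
    return dist[len(h)] / max(len(r), 1)

print(cer("ein kleiner test", "ein kleiner tost"))  # 0.0625, i.e. 6.25% CER
```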
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
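A minimal sketch of this joint masking idea, under assumed shapes and reconstruction heads (the actual model differs): mask random phoneme positions and spectrogram frames, encode both streams together, and reconstruct only the masked parts.

```python
import torch
from torch import nn

d, n_phones, n_mels = 256, 128, 80
phone_emb = nn.Embedding(n_phones + 1, d)  # index n_phones acts as [MASK]
mel_proj = nn.Linear(n_mels, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
phone_head = nn.Linear(d, n_phones)  # reconstructs masked phonemes
mel_head = nn.Linear(d, n_mels)      # reconstructs masked spectrogram frames

phones = torch.randint(0, n_phones, (2, 30))  # stand-in paired phonemes...
mels = torch.randn(2, 120, n_mels)            # ...and mel spectrograms

# Randomly mask 15% of phoneme positions and 15% of spectrogram frames.
phone_mask = torch.rand(phones.shape) < 0.15
mel_mask = torch.rand(mels.shape[:2]) < 0.15
masked_phones = phones.masked_fill(phone_mask, n_phones)
masked_mels = mels.masked_fill(mel_mask.unsqueeze(-1), 0.0)

# Encode the two streams jointly, then split the output back apart.
h = encoder(torch.cat([phone_emb(masked_phones), mel_proj(masked_mels)], dim=1))
h_phone, h_mel = h[:, :30], h[:, 30:]

loss = (nn.functional.cross_entropy(phone_head(h_phone[phone_mask]), phones[phone_mask])
        + nn.functional.l1_loss(mel_head(h_mel[mel_mask]), mels[mel_mask]))
loss.backward()
```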
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems while using only one target-language speaker during model training.
Even with this single real speaker, the data augmentation method yields promising ASR training results.
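A minimal sketch of that augmentation recipe, with hypothetical stub functions standing in for the TTS and voice-conversion models: synthesize the text corpus in many donor voices, convert the single real speaker's recordings into those voices, then pool everything for ASR training.

```python
import random

def tts_synthesize(text: str, speaker_id: int) -> list[float]:
    """Hypothetical stand-in for a cross-lingual multi-speaker TTS model."""
    return [random.random() for _ in range(16000)]  # one second of fake audio

def voice_convert(audio: list[float], speaker_id: int) -> list[float]:
    """Hypothetical stand-in for a cross-lingual voice-conversion model."""
    return list(audio)  # identity placeholder

real_data = [("guten tag", [0.0] * 16000)]   # the single real target-language speaker
texts = ["wie geht es dir", "bis morgen"]    # target-language text corpus
donor_speakers = range(10)                   # voices borrowed from other languages

augmented = []
for text in texts:
    for spk in donor_speakers:               # TTS branch: new text, many voices
        augmented.append((text, tts_synthesize(text, spk)))
for text, audio in real_data:
    for spk in donor_speakers:               # VC branch: real audio, new voices
        augmented.append((text, voice_convert(audio, spk)))

asr_training_set = real_data + augmented
print(f"{len(real_data)} real + {len(augmented)} synthetic utterances")
```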
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone [0.7927630381442314]
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
We achieve state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset.
It is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality.
arXiv Detail & Related papers (2021-12-04T19:50:29Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks for some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
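As a minimal sketch of the selection step, assuming each language is summarized by a mean-pooled acoustic embedding (the vectors below are random stand-ins and the language codes are only examples), transfer candidates can be ranked by cosine similarity to the target:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-language embeddings, e.g. mean-pooled features from a
# pretrained acoustic model run over each language's speech corpus.
lang_embeddings = {lang: rng.normal(size=256)
                   for lang in ["deu", "nld", "swe", "vie", "tam", "yor"]}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = "nld"
ranking = sorted(((cosine(lang_embeddings[target], emb), lang)
                  for lang, emb in lang_embeddings.items() if lang != target),
                 reverse=True)
for score, lang in ranking:
    print(f"{lang}: {score:+.3f}")  # most acoustically similar languages first
```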
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the cross-lingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)