Rapid Speaker Adaptation in Low Resource Text to Speech Systems using
Synthetic Data and Transfer learning
- URL: http://arxiv.org/abs/2312.01107v1
- Date: Sat, 2 Dec 2023 10:52:00 GMT
- Title: Rapid Speaker Adaptation in Low Resource Text to Speech Systems using
Synthetic Data and Transfer learning
- Authors: Raviraj Joshi, Nikesh Garera
- Abstract summary: We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language Hindi.
- Score: 6.544954579068865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-speech (TTS) systems are being built using end-to-end deep learning
approaches. However, these systems require huge amounts of training data. We
present our approach to build production-quality TTS and perform speaker
adaptation in extremely low-resource settings. We propose a transfer learning
approach using high-resource language data and synthetically generated data. We
transfer knowledge from the out-of-domain, high-resource English language.
Further, we make use of out-of-the-box single-speaker TTS in the target
language to generate in-domain synthetic data. We employ a three-step approach
to train a high-quality single-speaker TTS system in a low-resource Indian
language, Hindi. We use a Tacotron2-like setup with a spectrogram prediction
network and a WaveGlow vocoder. The Tacotron2 acoustic model is first trained on
English data, followed by synthetic Hindi data from the existing TTS system.
Finally, the decoder of this model is fine-tuned on only 3 hours of target
Hindi speaker data to enable rapid speaker adaptation. We show the importance
of this dual pre-training and decoder-only fine-tuning using subjective MOS
evaluation. Using transfer learning from a high-resource language and a
synthetic corpus, we present a low-cost solution for training a custom TTS model.
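The three-step schedule described in the abstract (English pre-training, synthetic-Hindi pre-training, then decoder-only fine-tuning on ~3 hours of target speech) can be sketched as follows. This is an illustrative sketch, not the authors' code: the stage names and parameter names (`encoder.*`, `decoder.*`, `postnet.*`) are hypothetical placeholders.

```python
# Sketch of the paper's three-step training schedule:
# (1) pre-train on out-of-domain English, (2) continue on in-domain
# synthetic Hindi from an existing TTS, (3) fine-tune only the decoder
# on a few hours of real target-speaker data.
# All names below are hypothetical placeholders, not the authors' code.

def trainable_params(all_params, stage):
    """Return the subset of parameter names updated in a given stage."""
    if stage in ("pretrain_english", "pretrain_synthetic_hindi"):
        return set(all_params)  # full model is trained in both pre-training stages
    if stage == "finetune_target_speaker":
        # Freeze the text encoder; adapt only decoder-side parameters,
        # enabling rapid speaker adaptation from limited data.
        return {p for p in all_params
                if p.startswith(("decoder.", "postnet."))}
    raise ValueError(f"unknown stage: {stage}")

PARAMS = ["encoder.embedding", "encoder.lstm",
          "decoder.attention", "decoder.lstm", "postnet.conv"]

SCHEDULE = ["pretrain_english",          # out-of-domain, high-resource
            "pretrain_synthetic_hindi",  # in-domain, synthetic
            "finetune_target_speaker"]   # ~3 h of real target data

for stage in SCHEDULE:
    print(stage, "->", sorted(trainable_params(PARAMS, stage)))
```

In a real PyTorch implementation the same effect is typically achieved by setting `requires_grad = False` on the frozen encoder parameters before the final stage.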
Related papers
- Code-Mixed Text to Speech Synthesis under Low-Resource Constraints [6.544954579068865]
We describe our approaches for production quality code-mixed Hindi-English TTS systems built for e-commerce applications.
We propose a data-oriented approach by utilizing monolingual data sets in individual languages.
We show that such single script bi-lingual training without any code-mixing works well for pure code-mixed test sets.
arXiv Detail & Related papers (2023-12-02T10:40:38Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages.
We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages [19.44975351652865]
We propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages.
We leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS augmentation, and SSL respectively.
Overall, our two-pass speech recognition system with a Monotonic Chunkwise Attention (MoA) in the first pass achieves a WER reduction of 42% relative to the baseline.
arXiv Detail & Related papers (2021-11-19T05:09:16Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement [1.7704011486040843]
We show that one can transfer an existing TTS model for new speakers from the same or a different language using only 20 minutes of data.
We first introduce a base multi-lingual Tacotron with language-agnostic input, then demonstrate how transfer learning is done for different scenarios of speaker adaptation.
arXiv Detail & Related papers (2020-11-12T14:05:34Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) synthesis.
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.