LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition
- URL: http://arxiv.org/abs/2008.03687v1
- Date: Sun, 9 Aug 2020 08:16:33 GMT
- Title: LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition
- Authors: Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu
- Abstract summary: We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
- Score: 148.43282526983637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech synthesis (text to speech, TTS) and recognition (automatic speech
recognition, ASR) are important speech tasks, and require a large amount of
text and speech pairs for model training. However, there are more than 6,000
languages in the world, and most of them lack speech training data,
which poses significant challenges when building TTS and ASR systems for
extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and
ASR system under the extremely low-resource setting, which can support rare
languages with low data cost. LRSpeech consists of three key techniques: 1)
pre-training on rich-resource languages and fine-tuning on low-resource
languages; 2) dual transformation between TTS and ASR to iteratively boost the
accuracy of each other; 3) knowledge distillation to customize the TTS model on
a high-quality target-speaker voice and improve the ASR model on multiple
voices. We conduct experiments on an experimental language (English) and a
truly low-resource language (Lithuanian) to verify the effectiveness of
LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for
TTS in terms of both intelligibility (more than 98% intelligibility rate) and
naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech,
which satisfy the requirements for industrial deployment, 2) achieves promising
recognition accuracy for ASR, and 3) last but not least, requires only an
extremely small amount of training data. We also conduct comprehensive
analyses on LRSpeech
with different amounts of data resources, and provide valuable insights and
guidance for industrial deployment. We are currently deploying LRSpeech into a
commercialized cloud speech service to support TTS for more rare languages.
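To make the dual-transformation step (technique 2 above) concrete, here is a minimal Python sketch of the loop: ASR pseudo-labels unpaired speech to create training pairs for TTS, and TTS synthesizes unpaired text to create training pairs for ASR. The TTSModel/ASRModel stubs and their method names are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of LRSpeech-style dual transformation. Model internals are
# stubbed out; only the data flow of the iterative loop is shown.
from typing import List, Tuple

class TTSModel:
    def synthesize(self, text: str) -> List[float]:
        return [0.0] * len(text)  # placeholder waveform

    def train_step(self, pairs: List[Tuple[str, List[float]]]) -> None:
        pass  # gradient update on (text, speech) pairs would go here

class ASRModel:
    def transcribe(self, speech: List[float]) -> str:
        return "placeholder transcript"

    def train_step(self, pairs: List[Tuple[List[float], str]]) -> None:
        pass  # gradient update on (speech, text) pairs would go here

def dual_transformation(tts: TTSModel, asr: ASRModel,
                        unpaired_text: List[str],
                        unpaired_speech: List[List[float]],
                        rounds: int = 3) -> None:
    for _ in range(rounds):
        # ASR pseudo-labels unpaired speech -> synthetic pairs for TTS.
        tts.train_step([(asr.transcribe(s), s) for s in unpaired_speech])
        # TTS synthesizes unpaired text -> synthetic pairs for ASR.
        asr.train_step([(tts.synthesize(t), t) for t in unpaired_text])
```

Each round should tighten the loop: better transcripts produce better TTS training pairs, and better synthesized speech produces better ASR training pairs.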
Related papers
- ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams [16.172599163455693]
First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language.
Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models.
Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-23T14:18:25Z)
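The third step above is, in essence, knowledge distillation over synthetic inputs: a large teacher labels the synthetic data and a smaller student is trained on those labels. A minimal sketch, with hypothetical teacher/student interfaces rather than the paper's code:

```python
# Minimal sketch of knowledge distillation on synthetic data.
from typing import Callable, List

def distill_on_synthetic(teacher: Callable[[str], List[float]],
                         student_step: Callable[[str, List[float]], None],
                         synthetic_texts: List[str]) -> None:
    for text in synthetic_texts:
        target = teacher(text)       # teacher output becomes the target
        student_step(text, target)   # student is fit to the teacher

# Example with trivial stand-ins for the two models:
distill_on_synthetic(lambda t: [float(len(t))], lambda t, y: None, ["a", "bc"])
```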
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures that focus on learnable enhancement of pre-trained features (see the sketch after this entry).
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
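Model reprogramming keeps the pre-trained network frozen and trains only a small input-side transformation. The toy numpy sketch below illustrates the idea; the frozen linear "model" and the offset-recoverable targets are illustrative assumptions, not the paper's setup:

```python
# Toy sketch of input reprogramming: only the input offset `theta` is trained
# while the "pre-trained" weights W stay frozen throughout.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # frozen pre-trained weights
X = rng.normal(size=(32, 8))         # inputs from the new language/domain
b_true = rng.normal(size=(1, 8))
Y = (X + b_true) @ W                 # targets reachable via an input offset

theta = np.zeros((1, 8))             # the only trainable parameter
lr = 1e-2
for _ in range(500):
    err = (X + theta) @ W - Y        # forward pass through the frozen model
    theta -= lr * 2 * (err @ W.T).mean(axis=0, keepdims=True)

print(float((err ** 2).mean()))      # loss shrinks as theta approaches b_true
```

Only `theta` receives updates, which is what makes the approach parameter-efficient.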
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- Low-Resource Multilingual and Zero-Shot Multispeaker TTS [25.707717591185386]
We show that it is possible for a system to learn to speak a new language using just 5 minutes of training data.
We show the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker.
arXiv Detail & Related papers (2022-10-21T20:03:37Z)
- When Is TTS Augmentation Through a Pivot Language Useful? [26.084140117526488]
We propose to produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language.
Using several thousand synthetic TTS text-speech pairs, while duplicating the authentic data to balance the two sources, yields optimal results (see the sketch after this entry).
Applying these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) for two low-resource languages, respectively.
arXiv Detail & Related papers (2022-07-20T13:33:41Z)
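Concretely, the recipe above synthesizes target-language text with a TTS system trained on a higher-resource pivot language, then replicates the authentic pairs until the two sources are balanced. A minimal sketch, where `pivot_tts` is a hypothetical stand-in for such a trained system:

```python
# Minimal sketch of pivot-language TTS augmentation for ASR training data.
from typing import Callable, List, Tuple

def build_training_set(target_texts: List[str],
                       authentic: List[Tuple[List[float], str]],
                       pivot_tts: Callable[[str], List[float]],
                       ) -> List[Tuple[List[float], str]]:
    # Synthetic pairs: target-language text through the pivot-language TTS.
    synthetic = [(pivot_tts(t), t) for t in target_texts]
    # Duplicate authentic pairs to roughly balance the synthetic ones.
    copies = max(1, len(synthetic) // max(1, len(authentic)))
    return synthetic + authentic * copies
```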
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low-resource scenarios (see the sketch after this entry).
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
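The study above classifies intents directly from pooled embeddings of a pre-trained acoustic model, with no language-specific ASR in the loop. A toy numpy sketch of that pipeline follows; the random arrays stand in for real acoustic-model embeddings:

```python
# Toy sketch: intent classification from mean-pooled acoustic embeddings
# using nearest class centroids. Random data stands in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)

def pool(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level embeddings (T x D) into one utterance vector."""
    return frames.mean(axis=0)

# Toy "pre-trained" embeddings: 3 intents x 5 utterances of shape (20, 16).
train = {i: [rng.normal(loc=i, size=(20, 16)) for _ in range(5)]
         for i in range(3)}
centroids = {i: np.mean([pool(u) for u in utts], axis=0)
             for i, utts in train.items()}

def classify(frames: np.ndarray) -> int:
    v = pool(frames)
    return min(centroids, key=lambda i: np.linalg.norm(v - centroids[i]))

print(classify(rng.normal(loc=2, size=(20, 16))))  # expected intent: 2
```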
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique (see the sketch after this entry).
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
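BPE-dropout regularizes subword segmentation by randomly skipping merges at encoding time, so the model sees several segmentations of the same word during training. A minimal sketch with an illustrative four-rule merge table (not a trained vocabulary):

```python
# Minimal sketch of BPE-dropout: each applicable merge is skipped with
# probability p, yielding varied segmentations of the same word.
import random
from typing import List, Tuple

MERGES: List[Tuple[str, str]] = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_dropout(word: str, p: float = 0.1) -> List[str]:
    pieces = list(word)
    for a, b in MERGES:                    # apply merges in learned order
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == a and pieces[i + 1] == b and random.random() >= p:
                pieces[i:i + 2] = [a + b]  # apply the merge
            else:
                i += 1                     # skip (possibly a dropped merge)
    return pieces

random.seed(0)
print(bpe_dropout("lower", p=0.0))  # deterministic BPE: ['lower']
print(bpe_dropout("lower", p=0.5))  # a sampled, finer segmentation
```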
- Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio [8.510792628268824]
Bootstrapping speech recognition on limited data resources has long been an area of active research.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, post-ASR text-to-text mapping, and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
- Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language.
We show that training ST with human translations is not necessary.
Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to an 8.9% WER reduction over direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.