Related papers: Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio

URL: http://arxiv.org/abs/2011.12696v2
Date: Fri, 18 Jun 2021 12:12:44 GMT
Title: Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio
Authors: Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett
Abstract summary: Bootstrapping speech recognition on limited data resources has long been an area of active research. We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime. Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements.
Score: 8.510792628268824
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bootstrapping speech recognition on limited data resources has long been an area of active research. The recent transition to all-neural models and end-to-end (E2E) training brought along particular challenges, as these models are known to be data hungry, but also came with opportunities around language-agnostic representations derived from multilingual data as well as shared word-piece output representations across languages that share script and roots. We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low resource regime, while exploiting the abundant resources available in other languages as well as the synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, of which a 25% relative WER improvement is attributed to the post-ASR text-to-text mappings and the TTS synthetic data.

Related papers

Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization [13.222167833914924]
We propose a framework to adapt an autoregressive, multilingual TTS model to new languages. We fine-tune this model on limited paired data of the new languages to capture the target language's prosodic features. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages.
arXiv Detail & Related papers (2025-09-26T00:28:50Z)
Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
A multilingual training strategy for low resource Text to Speech [5.109810774427171]
We investigate whether data from social media can be used for a small TTS dataset construction, and whether cross lingual transfer learning can work with this type of data. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
arXiv Detail & Related papers (2024-09-02T12:53:01Z)
Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition [31.575930914290762]
Exploiting cross-lingual resources is an effective way to compensate for data scarcity of low resource languages. We extend the concept of learnable cross-lingual mappings for end-to-end speech recognition. The results show that any source language ASR model can be used for a low-resource target language recognition.
arXiv Detail & Related papers (2023-06-14T15:24:31Z)
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
When Is TTS Augmentation Through a Pivot Language Useful? [26.084140117526488]
We propose to produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language. Using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results. Application of these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) respectively for two low-resource languages.
arXiv Detail & Related papers (2022-07-20T13:33:41Z)
Adaptive Activation Network For Low Resource Multilingual Speech Recognition [30.460501537763736]
We introduce an adaptive activation network to the upper layers of the ASR model. We also propose two approaches to train the model: (1) cross-lingual learning, replacing the activation function from source language to target language, and (2) multilingual learning. Our experiments on IARPA Babel datasets demonstrate that our approaches outperform from-scratch training and traditional bottleneck-feature-based methods.
arXiv Detail & Related papers (2022-05-28T04:02:59Z)
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide. Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages. We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting. Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.