Improving Low Resource Code-switched ASR using Augmented Code-switched
TTS
- URL: http://arxiv.org/abs/2010.05549v1
- Date: Mon, 12 Oct 2020 09:15:12 GMT
- Title: Improving Low Resource Code-switched ASR using Augmented Code-switched
TTS
- Authors: Yash Sharma, Basil Abraham, Karan Taneja, Preethi Jyothi
- Abstract summary: Building Automatic Speech Recognition systems for code-switched speech has recently gained renewed attention.
End-to-end systems require large amounts of labeled speech.
We report significant improvements in ASR performance achieving absolute word error rate (WER) reductions of up to 5%.
- Score: 29.30430160611224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building Automatic Speech Recognition (ASR) systems for code-switched speech
has recently gained renewed attention due to the widespread use of speech
technologies in multilingual communities worldwide. End-to-end ASR systems are
a natural modeling choice due to their ease of use and superior performance in
monolingual settings. However, it is well known that end-to-end systems require
large amounts of labeled speech. In this work, we investigate improving
code-switched ASR in low resource settings via data augmentation using
code-switched text-to-speech (TTS) synthesis. We propose two targeted
techniques to effectively leverage TTS speech samples: 1) Mixup, an existing
technique to create new training samples via linear interpolation of existing
samples, applied to TTS and real speech samples, and 2) a new loss function,
used in conjunction with TTS samples, to encourage code-switched predictions.
We report significant improvements in ASR performance achieving absolute word
error rate (WER) reductions of up to 5%, and measurable improvement in code
switching using our proposed techniques on a Hindi-English code-switched ASR
task.
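The Mixup technique mentioned in the abstract can be illustrated with a short, hypothetical sketch: feature matrices of a real utterance and a TTS utterance are linearly interpolated with a weight drawn from a Beta distribution. The function and variable names below are illustrative, not from the paper, and the paper's handling of the corresponding label sequences and its code-switching loss are not reproduced here.

```python
import numpy as np

def mixup_features(real_feats, tts_feats, alpha=0.2):
    """Linearly interpolate a real and a TTS utterance at the feature level.

    Both inputs are assumed to be log-mel feature matrices of shape (T, F),
    already padded or truncated to the same number of frames T. Returns the
    mixed features and the interpolation weight lam. This is a hypothetical
    sketch of Mixup, not the paper's exact recipe.
    """
    lam = np.random.beta(alpha, alpha)                # mixing weight ~ Beta(alpha, alpha)
    mixed = lam * real_feats + (1.0 - lam) * tts_feats
    return mixed, lam

# Toy usage: 100 frames of 80-dimensional log-mel features.
real = np.random.randn(100, 80).astype(np.float32)   # stand-in for a real utterance
tts = np.random.randn(100, 80).astype(np.float32)    # stand-in for a TTS utterance
mixed, lam = mixup_features(real, tts)
print(mixed.shape, round(float(lam), 3))
```

In practice the mixed features would be fed to the end-to-end ASR model, with the training loss combining the two source transcripts according to the same weight; that part depends on the model and loss and is omitted from this sketch.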
Related papers
- On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the context of synthetic data generation to show their impact on CTC-based speech recognition training.
For data generation, auto-regressive decoding performs better than non-autoregressive decoding, and we propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z) - Making More of Little Data: Improving Low-Resource Automatic Speech
Recognition Using Data Augmentation [20.45373308116162]
This study focuses on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal).
For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system.
We find that using a self-training approach consistently yields improved performance (a relative WER reduction of up to 20.5% compared to using an ASR system trained on 24 minutes of human-transcribed data).
arXiv Detail & Related papers (2023-05-18T13:20:38Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture used here is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - Arabic Code-Switching Speech Recognition using Monolingual Data [13.513655231184261]
Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization.
Recent research in multilingual ASR shows potential improvement over monolingual systems.
We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments.
arXiv Detail & Related papers (2021-07-04T08:40:49Z) - Bootstrap an end-to-end ASR system by multilingual training, transfer
learning, text-to-text mapping and synthetic audio [8.510792628268824]
Bootstrapping speech recognition on limited data resources has long been an area of active research.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z) - Data Augmentation for End-to-end Code-switching Speech Recognition [54.0507000473827]
Three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or word insertion.
Experiments on a 200-hour Mandarin-English code-switching dataset show significant improvements in code-switching ASR from each approach individually.
arXiv Detail & Related papers (2020-11-04T07:12:44Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.