Improving low-resource ASR performance with untranscribed out-of-domain data
- URL: http://arxiv.org/abs/2106.01227v1
- Date: Wed, 2 Jun 2021 15:23:34 GMT
- Title: Improving low-resource ASR performance with untranscribed out-of-domain data
- Authors: Jayadev Billa
- Abstract summary: Semi-supervised training (SST) is a common approach to leverage untranscribed/unlabeled speech data.
We look to improve performance on conversational/telephony speech (target domain) using web resources.
- Score: 8.376091455761259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semi-supervised training (SST) is a common approach to leverage
untranscribed/unlabeled speech data to improve automatic speech recognition
performance in low-resource languages. However, if the available unlabeled
speech is mismatched to the target domain, SST is not as effective, and in many
cases performs worse than the original system. In this paper, we address the
issue of low-resource ASR when only untranscribed out-of-domain speech data is
readily available in the target language. Specifically, we look to improve
performance on conversational/telephony speech (target domain) using web
resources, in particular YouTube data, which more closely resembles
news/topical broadcast data. Leveraging SST, we show that while in some cases
simply pooling the out-of-domain data with the training data lowers word error
rate (WER), in all cases, we see improvements if we train first with the
out-of-domain data and then fine-tune the resulting model with the original
training data. Using 2000 hours of speed-perturbed YouTube audio in each target
language, with semi-supervised transcripts, we show improvements on multiple
languages/data sets, of up to 16.3% relative improvement in WER over the
baseline systems and up to 7.4% relative improvement in WER over a system that
simply pools the out-of-domain data with the training data.
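A minimal sketch of that two-stage recipe in Python, assuming hypothetical helpers (decode, speed_perturb, train) rather than the paper's actual tooling:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    audio_path: str
    transcript: str

# --- Hypothetical stubs standing in for real ASR tooling (e.g., a Kaldi recipe) ---

def decode(model: str, audio_paths: List[str]) -> List[Utterance]:
    """Decode untranscribed audio with a seed model to get semi-supervised transcripts."""
    return [Utterance(p, f"<hypothesis for {p}>") for p in audio_paths]

def speed_perturb(utts: List[Utterance], factors=(0.9, 1.0, 1.1)) -> List[Utterance]:
    """Standard 3-way speed perturbation of the audio."""
    return [Utterance(f"{u.audio_path}@x{f}", u.transcript) for u in utts for f in factors]

def train(utts: List[Utterance], init: Optional[str] = None) -> str:
    """Stand-in for acoustic model training; returns a model identifier."""
    return f"model(init={init}, n_utts={len(utts)})"

# Limited transcribed in-domain (conversational/telephony) data, plus untranscribed
# out-of-domain web audio (YouTube, closer to broadcast speech).
in_domain = [Utterance("tel_0001.wav", "hello how are you")]
youtube_paths = ["yt_0001.wav", "yt_0002.wav"]

seed = train(in_domain)                           # 1. seed system on in-domain data
ood = speed_perturb(decode(seed, youtube_paths))  # 2. pseudo-label + perturb OOD audio

pooled = train(in_domain + ood)                   # baseline: simply pool everything
staged = train(in_domain, init=train(ood))        # 3. OOD first, then fine-tune
```

The key empirical point is in the last two lines: pooling sometimes helps, but training on the out-of-domain data first and then fine-tuning on the in-domain data improved WER in all cases reported.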
Related papers
- Replay to Remember: Continual Layer-Specific Fine-tuning for German
Speech Recognition [19.635428830237842]
We study how well the performance of large-scale ASR models can be approximated for smaller domains.
We apply Experience Replay for continual learning to increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain.
arXiv Detail & Related papers (2023-07-14T11:20:22Z)
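As a rough sketch of the experience-replay idea in the entry above: during fine-tuning on the new domain, each batch mixes in samples replayed from the original training data, so the model retains vocabulary and speakers outside the fine-tuned domain. The batch size and 20% replay fraction are illustrative assumptions, not values from the paper.

```python
import random
from typing import Iterator, List, Tuple

Sample = Tuple[str, str]  # (audio_path, transcript); stand-in for real features

def replay_batches(new_domain: List[Sample], replay_buffer: List[Sample],
                   batch_size: int = 8, replay_frac: float = 0.2) -> Iterator[List[Sample]]:
    """Yield fine-tuning batches in which a fraction of samples is replayed
    from the original training data (illustrative ratio, not the paper's)."""
    n_replay = max(1, int(batch_size * replay_frac))
    n_new = batch_size - n_replay
    data = list(new_domain)
    random.shuffle(data)
    for i in range(0, len(data), n_new):
        batch = data[i:i + n_new] + random.sample(replay_buffer, n_replay)
        random.shuffle(batch)
        yield batch

# Toy usage: adapt to a new German domain while replaying original samples.
new = [(f"de_new_{i:03d}.wav", "...") for i in range(32)]
old = [(f"de_orig_{i:03d}.wav", "...") for i in range(100)]
for batch in replay_batches(new, old):
    pass  # one fine-tuning step of the ASR model per batch
```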
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one to any in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
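The synthetically generated disfluent data that the entry above depends on can be illustrated with simple rules. A minimal sketch, assuming filler insertion and word repetition as the disfluency types and D/F token tags; the paper's actual generation procedure and adversarial training loop are not reproduced here:

```python
import random
from typing import List, Tuple

FILLERS = ["um", "uh", "you know"]  # illustrative filler inventory

def make_disfluent(tokens: List[str], p: float = 0.3) -> Tuple[List[str], List[str]]:
    """Inject fillers and repetitions into a fluent sentence and emit per-token
    tags (D = disfluent, F = fluent) to train a sequence-tagging DC model."""
    out, tags = [], []
    for tok in tokens:
        if random.random() < p:          # insert a filler before the word
            out.append(random.choice(FILLERS)); tags.append("D")
        if random.random() < p:          # repeat the word (a repetition disfluency)
            out.append(tok); tags.append("D")
        out.append(tok); tags.append("F")
    return out, tags

random.seed(0)
print(make_disfluent("i want to book a ticket".split()))
```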
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation [20.45373308116162]
This study focuses on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal).
For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system.
We find that using a self-training approach consistently yields improved performance (a relative WER reduction of up to 20.5% compared to using an ASR system trained on 24 minutes of ...)
arXiv Detail & Related papers (2023-05-18T13:20:38Z)
- Improving Accented Speech Recognition with Multi-Domain Training [2.28438857884398]
We use speech audio representing four different French accents to create fine-tuning datasets that improve the robustness of pre-trained ASR models.
Our numerical experiments show that we can reduce error rates by up to 25% (relative) on African and Belgian accents.
arXiv Detail & Related papers (2023-03-14T14:10:16Z)
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR [10.261890123213622]
We propose an on-the-fly data augmentation method for automatic speech recognition (ASR).
Our method, called Aligned Data Augmentation (ADA) for ASR, replaces transcribed tokens and the speech representations in an aligned manner to generate training pairs.
arXiv Detail & Related papers (2021-04-03T13:00:00Z)
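A rough sketch of the aligned replacement described above: with token-level time alignments available, a transcript token and its aligned span of speech frames are swapped together, so audio and label stay consistent. The alignment representation and swap policy below are assumptions for illustration only:

```python
import random
from typing import List, Tuple

# One aligned unit per token: (token, start_frame, end_frame).
Aligned = List[Tuple[str, int, int]]

def aligned_substitute(utt: Aligned, donor: Aligned, n_swaps: int = 1) -> Aligned:
    """Replace n_swaps tokens in `utt` with random donor units, carrying the
    donor's aligned frame span along so transcript and audio change together."""
    out = list(utt)
    for _ in range(n_swaps):
        i = random.randrange(len(out))
        out[i] = random.choice(donor)  # token and its frames are swapped as a pair
    return out

random.seed(0)
utt = [("the", 0, 10), ("cat", 10, 32), ("sat", 32, 50)]
donor = [("dog", 5, 30), ("ran", 30, 44)]
print(aligned_substitute(utt, donor))
```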
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
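The last entry can be outlined as a simple data-pooling loop; synthesize below is a placeholder for a TTS system trained on the ASR corpus itself, and all names are hypothetical:

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (audio_path, transcript)

def synthesize(text: str, out_path: str) -> Pair:
    """Placeholder for a TTS system built on the ASR training database."""
    return (out_path, text)

# Extra texts with no recorded audio (e.g., drawn from a text corpus).
extra_texts: List[str] = ["open the pod bay doors", "set a timer for ten minutes"]

real: List[Pair] = [("train-clean-100_0001.wav", "hello world")]
synthetic = [synthesize(t, f"tts_{i:04d}.wav") for i, t in enumerate(extra_texts)]

# The end-to-end recognizer is then trained on real plus synthesized speech.
augmented = real + synthetic
print(len(augmented), "training pairs")
```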