Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2103.07186v1
- Date: Fri, 12 Mar 2021 10:10:13 GMT
- Title: Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition
- Authors: Aleksandr Laptev, Andrei Andrusenko, Ivan Podluzhny, Anton Mitrofanov,
Ivan Medennikov, Yuri Matveev
- Abstract summary: We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER)
- Score: 62.94773371761236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of speech assistants, adapting server-intended
automatic speech recognition (ASR) solutions to a direct device has become
crucial. Researchers and industry prefer to use end-to-end ASR systems for
on-device speech recognition tasks. This is because end-to-end systems can be
made resource-efficient while maintaining a higher quality compared to hybrid
systems. However, building end-to-end models requires a significant amount of
speech data. Another challenging task associated with speech assistants is
personalization, which mainly lies in handling out-of-vocabulary (OOV) words.
In this work, we consider building an effective end-to-end ASR system in
low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel
Georgian tasks. To address the aforementioned problems, we propose a method of
dynamic acoustic unit augmentation based on the BPE-dropout technique. It
non-deterministically tokenizes utterances to extend the token's contexts and
to regularize their distribution for the model's recognition of unseen words.
It also reduces the need for optimal subword vocabulary size search. The
technique provides a steady improvement in regular and personalized
(OOV-oriented) speech recognition tasks (at least 6% relative WER and 25%
relative F-score) at no additional computational cost. Owing to the use of
BPE-dropout, our monolingual Turkish Conformer established a competitive result
with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is
close to the best published multilingual system.
Related papers
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - A transfer learning based approach for pronunciation scoring [7.98890440106366]
Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators.
Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only.
We present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring.
arXiv Detail & Related papers (2021-11-01T14:37:06Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z) - Streaming Language Identification using Combination of Acoustic
Representations and ASR Hypotheses [13.976935216584298]
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel.
We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses.
To reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early.
arXiv Detail & Related papers (2020-06-01T04:08:55Z) - Generative Adversarial Training Data Adaptation for Very Low-resource
Automatic Speech Recognition [31.808145263757105]
We use CycleGAN-based non-parallel voice conversion technology to forge a labeled training data that is close to the test speaker's speech.
We evaluate this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
arXiv Detail & Related papers (2020-05-19T07:35:14Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.