Low Resource German ASR with Untranscribed Data Spoken by Non-native
Children -- INTERSPEECH 2021 Shared Task SPAPL System
- URL: http://arxiv.org/abs/2106.09963v1
- Date: Fri, 18 Jun 2021 07:36:26 GMT
- Title: Low Resource German ASR with Untranscribed Data Spoken by Non-native
Children -- INTERSPEECH 2021 Shared Task SPAPL System
- Authors: Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan
- Abstract summary: This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data, an approximately 12% relative improvement over the official baseline (45.21%).
- Score: 19.435571932141364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge:
Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in
German. ~5 hours of transcribed data and ~60 hours of untranscribed data are
provided to develop a German ASR system for children. For the training of the
transcribed data, we propose a non-speech state discriminative loss (NSDL) to
mitigate the influence of long-duration non-speech segments within speech
utterances. In order to explore the use of the untranscribed data, various
approaches are implemented and combined together to incrementally improve the
system performance. First, bidirectional autoregressive predictive coding
(Bi-APC) is used to learn initial parameters for acoustic modelling using the
provided untranscribed data. Second, incremental semi-supervised learning is
further used to iteratively generate pseudo-transcribed data. Third, different
data augmentation schemes are used at different training stages to increase the
variability and size of the training data. Finally, a recurrent neural network
language model (RNNLM) is used for rescoring. Our system achieves a word error
rate (WER) of 39.68% on the evaluation data, an approximately 12% relative
improvement over the official baseline (45.21%).
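The abstract does not spell out how the non-speech state discriminative loss (NSDL) is formulated. Purely as an illustration of the general idea, the PyTorch sketch below adds an auxiliary speech/non-speech discrimination term to the usual frame-level cross-entropy over senones; the `non_speech_ids` set, the `aux_weight`, and the pooling of non-speech posteriors are assumptions for this sketch, not the paper's definition.

```python
# Hypothetical sketch of an NSDL-style objective (not the paper's exact formulation).
import torch
import torch.nn.functional as F

def nsdl_loss(logits, targets, non_speech_ids, aux_weight=0.5):
    """
    logits:         (batch, time, num_senones) acoustic-model outputs
    targets:        (batch, time) frame-level senone labels from forced alignment
    non_speech_ids: 1-D LongTensor of senone indices treated as non-speech (assumed)
    """
    # Standard frame-level cross-entropy over all senones.
    ce = F.cross_entropy(logits.transpose(1, 2), targets)

    # Assumed auxiliary term: collapse senone posteriors into a single
    # "non-speech" probability per frame and train it against a binary
    # speech/non-speech mask derived from the alignment, so that long
    # non-speech stretches are modelled explicitly.
    probs = logits.softmax(dim=-1)
    p_non_speech = probs[..., non_speech_ids].sum(dim=-1)        # (batch, time)
    is_non_speech = torch.isin(targets, non_speech_ids).float()  # (batch, time)
    aux = F.binary_cross_entropy(
        p_non_speech.clamp(1e-6, 1 - 1e-6), is_non_speech
    )
    return ce + aux_weight * aux
```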
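For the Bi-APC pretraining step: autoregressive predictive coding (APC) trains an encoder to predict an acoustic frame a few steps ahead under an L1 loss, and the bidirectional variant adds a backward stream over the time-reversed input. The sketch below follows that general recipe on unlabeled features; the layer sizes, the 3-frame shift, and the equal weighting of the two directions are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of a Bi-APC-style pretraining objective on untranscribed speech.
import torch
import torch.nn as nn

class BiAPC(nn.Module):
    def __init__(self, feat_dim=40, hidden=512):
        super().__init__()
        self.fwd_rnn = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.bwd_rnn = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.fwd_head = nn.Linear(hidden, feat_dim)
        self.bwd_head = nn.Linear(hidden, feat_dim)

    def forward(self, feats, shift=3):
        """feats: (batch, time, feat_dim) unlabeled filterbank features."""
        # Forward stream predicts the frame `shift` steps ahead (L1 loss).
        h_f, _ = self.fwd_rnn(feats)
        loss_f = (self.fwd_head(h_f[:, :-shift]) - feats[:, shift:]).abs().mean()

        # Backward stream sees the time-reversed signal and predicts `shift`
        # steps into its own future, i.e. the past of the original signal.
        rev = torch.flip(feats, dims=[1])
        h_b, _ = self.bwd_rnn(rev)
        loss_b = (self.bwd_head(h_b[:, :-shift]) - rev[:, shift:]).abs().mean()

        return loss_f + loss_b
```

After pretraining on the untranscribed audio, the pretrained parameters would serve as the initialization of the acoustic model before supervised training on the transcribed data.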
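The incremental semi-supervised step is, in outline, a pseudo-labeling loop: a seed model trained on the transcribed data decodes untranscribed audio, confident hypotheses are kept as pseudo-transcripts, and the model is retrained on the union. In the schematic below, `train_fn` and `decode_with_confidence` stand in for whatever toolkit is used, and the number of rounds and the confidence threshold are placeholders; the paper's exact data selection and scheduling are not given in the abstract.

```python
# Schematic incremental pseudo-labeling loop; train_fn and decode_with_confidence
# are toolkit-specific callables supplied by the caller (placeholders here).
def incremental_semi_supervised(labeled, unlabeled, train_fn, decode_with_confidence,
                                rounds=3, threshold=0.8):
    model = train_fn(labeled)                    # seed model on the transcribed data
    for _ in range(rounds):
        pseudo = []
        for utt in unlabeled:
            hyp, conf = decode_with_confidence(model, utt)
            if conf >= threshold:                # keep only confident hypotheses
                pseudo.append((utt, hyp))
        # Retrain on real transcripts plus the current pseudo-transcribed pool.
        model = train_fn(labeled + pseudo)
    return model
```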
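The abstract does not name the augmentation schemes applied at each training stage. As one widely used example (not necessarily the paper's choice), a SpecAugment-style time/frequency masking of the input features could look like this:

```python
# SpecAugment-style masking, shown only as a representative augmentation scheme.
import torch

def spec_augment(feats, num_time_masks=2, max_time=30, num_freq_masks=2, max_freq=8):
    """feats: (time, freq) log-mel features; returns a masked copy."""
    out = feats.clone()
    n_time, n_freq = out.shape
    for _ in range(num_time_masks):              # zero out random time spans
        width = int(torch.randint(0, max_time + 1, (1,)))
        start = int(torch.randint(0, max(1, n_time - width), (1,)))
        out[start:start + width, :] = 0.0
    for _ in range(num_freq_masks):              # zero out random frequency bands
        width = int(torch.randint(0, max_freq + 1, (1,)))
        start = int(torch.randint(0, max(1, n_freq - width), (1,)))
        out[:, start:start + width] = 0.0
    return out
```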
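RNNLM rescoring is typically applied by re-ranking an n-best list from the first decoding pass, interpolating the first-pass score with the RNN language model's log-probability of each hypothesis. The interface, interpolation weight, and insertion penalty below are assumptions for illustration; in practice they would be tuned on a development set.

```python
# Illustrative n-best rescoring with an RNN language model.
def rescore_nbest(nbest, rnnlm_logprob, lm_weight=0.5, word_penalty=0.0):
    """
    nbest:         list of (hypothesis_text, first_pass_score) pairs from the decoder
    rnnlm_logprob: callable returning the RNNLM log-probability of a word sequence
    """
    def total_score(item):
        text, first_pass = item
        words = text.split()
        return first_pass + lm_weight * rnnlm_logprob(words) - word_penalty * len(words)

    best_text, _ = max(nbest, key=total_score)
    return best_text
```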
Related papers
- Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers [6.017182111335404]
Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems.
We show that the Transducer system trained using transcripts produced by the hybrid system achieves 18% reduction in terms of word error rate.
arXiv Detail & Related papers (2023-05-23T03:50:35Z)
- Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation [20.45373308116162]
This study focuses on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal).
For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system.
We find that using a self-training approach consistently yields improved performance (a relative WER reduction of up to 20.5% compared to using an ASR system trained on 24 minutes of human-transcribed data).
arXiv Detail & Related papers (2023-05-18T13:20:38Z)
- Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis [87.75833205560406]
This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system.
It does not require pooled data from all languages altogether, and thus alleviates the storage and computation burden.
arXiv Detail & Related papers (2021-10-09T07:00:38Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves an average word timing difference of less than 50 ms.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio [8.510792628268824]
Bootstrapping speech recognition on limited data resources has long been an area of active research.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, using a post-ASR text-to-text mapping and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem.
We use the language identities to bias the model to predict the CS points.
This promotes the model to learn the language identity information directly from transcription, and no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)