The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR
Challenge
- URL: http://arxiv.org/abs/2005.08433v2
- Date: Tue, 2 Jun 2020 19:07:08 GMT
- Title: The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR
Challenge
- Authors: Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen
- Abstract summary: This paper describes the NTNU ASR system submitted to the Interspeech 2020 Non-Native Children's Speech ASR Challenge, supported by the SIG-CHILD group of ISCA.
All participants were restricted to developing their systems using only the speech and text corpora provided by the organizer.
To work around this under-resourced setting, we built our ASR system on top of CNN-TDNNF-based acoustic models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This ASR shared task is made much more challenging due to the
coexisting diversity of non-native and children speaking characteristics. In
the setting of closed-track evaluation, all participants were restricted to
developing their systems based solely on the speech and text corpora provided by
the organizer. To work around this under-resourced condition, we built our ASR
system on top of CNN-TDNNF-based acoustic models, meanwhile harnessing the
synergistic power of various data augmentation strategies, including both
utterance- and word-level speed perturbation and spectrogram augmentation,
alongside a simple yet effective data-cleansing approach. All variants of our
ASR system employed an RNN-based language model to rescore the first-pass
recognition hypotheses, which was trained solely on the text dataset released
by the organizer. Our system with the best configuration came out in second
place with a word error rate (WER) of 17.59%, while the top-performing, second
runner-up, and official baseline systems achieved 15.67%, 18.71%, and 35.09%,
respectively.
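Systems in the challenge are ranked by word error rate. As a quick illustration of how figures like 17.59% are computed, here is a minimal word-level Levenshtein sketch; this is not the organizer's scoring script, just the standard definition of WER.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, computed with a
    standard word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat")` is 0.0, while one substitution in a three-word reference yields a WER of 1/3.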
Related papers
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- THUEE system description for NIST 2020 SRE CTS challenge [19.2916501364633]
This paper presents the system description of the THUEE team for the NIST 2020 Speaker Recognition Evaluation (SRE) conversational telephone speech (CTS) challenge.
The subsystems including ResNet74, ResNet152, and RepVGG-B2 are developed as speaker embedding extractors in this evaluation.
arXiv Detail & Related papers (2022-10-12T12:01:59Z)
- The 2021 NIST Speaker Recognition Evaluation [1.5282767384702267]
The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996.
This paper presents an overview of SRE21 including the tasks, performance metric, data, evaluation protocol, results and system performance analyses.
arXiv Detail & Related papers (2022-04-21T16:18:52Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
- Earnings-21: A Practical Benchmark for ASR in the Wild [4.091202801240259]
We present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors.
We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model.
Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage.
arXiv Detail & Related papers (2021-04-22T23:04:28Z)
- Refining Automatic Speech Recognition System for older adults [7.3709604810699085]
We develop an ASR system for socially isolated seniors (80+ years old) with possible cognitive impairments.
We experimentally identify that ASR for the adult population performs poorly on our target population.
We further improve the system by leveraging an attention mechanism to utilize the model's intermediate information.
arXiv Detail & Related papers (2020-11-17T00:00:45Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.