Exploring Non-Autoregressive End-To-End Neural Modeling For English
Mispronunciation Detection And Diagnosis
- URL: http://arxiv.org/abs/2111.00844v1
- Date: Mon, 1 Nov 2021 11:23:48 GMT
- Title: Exploring Non-Autoregressive End-To-End Neural Modeling For English
Mispronunciation Detection And Diagnosis
- Authors: Hsin-Wei Wang, Bi-Cheng Yan, Hsuan-Sheng Chiu, Yung-Chang Hsu, Berlin
Chen
- Abstract summary: End-to-end (E2E) neural modeling has emerged as one predominant school of thought for developing computer-assisted pronunciation training (CAPT) systems.
We present a novel MD&D method that leverages non-autoregressive (NAR) E2E neural modeling to dramatically speed up the inference time.
In addition, we design and develop a pronunciation modeling network stacked on top of the NAR E2E models of our method to further boost the effectiveness of MD&D.
- Score: 12.153618111267514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) neural modeling has emerged as one predominant school of
thought for developing computer-assisted pronunciation training (CAPT) systems,
showing performance competitive with conventional pronunciation-scoring based
methods. However, current E2E neural methods for CAPT face at least two pivotal
challenges. On one hand, most E2E methods operate in an autoregressive manner,
using left-to-right beam search to dictate the pronunciations of L2 learners.
This, however, leads to very slow inference, which inevitably hinders their
practical use. On the other hand, E2E neural methods are normally data-hungry,
and an insufficient amount of nonnative training data often reduces their
efficacy on mispronunciation detection and diagnosis (MD&D). In response, we
put forward a novel MD&D method that leverages non-autoregressive (NAR) E2E
neural modeling to dramatically speed up inference while maintaining
performance in line with conventional E2E neural methods. In addition, we
design and develop a pronunciation modeling network stacked on top of our
method's NAR E2E models to further boost the effectiveness of MD&D. Empirical
experiments conducted on the L2-ARCTIC English dataset seem to validate the
feasibility of our method in comparison with some top-of-the-line E2E models
and an iconic pronunciation-scoring based method built on a DNN-HMM acoustic
model.
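To make the NAR speed argument concrete, below is a minimal Python sketch of the core recipe rather than the authors' actual architecture: a CTC-style model scores all frames in one parallel pass, greedy decoding collapses the frame labels into a phone sequence, and aligning that sequence against the canonical (prompted) phones yields MD&D verdicts. All names, shapes, and the blank index are illustrative assumptions.

```python
# Illustrative NAR MD&D sketch; not the paper's code.
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """Non-autoregressive decoding: one argmax per frame over the
    (T, num_phones) posteriors, then collapse repeats and drop blanks."""
    best = log_probs.argmax(axis=-1)
    out, prev = [], blank
    for t in best:
        if t != blank and t != prev:
            out.append(int(t))
        prev = t
    return out

def diagnose(canonical: list[str], hypothesis: list[str]):
    """Align recognized phones with the canonical (prompted) phones via
    edit distance; every mismatch is a mispronunciation candidate."""
    n, m = len(canonical), len(hypothesis)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(n + 1), np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # match / substitution
    i, j, report = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + (
                0 if canonical[i - 1] == hypothesis[j - 1] else 1):
            ok = canonical[i - 1] == hypothesis[j - 1]
            report.append((canonical[i - 1],
                           "ok" if ok else f"substituted -> {hypothesis[j - 1]}"))
            i, j = i - 1, j - 1
        elif i > 0 and d[i, j] == d[i - 1, j] + 1:
            report.append((canonical[i - 1], "deleted"))
            i -= 1
        else:
            report.append((hypothesis[j - 1], "inserted"))
            j -= 1
    return report[::-1]
```

Because no left-to-right beam search is involved, decoding costs a single forward pass plus a linear collapse, which is where the inference speed-up comes from; the paper's stacked pronunciation modeling network then refines such raw decisions.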
Related papers
- EEGDiR: Electroencephalogram denoising network for temporal information storage and global modeling through Retentive Network [11.491355463353731]
We introduce Retnet from natural language processing to EEG denoising.
Direct application of Retnet to EEG denoising is infeasible due to the one-dimensional nature of EEG signals.
We propose the signal embedding method, transforming one-dimensional EEG signals into two dimensions for use as network inputs.
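A toy illustration of the signal-embedding idea described above, assuming the simplest scheme of slicing the trace into overlapping fixed-length windows; the window length, stride, and function name are made-up choices, not the paper's:

```python
# Toy 1-D -> 2-D signal embedding for a single-channel EEG trace.
import numpy as np

def embed_1d_signal(x: np.ndarray, win: int = 64, stride: int = 32) -> np.ndarray:
    """Slice a 1-D signal of shape (T,) into overlapping windows,
    yielding a 2-D array (num_windows, win) a sequence model can consume."""
    starts = range(0, len(x) - win + 1, stride)
    return np.stack([x[s:s + win] for s in starts])

eeg = np.random.randn(512)      # stand-in for one EEG channel
tokens = embed_1d_signal(eeg)   # shape: (15, 64)
```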
arXiv Detail & Related papers (2024-03-20T15:04:21Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs.
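A minimal sketch of a skew Kullback-Leibler loss, assuming the common definition KL(p || alpha*p + (1-alpha)*q) with teacher p and student q; the mixing direction and the value of alpha are assumptions, not a transcription of DistiLLM's released code:

```python
# Hedged sketch of a skew KL divergence distillation loss.
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
            alpha: float = 0.1) -> torch.Tensor:
    """KL(p || alpha*p + (1-alpha)*q). Mixing a little teacher mass into
    the student distribution keeps the divergence finite wherever the
    student assigns near-zero probability."""
    p = teacher_logits.softmax(dim=-1)
    q = student_logits.softmax(dim=-1)
    mix = alpha * p + (1.0 - alpha) * q
    return F.kl_div(mix.log(), p, reduction="batchmean")
```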
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
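The up-scaling special case reduces to simple log-probability arithmetic at each decoding step; the function below is a hypothetical illustration of that combination, not the authors' released implementation:

```python
# Hypothetical sketch of EFT "LM up-scaling" at one decoding step.
import torch

def eft_upscale(large_base_logp: torch.Tensor,
                small_tuned_logp: torch.Tensor,
                small_base_logp: torch.Tensor) -> torch.Tensor:
    """Combine next-token log-probs (each of shape (vocab,)):
    log pi(y|x) ~ log pi_large_base + (log pi_small_tuned - log pi_small_base),
    i.e. the large base model plus the small models' fine-tuning delta."""
    return torch.log_softmax(
        large_base_logp + (small_tuned_logp - small_base_logp), dim=-1)
```

Sampling from this renormalized distribution step by step emulates fine-tuning at the large scale without ever training the large model.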
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representations of the data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach that uses a likelihood ratio to leverage text data sources to adapt the RNN-T model to new domains and entities.
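As a hedged sketch of what such biasing could look like inside beam search, each partial hypothesis receives a bonus proportional to how much more a target-domain LM likes it than a source-domain LM does; the weight and names below are illustrative:

```python
# Illustrative likelihood-ratio biasing of a beam-search hypothesis score.
def biased_score(asr_logp: float, target_lm_logp: float,
                 source_lm_logp: float, lam: float = 0.4) -> float:
    """ASR score plus lam * log(P_target(y) / P_source(y)); hypotheses
    favored by the target-domain LM are boosted on the fly."""
    return asr_logp + lam * (target_lm_logp - source_lm_logp)
```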
arXiv Detail & Related papers (2022-01-10T21:22:39Z)
- Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms [17.317583079824423]
We propose two strategies to enhance the discrimination capability of E2E MD models.
One is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model.
The other is label augmentation, which manages to capture more phonological patterns from the transcripts of training data.
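As a rough illustration of the input-augmentation idea (the label-augmentation side is purely data preparation), one can picture frame-level phone posteriors from the DNN-HMM system being concatenated with the standard acoustic features before the E2E encoder; the dimensions below are made up:

```python
# Illustrative input augmentation: append DNN-HMM posteriors to features.
import numpy as np

def augment_inputs(fbank: np.ndarray, hmm_posteriors: np.ndarray) -> np.ndarray:
    """Concatenate filterbank features (T, 80) with frame-synchronous
    DNN-HMM phone posteriors (T, P), letting the E2E model absorb the
    HMM system's phonetic discrimination knowledge."""
    assert fbank.shape[0] == hmm_posteriors.shape[0]
    return np.concatenate([fbank, hmm_posteriors], axis=-1)
```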
arXiv Detail & Related papers (2021-10-17T06:11:15Z)
- Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models [1.6328866317851185]
A deep neural network (DNN)-based speech enhancement (SE) method is proposed in this paper.
Our method uses two DNNs: one for speech processing and one for mimicking the CERs derived through an acoustic model (AM).
Experimental results show that our method improves the CER derived through a black-box AM by 7.3% relative, even though some noise remains in the enhanced speech.
arXiv Detail & Related papers (2021-10-12T12:51:53Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
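A schematic PyTorch sketch of the factorization, with separate prediction branches for blank and vocabulary tokens so the vocabulary branch behaves like a standalone LM that can be adapted on text alone; layer types and sizes are placeholders, not the paper's configuration:

```python
# Schematic factorized prediction network; shapes/layers are placeholders.
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    def __init__(self, vocab: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blank_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.vocab_rnn = nn.LSTM(dim, dim, batch_first=True)  # acts as an LM
        self.blank_out = nn.Linear(dim, 1)      # blank logit
        self.vocab_out = nn.Linear(dim, vocab)  # vocabulary logits

    def forward(self, prev_tokens: torch.Tensor):
        """prev_tokens: (B, U) label history. Returns separate blank and
        vocabulary scores; only the vocabulary branch needs text-only
        adaptation for a new domain."""
        e = self.embed(prev_tokens)
        hb, _ = self.blank_rnn(e)
        hv, _ = self.vocab_rnn(e)
        return self.blank_out(hb), self.vocab_out(hv)
```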
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR.
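Per hypothesis, the integration amounts to one line of log-linear score arithmetic, sketched below with illustrative weights: the internal LM estimate is subtracted before the external LM is added, so the source-domain prior is not counted twice:

```python
# Illustrative ILME-style fusion of E2E, internal-LM and external-LM scores.
def ilme_score(e2e_logp: float, internal_lm_logp: float,
               external_lm_logp: float,
               lam_int: float = 0.3, lam_ext: float = 0.5) -> float:
    """score = log P_e2e(y|x) - lam_int*log P_int(y) + lam_ext*log P_ext(y)."""
    return e2e_logp - lam_int * internal_lm_logp + lam_ext * external_lm_logp
```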
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used to select the better translation among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
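The selection step can be sketched in a few lines, assuming a scoring callback backed by the jointly trained AR decoder; each candidate needs only one teacher-forced scoring pass, which is why the overhead stays negligible:

```python
# Sketch of picking the best NAR length-beam candidate by AR rescoring.
from typing import Callable, Sequence

def select_by_ar_rescoring(candidates: Sequence[list[int]],
                           ar_score: Callable[[list[int]], float]) -> list[int]:
    """candidates: one NAR hypothesis per target length; ar_score returns
    the AR decoder's log-probability for a hypothesis."""
    return max(candidates, key=ar_score)
```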
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the hybrid CTC-Attention approach for the mispronunciation detection (MD) task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z)