An End-to-End Mispronunciation Detection System for L2 English Speech
Leveraging Novel Anti-Phone Modeling
- URL: http://arxiv.org/abs/2005.11950v2
- Date: Fri, 28 Aug 2020 07:41:49 GMT
- Title: An End-to-End Mispronunciation Detection System for L2 English Speech
Leveraging Novel Anti-Phone Modeling
- Authors: Bi-Cheng Yan, Meng-Che Wu, Hsiao-Tsung Hung, Berlin Chen
- Abstract summary: Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT)
We propose to conduct MDD with a novel end-to-end automatic speech recognition (E2E-based ASR) approach.
In particular, we expand the original L2 phone set with their corresponding anti-phone set, aiming to provide better mispronunciation detection and diagnosis feedback.
- Score: 11.894724235336872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mispronunciation detection and diagnosis (MDD) is a core component of
computer-assisted pronunciation training (CAPT). Most of the existing MDD
approaches focus on dealing with categorical errors (viz. one canonical phone
is substituted by another one, aside from those mispronunciations caused by
deletions or insertions). However, accurate detection and diagnosis of
non-categorical or distortion errors (viz. approximating L2 phones with L1
(first-language) phones, or erroneous pronunciations in between) still seems
out of reach. In view of this, we propose to conduct MDD with a novel
end-to-end automatic speech recognition (E2E-based ASR) approach. In particular, we
expand the original L2 phone set with a corresponding anti-phone set,
giving the E2E-based MDD approach a better capability to handle both
categorical and non-categorical mispronunciations, aiming to provide better
mispronunciation detection and diagnosis feedback. Furthermore, a novel
transfer-learning paradigm is devised to obtain the initial model estimate of
the E2E-based MDD system without recourse to any phonological rules. Extensive
sets of experimental results on the L2-ARCTIC dataset show that our best system
can outperform the existing E2E baseline system and the pronunciation-scoring-based
method (GOP) in terms of the F1-score, by 11.05% and 27.71%, respectively.
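The anti-phone idea can be sketched in a few lines: the phone inventory is doubled so that each canonical phone has a paired anti-phone absorbing distorted realizations, and the recognizer's output is aligned against the canonical prompt to flag errors. The `*` suffix, the `difflib`-based alignment, and the error labels below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact model): double the phone
# inventory with anti-phones ("P*" here), then align the recognized
# sequence against the canonical prompt to label each error.
from difflib import SequenceMatcher
from itertools import zip_longest

def expand_with_anti_phones(phone_set):
    """Return the doubled inventory: every phone plus its anti-phone."""
    return set(phone_set) | {p + "*" for p in phone_set}

def detect_mispronunciations(canonical, recognized):
    """Align recognized phones to the canonical prompt and flag errors."""
    errors = []
    sm = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            for c, r in zip_longest(canonical[i1:i2], recognized[j1:j2]):
                if c is None:
                    errors.append((None, r, "insertion"))
                elif r is None:
                    errors.append((c, None, "deletion"))
                elif r == c + "*":
                    # Anti-phone of the expected phone: a distortion error.
                    errors.append((c, r, "distortion"))
                else:
                    errors.append((c, r, "substitution"))
        elif tag == "delete":
            errors.extend((c, None, "deletion") for c in canonical[i1:i2])
        elif tag == "insert":
            errors.extend((None, r, "insertion") for r in recognized[j1:j2])
    return errors

canonical  = ["DH", "IH", "S"]   # prompt for "this"
recognized = ["D", "IH*", "S"]   # L1-flavored /d/, distorted vowel
print(detect_mispronunciations(canonical, recognized))
# → [('DH', 'D', 'substitution'), ('IH', 'IH*', 'distortion')]
```

The distortion label is exactly what the expanded phone set buys: without anti-phones in the inventory, the distorted vowel would have collapsed into either a correct acceptance or an arbitrary substitution.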
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output, but they operate on the decoded text alone.
In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription, through a novel late-fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - Phonological Level wav2vec2-based Mispronunciation Detection and
Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can leverage their generative capability to correct even tokens that are missing from the N-best list.
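The benchmark pairs N-best hypotheses with an LLM corrector; as a much simpler, hypothetical stand-in for combining an N-best list, a position-wise majority vote (a ROVER-style baseline, not HyPoradise's method) looks like this:

```python
# ROVER-style position-wise vote over N-best hypotheses -- a deliberately
# simple stand-in for the LLM-based correction the benchmark targets.
from collections import Counter
from itertools import zip_longest

def nbest_vote(hypotheses):
    """Pick the most frequent token at each position across the N-best list."""
    merged = []
    for column in zip_longest(*[h.split() for h in hypotheses]):
        tokens = [t for t in column if t is not None]
        merged.append(Counter(tokens).most_common(1)[0][0])
    return " ".join(merged)

nbest = ["the cat sat on the mat",
         "a cat sat on the mat",
         "the cat sad on the mat"]
print(nbest_vote(nbest))   # → "the cat sat on the mat"
```

The vote assumes the hypotheses are roughly position-aligned and can only emit tokens that already appear somewhere in the list; a generative corrector has neither limitation, which is the benchmark's point.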
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z) - Improving Mispronunciation Detection with Wav2vec2-based Momentum
Pseudo-Labeling for Accentedness and Intelligibility Assessment [28.76055994423364]
Current mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition.
One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech.
We leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models.
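The pseudo-labeling (PL) step can be sketched as confidence-filtered self-training; the `model` callable and threshold below are placeholders, and the paper's momentum (teacher-student) update is omitted from this sketch.

```python
# Hypothetical confidence-filtered pseudo-labeling loop. `model` is a
# stand-in callable returning (phoneme_sequence, confidence); the momentum
# teacher-student machinery of the actual method is not modeled here.

def pseudo_label(model, unlabeled_utts, threshold=0.9):
    """Keep (utterance, predicted phonemes) pairs the model is confident about."""
    kept = []
    for utt in unlabeled_utts:
        phonemes, confidence = model(utt)
        if confidence >= threshold:
            kept.append((utt, phonemes))
    return kept

# Toy stand-in model: "confident" only on utterances containing "clear".
toy_model = lambda utt: (utt.upper().split(), 0.95 if "clear" in utt else 0.5)
print(pseudo_label(toy_model, ["clear speech", "mumbled words"]))
# → [('clear speech', ['CLEAR', 'SPEECH'])]
```

The surviving pairs would then be mixed into the next fine-tuning round, letting unlabeled L2 speech substitute for scarce human phoneme annotations.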
arXiv Detail & Related papers (2022-03-29T22:40:31Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of traditional speech recognition system into a single model.
Although it simplifies the ASR system, it introduces a contextual-ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Improving End-To-End Modeling for Mispronunciation Detection with
Effective Augmentation Mechanisms [17.317583079824423]
We propose two strategies to enhance the discrimination capability of E2E MD models.
One is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model.
The other is label augmentation, which manages to capture more phonological patterns from the transcripts of training data.
arXiv Detail & Related papers (2021-10-17T06:11:15Z) - An Approach to Mispronunciation Detection and Diagnosis with Acoustic,
Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
arXiv Detail & Related papers (2021-10-14T11:25:02Z) - Learning Word-Level Confidence For Subword End-to-End ASR [48.09713798451474]
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).
The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
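As a hedged illustration of the subword-to-word granularity mismatch this paper addresses, one simple fixed rule (which the paper replaces with a learned confidence module) takes the minimum confidence over each word's pieces, using the SentencePiece `▁` word-boundary convention:

```python
# One fixed heuristic for the subword-to-word mapping: a word's confidence
# is the minimum over its word pieces. The paper *learns* this mapping
# instead; this sketch only illustrates the granularity mismatch.

def word_confidences(pieces, piece_conf):
    """Merge word pieces ('▁' marks a word start, SentencePiece-style)."""
    words, confs = [], []
    for piece, conf in zip(pieces, piece_conf):
        if piece.startswith("▁"):
            words.append(piece[1:])
            confs.append(conf)
        else:
            words[-1] += piece
            confs[-1] = min(confs[-1], conf)
    return list(zip(words, confs))

print(word_confidences(["▁spe", "ech", "▁rec", "og", "nition"],
                       [0.9, 0.8, 0.95, 0.6, 0.7]))
# → [('speech', 0.8), ('recognition', 0.6)]
```

A single weak piece drags down its whole word under the min rule, which is exactly the kind of rigid behavior a learned word-level module can improve on.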
arXiv Detail & Related papers (2021-03-11T15:03:33Z) - An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the hybrid CTC-Attention approach for the mispronunciation detection task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z) - Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)
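Reporting false rejection rates at pre-specified false alarm rates, as the wake-word paper does, can be sketched as a threshold sweep over detector scores; the scores and operating point below are synthetic.

```python
# Threshold sweep: find the lowest threshold whose false alarm rate (FAR)
# on negative examples meets the target, then report the false rejection
# rate (FRR) on positives at that threshold. Scores here are synthetic.

def frr_at_far(pos_scores, neg_scores, target_far):
    """False rejection rate at the operating point meeting `target_far`."""
    for thr in sorted(pos_scores + neg_scores):
        far = sum(s >= thr for s in neg_scores) / len(neg_scores)
        if far <= target_far:
            return sum(s < thr for s in pos_scores) / len(pos_scores)
    return 1.0  # no threshold meets the target

wake  = [0.9, 0.8, 0.15, 0.95]   # detector scores on true wake-word clips
other = [0.1, 0.2, 0.85, 0.3]    # detector scores on background speech
print(frr_at_far(wake, other, target_far=0.25))   # → 0.25
```

Fixing the false alarm rate first reflects the always-on deployment constraint: spurious wake-ups are the cost users notice, so systems are compared on how many true activations they sacrifice at the same alarm budget.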
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.