Improving End-To-End Modeling for Mispronunciation Detection with
Effective Augmentation Mechanisms
- URL: http://arxiv.org/abs/2110.08731v1
- Date: Sun, 17 Oct 2021 06:11:15 GMT
- Title: Improving End-To-End Modeling for Mispronunciation Detection with
Effective Augmentation Mechanisms
- Authors: Tien-Hong Lo, Yao-Ting Sung and Berlin Chen
- Abstract summary: We propose two strategies to enhance the discrimination capability of E2E MD models.
One is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model.
The other is label augmentation, which manages to capture more phonological patterns from the transcripts of training data.
- Score: 17.317583079824423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, end-to-end (E2E) models, which take spectral vector
sequences of L2 (second-language) learners' utterances as input and produce the
corresponding phone-level sequences as output, have attracted much research
attention in developing mispronunciation detection (MD) systems. However, due
to the lack of sufficient labeled speech data of L2 speakers for model
estimation, E2E MD models are more prone to overfitting than conventional
ones that are built on DNN-HMM acoustic models. To alleviate this critical
issue, in this paper we propose two modeling strategies to enhance the
discrimination capability of E2E MD models, which implicitly leverage,
respectively, the phonetic traits encoded in a pretrained acoustic model and
the phonological traits contained in the reference transcripts of the training
data. The first one is input augmentation, which aims to distill
knowledge about phonetic discrimination from a DNN-HMM acoustic model. The
second one is label augmentation, which manages to capture more phonological
patterns from the transcripts of training data. A series of empirical
experiments conducted on the L2-ARCTIC English dataset seem to confirm the
efficacy of our E2E MD model when compared to some top-of-the-line E2E MD
models and a classic pronunciation-scoring based method built on a DNN-HMM
acoustic model.
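As a rough illustration of the input-augmentation idea described above, the sketch below appends a DNN-HMM teacher's frame-level phone posteriors to the learner's spectral features. The function name and frame alignment are assumptions for illustration only; the paper's exact feature combination may differ.

```python
def augment_inputs(spectral_frames, teacher_posteriors):
    """Input augmentation sketch: append the DNN-HMM teacher's phone
    posterior vector to the spectral vector of the same frame, so the
    E2E student model sees the teacher's phonetic evidence alongside
    the raw acoustics. Frame-level alignment is assumed."""
    if len(spectral_frames) != len(teacher_posteriors):
        raise ValueError("teacher and student inputs must be frame-aligned")
    return [s + p for s, p in zip(spectral_frames, teacher_posteriors)]
```

The augmented vectors would then be fed to the E2E MD model in place of the raw spectral input.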
Related papers
- Enhancing CTC-based speech recognition with diverse modeling units [2.723573795552244]
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable.
On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model.
We propose an efficient joint training approach, where E2E models are trained jointly with diverse modeling units.
arXiv Detail & Related papers (2024-06-05T13:52:55Z)
- Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Exploring Non-Autoregressive End-To-End Neural Modeling For English Mispronunciation Detection And Diagnosis [12.153618111267514]
End-to-end (E2E) neural modeling has emerged as one predominant school of thought to develop computer-assisted pronunciation training (CAPT) systems.
We present a novel MD&D method that leverages non-autoregressive (NAR) E2E neural modeling to dramatically speed up the inference time.
In addition, we design and develop a pronunciation modeling network stacked on top of the NAR E2E models of our method to further boost the effectiveness of MD&D.
arXiv Detail & Related papers (2021-11-01T11:23:48Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR.
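The ILME-style integration can be sketched as a per-hypothesis score that adds the external LM's log-probability and subtracts the estimated internal LM's. The interpolation weights below are illustrative assumptions, not values from the paper.

```python
def ilme_score(log_p_e2e, log_p_ext_lm, log_p_ilm,
               lam_ext=0.6, lam_ilm=0.4):
    """Shallow fusion with internal-LM subtraction: boost hypotheses the
    external LM favors while discounting the E2E model's implicit
    source-domain LM, so its bias is not double-counted."""
    return log_p_e2e + lam_ext * log_p_ext_lm - lam_ilm * log_p_ilm
```

During beam search, each hypothesis would be ranked by this combined score instead of the raw E2E log-probability.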
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
- An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling [11.894724235336872]
Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT).
We propose to conduct MDD with a novel end-to-end automatic speech recognition (E2E-based ASR) approach.
In particular, we expand the original L2 phone set with their corresponding anti-phone set, aiming to provide better mispronunciation detection and diagnosis feedback.
arXiv Detail & Related papers (2020-05-25T07:27:47Z)
- An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the CTC-Attention approach for the mispronunciation detection task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.