An Effective End-to-End Modeling Approach for Mispronunciation Detection
- URL: http://arxiv.org/abs/2005.08440v1
- Date: Mon, 18 May 2020 03:37:21 GMT
- Title: An Effective End-to-End Modeling Approach for Mispronunciation Detection
- Authors: Tien-Hong Lo, Shi-Yan Weng, Hsiu-Jui Chang, and Berlin Chen
- Abstract summary: We present a novel use of a hybrid CTC-Attention approach to the mispronunciation detection (MD) task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
- Score: 12.113290059233977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end (E2E) automatic speech recognition (ASR) systems have
garnered tremendous attention because of their great success and unified
modeling paradigms in comparison to conventional hybrid DNN-HMM ASR systems.
Despite the widespread adoption of E2E modeling frameworks for ASR, there
still is a dearth of work investigating the E2E frameworks for use in
computer-assisted pronunciation training (CAPT), particularly for
mispronunciation detection (MD). In response, we first present a novel use of
a hybrid CTC-Attention approach to the MD task, taking advantage of the
strengths of both CTC and the attention-based model while getting around the
need for phone-level forced alignment. Second, we perform input augmentation
with text prompt information to make the resulting E2E model more tailored
for the MD task. In addition, we adopt two MD decision methods so as to better
cooperate with the proposed framework: 1) decision-making based on a
recognition confidence measure or 2) simply based on speech recognition
results. A series of Mandarin MD experiments demonstrate that our approach not
only simplifies the processing pipeline of existing hybrid DNN-HMM systems but
also brings about systematic and substantial performance improvements.
Furthermore, input augmentation with text prompts seems to hold excellent
promise for the E2E-based MD approach.
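To make the joint training objective and the two MD decision methods concrete, here is a minimal Python sketch. It is not the authors' implementation: joint_ctc_attention_loss only illustrates the standard weighted CTC/attention objective that hybrid CTC-Attention models typically optimize, and the AlignedPhone structure, example phones, and threshold are hypothetical, assuming phone hypotheses have already been aligned to the canonical phones of the text prompt.
```python
# Minimal, hypothetical sketch (not the authors' released code).
from dataclasses import dataclass
from typing import List


def joint_ctc_attention_loss(ctc_loss: float, att_loss: float,
                             ctc_weight: float = 0.3) -> float:
    """Multi-task objective commonly used by hybrid CTC-Attention models:
    L = w * L_CTC + (1 - w) * L_attention."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


@dataclass
class AlignedPhone:
    canonical: str     # phone expected from the text prompt
    recognized: str    # phone produced by the E2E recognizer
    confidence: float  # recognizer's confidence in the canonical phone, in [0, 1]


def md_by_recognition(phones: List[AlignedPhone]) -> List[bool]:
    """Decision based on speech recognition results: flag a phone as
    mispronounced whenever the recognized phone differs from the canonical one."""
    return [p.recognized != p.canonical for p in phones]


def md_by_confidence(phones: List[AlignedPhone], threshold: float = 0.5) -> List[bool]:
    """Decision based on a recognition confidence measure: flag a phone
    whenever the confidence assigned to the canonical phone falls below a threshold."""
    return [p.confidence < threshold for p in phones]


if __name__ == "__main__":
    # Hypothetical Mandarin syllables "shi4 jie4" split into initials/finals.
    utterance = [
        AlignedPhone("sh", "sh", 0.92),
        AlignedPhone("i4", "i4", 0.88),
        AlignedPhone("j", "zh", 0.31),
        AlignedPhone("ie4", "ie4", 0.47),
    ]
    print(md_by_recognition(utterance))             # [False, False, True, False]
    print(md_by_confidence(utterance))              # [False, False, True, True]
    print(joint_ctc_attention_loss(2.0, 1.0, 0.5))  # 0.5*2.0 + 0.5*1.0 = 1.5
```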
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- Enhancing CTC-based speech recognition with diverse modeling units [2.723573795552244]
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable.
On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring the E2E model's N-best hypotheses with a phoneme-based model.
We propose an efficient joint training approach, where E2E models are trained jointly with diverse modeling units.
arXiv Detail & Related papers (2024-06-05T13:52:55Z)
- Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR.
Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models.
We propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure.
arXiv Detail & Related papers (2023-12-31T09:24:21Z)
- Acoustic Model Fusion for End-to-end Speech Recognition [7.431401982826315]
End-to-end speech recognition systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM).
We propose the integration of an external AM into the E2E system to better address the domain mismatch.
We have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets.
arXiv Detail & Related papers (2023-10-10T23:00:17Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Integrate Lattice-Free MMI into End-to-End Speech Recognition [87.01137882072322]
In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems.
With this motivation, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems.
Previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.
In this work, novel algorithms are proposed to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems.
arXiv Detail & Related papers (2022-03-29T14:32:46Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
- Exploring Non-Autoregressive End-To-End Neural Modeling For English Mispronunciation Detection And Diagnosis [12.153618111267514]
End-to-end (E2E) neural modeling has emerged as one predominant school of thought to develop computer-assisted pronunciation training (CAPT) systems.
We present a novel MD&D method that leverages non-autoregressive (NAR) E2E neural modeling to dramatically speed up the inference time.
In addition, we design and develop a pronunciation modeling network stacked on top of the NAR E2E models of our method to further boost the effectiveness of MD&D.
arXiv Detail & Related papers (2021-11-01T11:23:48Z)
- Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms [17.317583079824423]
We propose two strategies to enhance the discrimination capability of E2E MD models.
One is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model.
The other is label augmentation, which manages to capture more phonological patterns from the transcripts of training data.
arXiv Detail & Related papers (2021-10-17T06:11:15Z)
- Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition [71.30167252138048]
Hybrid and end-to-end (E2E) systems have different error patterns in the speech recognition results.
This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E model.
We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system.
arXiv Detail & Related papers (2021-10-10T20:11:38Z)
- Learning Word-Level Confidence For Subword End-to-End ASR [48.09713798451474]
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR).
The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
arXiv Detail & Related papers (2021-03-11T15:03:33Z)