Phonological Level wav2vec2-based Mispronunciation Detection and
Diagnosis Method
- URL: http://arxiv.org/abs/2311.07037v1
- Date: Mon, 13 Nov 2023 02:41:41 GMT
- Title: Phonological Level wav2vec2-based Mispronunciation Detection and
Diagnosis Method
- Authors: Mostafa Shahin, Julien Epps, Beena Ahmed
- Abstract summary: We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
- Score: 11.069975459609829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in
Computer-Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2)
learning or speech therapy applications. Existing MDD methods relying on analysing
phonemes can only detect categorical errors of phonemes that have an adequate
amount of training data to be modelled. With the unpredictable nature of the
pronunciation errors of non-native or disordered speakers and the scarcity of
training datasets, it is infeasible to model all types of mispronunciations.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system, leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners from different native languages. The proposed speech attribute MDD
method was further compared to traditional phoneme-level MDD and achieved a
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) across all speech attributes.
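Read literally, the abstract describes a shared acoustic encoder feeding several CTC objectives at once. Below is a minimal sketch of that multi-label CTC idea, assuming one CTC head per speech-attribute group (e.g. voicing, manner, place) on top of a shared wav2vec2 encoder; the head layout, attribute inventory, and checkpoint name are illustrative assumptions rather than the authors' exact implementation:

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class AttributeCTCDetector(nn.Module):
        def __init__(self, attribute_vocab_sizes, model_name="facebook/wav2vec2-base"):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size
            # One projection per attribute group; +1 output for the CTC blank.
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, n + 1) for n in attribute_vocab_sizes]
            )
            self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

        def forward(self, waveforms, targets=None, target_lengths=None):
            # waveforms: (batch, samples); frames: (batch, time, hidden).
            frames = self.encoder(waveforms).last_hidden_state
            log_probs = [head(frames).log_softmax(-1) for head in self.heads]
            if targets is None:
                return log_probs
            # Joint objective: sum of per-attribute CTC losses over one encoder,
            # so the non-mutually exclusive attributes share a single model.
            lengths = torch.full((frames.size(0),), frames.size(1), dtype=torch.long)
            return sum(
                self.ctc(lp.transpose(0, 1), tgt, lengths, tl)
                for lp, tgt, tl in zip(log_probs, targets, target_lengths)
            )

The sketch assumes unpadded batches (all frames valid); with padding, per-utterance input lengths would be needed.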
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output, but GER operates on text hypotheses alone and ignores the underlying acoustics.
In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription, through a novel late-fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
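One plausible reading of such token-level late fusion, sketched below: gate between the LLM's and the ASR model's next-token distributions according to the LLM's own uncertainty, so acoustic evidence dominates where the language model is unsure. The entropy-based gate and function names are assumptions for illustration, not the paper's exact UADF formulation:

    import torch

    def uncertainty_aware_fusion(llm_logits, asr_logits):
        # llm_logits, asr_logits: (vocab,) next-token scores from each model.
        llm_p = torch.softmax(llm_logits, dim=-1)
        asr_p = torch.softmax(asr_logits, dim=-1)
        # Normalized entropy in [0, 1]: high when the LLM is uncertain.
        ent = -(llm_p * torch.log(llm_p + 1e-9)).sum()
        w = ent / torch.log(torch.tensor(float(llm_p.numel())))
        # Lean on acoustic (ASR) evidence exactly where the LLM is unsure.
        return (1 - w) * llm_p + w * asr_p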
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss Function for Automatic Pronunciation Assessment [10.844822448167937]
We propose a phonemic contrast ordinal (PCO) loss for training regression-based APA models.
Specifically, we introduce a phoneme-distinct regularizer into the MSE loss, which encourages feature representations of different phoneme categories to be far apart.
An extensive set of experiments carried out on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our model.
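A hypothetical rendering of that loss: the usual MSE regression term plus a margin hinge that pushes apart the mean features of different phoneme categories. The margin, weight, and batch-mean construction are illustrative guesses, not the published PCO definition:

    import torch

    def pco_style_loss(scores, labels, feats, phoneme_ids, margin=1.0, alpha=0.1):
        # scores, labels: (N,) predicted vs. human pronunciation ratings.
        mse = torch.mean((scores - labels) ** 2)
        # Mean feature vector per phoneme category present in the batch.
        cats = phoneme_ids.unique()
        means = torch.stack([feats[phoneme_ids == c].mean(dim=0) for c in cats])
        # Penalize pairs of categories whose means sit closer than the margin.
        dist = torch.cdist(means, means)
        off_diag = ~torch.eye(len(cats), dtype=torch.bool)
        reg = torch.clamp(margin - dist[off_diag], min=0).mean()
        return mse + alpha * reg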
arXiv Detail & Related papers (2023-10-03T07:05:37Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
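A hedged sketch of the kind of N-best correction prompt such a benchmark enables; the wording is purely illustrative, not the HyPoradise protocol:

    def build_correction_prompt(nbest):
        # nbest: list of ASR hypothesis strings, best-first.
        hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        return (
            "Below are N-best hypotheses from a speech recognizer.\n"
            f"{hyps}\n"
            "Output the most likely true transcription, fixing recognition errors:"
        )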
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech-phoneme pairs, achieving minimally supervised text-to-speech (TTS), voice conversion (VC), and ASR.
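A minimal InfoNCE-style sketch of pulling paired speech and phoneme embeddings together in one space, in the spirit of CTAP; the symmetric loss and temperature are generic contrastive-learning choices rather than the paper's exact recipe:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
        # Row i of each (batch, dim) matrix is a matched speech-phoneme pair.
        s = F.normalize(speech_emb, dim=-1)
        p = F.normalize(phoneme_emb, dim=-1)
        logits = s @ p.t() / temperature       # (batch, batch) similarities
        labels = torch.arange(s.size(0))       # matched pairs on the diagonal
        # Symmetric InfoNCE: speech-to-phoneme and phoneme-to-speech.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))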
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment [28.76055994423364]
Current mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition.
One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech.
We leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models.
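A rough sketch of the momentum pseudo-labeling pattern, assuming an exponential-moving-average (EMA) teacher that transcribes unlabeled L2 speech for the student to fit; the decay value, shapes, and loss are illustrative:

    import torch

    @torch.no_grad()
    def ema_update(teacher, student, decay=0.999):
        # Teacher weights drift slowly toward the student's (momentum update).
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

    def pseudo_label_step(student, teacher, unlabeled_batch, loss_fn, optimizer):
        with torch.no_grad():
            pseudo = teacher(unlabeled_batch).argmax(dim=-1)  # teacher's guesses
        loss = loss_fn(student(unlabeled_batch), pseudo)      # student fits them
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_update(teacher, student)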
arXiv Detail & Related papers (2022-03-29T22:40:31Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
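A compact sketch of an LSTM language model over sub-word linguistic units (here, phoneme or syllable IDs); all sizes are illustrative:

    import torch.nn as nn

    class PhonemeLM(nn.Module):
        # Next-unit prediction over a small phoneme/syllable vocabulary.
        def __init__(self, vocab_size=64, emb=128, hidden=256, layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.lstm = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, unit_ids):
            # unit_ids: (batch, seq) indices; returns next-unit logits per step.
            h, _ = self.lstm(self.embed(unit_ids))
            return self.out(h)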
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with huge amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
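Joint use of the three streams could be as simple as the concatenation sketch below; fusion-by-concatenation and the linear classifier are assumptions, not necessarily the paper's architecture:

    import torch
    import torch.nn as nn

    class APLFusion(nn.Module):
        def __init__(self, d_acoustic, d_phonetic, d_linguistic, n_phones):
            super().__init__()
            # Concatenate the three embeddings, then classify per segment.
            self.head = nn.Linear(d_acoustic + d_phonetic + d_linguistic, n_phones)

        def forward(self, acoustic, phonetic, linguistic):
            return self.head(torch.cat([acoustic, phonetic, linguistic], dim=-1))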
arXiv Detail & Related papers (2021-10-14T11:25:02Z)
- Multi-Modal Detection of Alzheimer's Disease from Speech and Text [3.702631194466718]
We propose a deep learning method that utilizes speech and the corresponding transcript simultaneously to detect Alzheimer's disease (AD).
The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the Dementiabank Pitt corpus.
arXiv Detail & Related papers (2020-11-30T21:18:17Z)
- An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling [11.894724235336872]
Mispronunciation detection and diagnosis (MDD) is a core component of computer-assisted pronunciation training (CAPT).
We propose to conduct MDD with a novel end-to-end automatic speech recognition (E2E-based ASR) approach.
In particular, we expand the original L2 phone set with a corresponding anti-phone set, aiming to provide better mispronunciation detection and diagnosis feedback.
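The phone-set expansion can be pictured with the small helper below, pairing each L2 phone with an anti-phone label intended to absorb its mispronounced realizations; the naming scheme is illustrative:

    def expand_with_anti_phones(phones):
        # Each phone p gains a counterpart anti-p for mispronounced variants.
        return list(phones) + [f"anti-{p}" for p in phones]

    # expand_with_anti_phones(["AA", "IY", "S"])
    # -> ["AA", "IY", "S", "anti-AA", "anti-IY", "anti-S"]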
arXiv Detail & Related papers (2020-05-25T07:27:47Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
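A skeletal multitask sketch: one shared text encoder over the ASR hypothesis with two heads, one emitting corrected tokens and one emitting the LU (intent) label; the GRU encoder and head shapes are assumptions:

    import torch.nn as nn

    class JointCorrectionLU(nn.Module):
        def __init__(self, vocab_size, n_intents, d_model=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.encoder = nn.GRU(d_model, d_model, batch_first=True)
            self.correct = nn.Linear(d_model, vocab_size)  # corrected token per slot
            self.intent = nn.Linear(d_model, n_intents)    # utterance-level LU label

        def forward(self, asr_tokens):
            h, _ = self.encoder(self.embed(asr_tokens))
            return self.correct(h), self.intent(h[:, -1])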
arXiv Detail & Related papers (2020-01-28T22:09:25Z)