Experiments of ASR-based mispronunciation detection for children and adult English learners
- URL: http://arxiv.org/abs/2104.05980v1
- Date: Tue, 13 Apr 2021 07:24:05 GMT
- Title: Experiments of ASR-based mispronunciation detection for children and adult English learners
- Authors: Nina Hosseini-Kivanani, Roberto Gretter, Marco Matassoni, and Giuseppe Daniele Falavigna
- Abstract summary: We develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers.
We present an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora.
- Score: 7.083737676329174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pronunciation is one of the fundamentals of language learning, and it is
considered a primary factor of spoken language when it comes to understanding and
being understood by others. The persistently high error rates that
mispronunciations cause in speech recognition motivate us to find alternative
techniques for handling them. In this study, we develop a mispronunciation
assessment system that checks the pronunciation of non-native English speakers,
identifies the phonemes commonly mispronounced by Italian learners of English,
and presents an evaluation of the non-native pronunciation observed in
phonetically annotated speech corpora. To detect mispronunciations, we used a
phone-based ASR implemented with Kaldi. We used two labeled non-native English
corpora: (i) a corpus of Italian adults containing 5,867 utterances from 46
speakers, and (ii) a corpus of Italian children consisting of 5,268 utterances
from 78 children. Our results show that the selected error model can
discriminate correct sounds from incorrect sounds in both native and non-native
speech, and can therefore be used to detect pronunciation errors in non-native
speech. Phone error rates improve when the error language model is used, and
the ASR system achieves better accuracy after applying the error model to the
selected corpora.
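The core detection step, aligning the phone sequence the ASR recognizes against the canonical dictionary pronunciation and flagging mismatches, can be sketched in a few lines. This is an illustrative sketch only, not the paper's Kaldi pipeline; the example ARPAbet phone sequences and the TH/T and NG/N substitutions are hypothetical.

```python
# Hedged sketch: align recognized phones against the canonical (dictionary)
# phone sequence with Levenshtein alignment, flag substitutions and deletions
# as candidate mispronunciations, and compute the phone error rate (PER).

def align_phones(canonical, recognized):
    """Return (per, errors); errors pairs each canonical phone with what
    was recognized instead (None marks a deletion or insertion)."""
    n, m = len(canonical), len(recognized)
    # dp[i][j] = edit distance between canonical[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Backtrack to collect the mismatched phone pairs.
    errors, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and canonical[i - 1] == recognized[j - 1]):
            i, j = i - 1, j - 1                      # correct phone
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            errors.append((canonical[i - 1], recognized[j - 1]))  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors.append((canonical[i - 1], None))  # deletion
            i -= 1
        else:
            errors.append((None, recognized[j - 1])) # insertion
            j -= 1
    per = dp[n][m] / max(n, 1)
    return per, list(reversed(errors))

# Hypothetical example: canonical /TH IH NG K/ recognized as /T IH N K/.
per, errs = align_phones(["TH", "IH", "NG", "K"], ["T", "IH", "N", "K"])
# per == 0.5; errs == [("TH", "T"), ("NG", "N")]
```

A real system would take the recognized phones from the decoder output rather than a hand-written list, and aggregate the per-phone mismatches over a corpus to find the commonly mispronounced phonemes.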
Related papers
- Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS [52.89324095217975]
Previous approaches on accent conversion mainly aimed at making non-native speech sound more native.
We develop a new AC approach that not only performs accent conversion but also improves the pronunciation of non-native accented speakers.
arXiv Detail & Related papers (2024-10-19T06:12:31Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children [4.840474991678558]
This study presents a model of automatic speech recognition designed to diagnose pronunciation issues in children with speech sound disorders.
The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy.
arXiv Detail & Related papers (2024-03-13T02:20:05Z)
- DDSupport: Language Learning Support System that Displays Differences and Distances from Model Speech [16.82591185507251]
We propose a new language learning support system that calculates speech scores and detects mispronunciations by beginners.
The proposed system uses deep learning-based speech processing to display the pronunciation score of the learner's speech and the difference/distance between the learner's pronunciation and that of a group of models.
arXiv Detail & Related papers (2022-12-08T05:49:15Z)
- Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need [18.446969150062586]
Existing CAPT methods are not able to detect pronunciation errors with high accuracy.
We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion.
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
arXiv Detail & Related papers (2022-07-02T08:33:33Z)
- Weakly-supervised word-level pronunciation error detection in non-native English speech [14.430965595136149]
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech.
Phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.
Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in the AUC metric by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
arXiv Detail & Related papers (2021-06-07T10:31:53Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling [13.451106880540326]
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
We propose a novel approach to overcome this problem based on two principles.
We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18%.
arXiv Detail & Related papers (2021-01-16T08:03:51Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.