Mispronunciation Detection in Non-native (L2) English with Uncertainty
Modeling
- URL: http://arxiv.org/abs/2101.06396v2
- Date: Mon, 8 Feb 2021 20:16:47 GMT
- Title: Mispronunciation Detection in Non-native (L2) English with Uncertainty
Modeling
- Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira
Calamaro, Thomas Drugman, Bozena Kostek
- Abstract summary: A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
We propose a novel approach to overcome this problem based on two principles.
We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18%.
- Score: 13.451106880540326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common approach to the automatic detection of mispronunciation in language
learning is to recognize the phonemes produced by a student and compare them to
the expected pronunciation of a native speaker. This approach makes two
simplifying assumptions: a) phonemes can be recognized from speech with high
accuracy, b) there is a single correct way for a sentence to be pronounced.
These assumptions do not always hold, which can result in a significant amount
of false mispronunciation alarms. We propose a novel approach to overcome this
problem based on two principles: a) taking into account uncertainty in the
automatic phoneme recognition step, b) accounting for the fact that there may
be multiple valid pronunciations. We evaluate the model on non-native (L2)
English speech of German, Italian and Polish speakers, where it is shown to
increase the precision of detecting mispronunciations by up to 18% (relative)
compared to the common approach.
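The two principles in the abstract can be made concrete with a toy scoring rule: instead of hard-decoding one phoneme sequence and comparing it to a single canonical pronunciation, keep the recognizer's phoneme posteriors and take the best match over a set of valid pronunciations. The sketch below is illustrative only, not the authors' implementation; the function name, the aligned equal-length representation, and the independence assumption across segments are all simplifying assumptions for the example.

```python
# Toy sketch (not the paper's model): mispronunciation scoring that
# (a) keeps phoneme-recognizer uncertainty via posteriors, and
# (b) allows multiple valid pronunciations of the same word.

def mispronunciation_score(posteriors, valid_pronunciations):
    """posteriors: one dict of {phoneme: probability} per spoken segment.
    valid_pronunciations: list of phoneme sequences considered correct.
    Returns P(mispronounced) = 1 - max over valid pronunciations of the
    (naively independent) product of per-segment posteriors."""
    best = 0.0
    for pron in valid_pronunciations:
        if len(pron) != len(posteriors):
            continue  # the toy sketch assumes pre-aligned, equal-length sequences
        p = 1.0
        for segment, phone in zip(posteriors, pron):
            p *= segment.get(phone, 0.0)  # posterior mass on the expected phoneme
        best = max(best, p)
    return 1.0 - best

# A learner who says a valid alternative variant ("eh t" instead of "ae t")
# is not penalized, because the max runs over all valid pronunciations.
posteriors = [{"ae": 0.7, "eh": 0.3}, {"t": 0.9, "d": 0.1}]
valid = [["ae", "t"], ["eh", "t"]]
score = mispronunciation_score(posteriors, valid)
```

Thresholding such a score yields a mispronunciation alarm; the max over pronunciations is what suppresses false alarms when a student produces an alternative but still valid variant.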
Related papers
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study
on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- Incorporating L2 Phonemes Using Articulatory Features for Robust Speech
Recognition [2.8360662552057323]
This study is on the efficient incorporation of the L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z)
- Cross-Lingual Speaker Identification Using Distant Supervision [84.51121411280134]
We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy, and by 5% when trained with only distant supervision.
arXiv Detail & Related papers (2022-10-11T20:49:44Z)
- Computer-assisted Pronunciation Training -- Speech synthesis is almost
all you need [18.446969150062586]
Existing CAPT methods are not able to detect pronunciation errors with high accuracy.
We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion.
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
arXiv Detail & Related papers (2022-07-02T08:33:33Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Weakly-supervised word-level pronunciation error detection in non-native
English speech [14.430965595136149]
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech.
Phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.
Compared to state-of-the-art approaches, we improve the detection of word-level pronunciation errors, measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
arXiv Detail & Related papers (2021-06-07T10:31:53Z)
- Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in
German Speech Recognition [1.3381749415517017]
Anglicisms are a challenge in German speech recognition due to irregular pronunciation compared to native German words.
We propose a multitask sequence-to-sequence approach for grapheme-to-phoneme conversion to improve the phonetization of Anglicisms.
We show that multitask learning can help solve the challenge of loanwords in German speech recognition.
arXiv Detail & Related papers (2021-05-26T17:42:13Z)
- Experiments of ASR-based mispronunciation detection for children and
adult English learners [7.083737676329174]
We develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers.
We present an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora.
arXiv Detail & Related papers (2021-04-13T07:24:05Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.