Weakly-supervised word-level pronunciation error detection in non-native English speech
- URL: http://arxiv.org/abs/2106.03494v1
- Date: Mon, 7 Jun 2021 10:31:53 GMT
- Title: Weakly-supervised word-level pronunciation error detection in non-native English speech
- Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro,
Bozena Kostek
- Abstract summary: We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech.
Phonetically transcribed L2 speech is not required; we only need to mark mispronounced words.
Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors, measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
- Score: 14.430965595136149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a weakly-supervised model for word-level mispronunciation
detection in non-native (L2) English speech. To train this model, phonetically
transcribed L2 speech is not required and we only need to mark mispronounced
words. The lack of phonetic transcriptions for L2 speech means that the model
has to learn only from a weak signal of word-level mispronunciations. Because
of that and due to the limited amount of mispronounced L2 speech, the model is
more likely to overfit. To limit this risk, we train it in a multi-task setup.
In the first task, we estimate the probabilities of word-level
mispronunciation. For the second task, we use a phoneme recognizer trained on
phonetically transcribed L1 speech that is easily accessible and can be
automatically annotated. Compared to state-of-the-art approaches, we improve
the accuracy of detecting word-level pronunciation errors, measured by the AUC
metric, by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on
the Isle Corpus of L2 German and Italian speakers.
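The multi-task setup lends itself to a compact illustration: a shared acoustic encoder feeds a word-level mispronunciation head (trained on weak L2 labels) and a frame-level phoneme head (trained on transcribed L1 speech). The PyTorch sketch below is a minimal stand-in; the layer sizes, mean-pooling, and the 0.5 loss weight are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class MultiTaskMDD(nn.Module):
    """Shared encoder with a weak word-level head and a phoneme head."""

    def __init__(self, n_mels=80, hidden=256, n_phonemes=40):
        super().__init__()
        # Shared acoustic encoder used by both tasks.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True,
                              bidirectional=True)
        # Task 1: one mispronunciation logit per word segment (weak L2 labels).
        self.word_head = nn.Linear(2 * hidden, 1)
        # Task 2: frame-level phoneme logits (trained on transcribed L1 speech).
        self.phone_head = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, feats):
        h, _ = self.encoder(feats)                  # (B, T, 2*hidden)
        word_logit = self.word_head(h.mean(dim=1))  # mean-pool frames per segment
        phone_logits = self.phone_head(h)           # (B, T, n_phonemes)
        return word_logit, phone_logits

model = MultiTaskMDD()
feats = torch.randn(4, 120, 80)               # 4 word segments, 120 frames each
word_logit, phone_logits = model(feats)

word_labels = torch.tensor([0., 1., 0., 0.])  # 1 = word marked as mispronounced
phone_labels = torch.randint(0, 40, (4, 120))

# Joint loss: weak word-level signal plus phoneme cross-entropy; the 0.5
# weight is an illustrative choice.
loss = nn.functional.binary_cross_entropy_with_logits(
    word_logit.squeeze(1), word_labels)
loss = loss + 0.5 * nn.functional.cross_entropy(
    phone_logits.reshape(-1, 40), phone_labels.reshape(-1))
loss.backward()
```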
Related papers
- Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation [1.3024517678456733]
Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1).
This phonemic substitution leads to deviations from the standard phonological patterns of the L2.
We propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer.
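As a toy illustration of this substitution effect (not the paper's actual IPC algorithm), the sketch below maps an L2 phoneme missing from the learner's L1 inventory to the closest L1 phoneme under made-up articulatory coordinates.

```python
# Illustrative only: the coordinates and inventories below are hypothetical.
ARTICULATORY = {                # rough (voicing, place, manner) coordinates
    "TH": (0, 1, 3),            # English /th/, absent from e.g. Polish
    "T":  (0, 2, 0),
    "S":  (0, 3, 3),
    "F":  (0, 0, 3),
}
L1_INVENTORY = ["T", "S", "F"]  # toy L1 inventory without /th/

def nearest_l1(phoneme):
    """Predict which L1 phoneme a learner may substitute for an L2 one."""
    target = ARTICULATORY[phoneme]
    return min(L1_INVENTORY,
               key=lambda p: sum((a - b) ** 2
                                 for a, b in zip(ARTICULATORY[p], target)))

print(nearest_l1("TH"))  # -> "F" under these toy coordinates
```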
arXiv Detail & Related papers (2024-11-17T01:15:58Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition [2.8360662552057323]
This study addresses the efficient incorporation of L2 phonemes (here, Korean phonemes) through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
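The idea of aligning against multiple pronunciation candidates can be sketched by expanding a canonical phoneme sequence with plausible learner-influenced variants; the confusion table below is hypothetical, and the actual paper trains an LF-MMI acoustic model to choose among such alternatives.

```python
from itertools import product

# Hypothetical confusion table: each canonical phoneme maps to itself plus
# plausible learner realizations.
VARIANTS = {
    "R": ["R", "L"],    # /r/-/l/ confusion
    "F": ["F", "P"],    # /f/ -> /p/ substitution
}

def pronunciation_candidates(canonical):
    """Enumerate the candidate pronunciations an aligner may choose among."""
    options = [VARIANTS.get(p, [p]) for p in canonical]
    return [" ".join(seq) for seq in product(*options)]

print(pronunciation_candidates(["F", "R", "IY"]))
# ['F R IY', 'F L IY', 'P R IY', 'P L IY']
```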
arXiv Detail & Related papers (2023-06-05T01:55:33Z)
- On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation [104.85258654917297]
We find that a failure to encode a discriminative target-language signal leads to off-target translation, and that closer lexical distance between language pairs makes it more likely.
We propose Language Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
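C-WLT probes a pretrained LM by asking it to translate one word in context, so that the completion reveals which sense the model picked up. A minimal prompt builder in this spirit (the exact template is an assumption, not quoted from the paper):

```python
def cwlt_prompt(word: str, context: str, target_lang: str = "French") -> str:
    """Build a contextual word-level translation probe for a pretrained LM."""
    return (f'In the sentence "{context}", the word "{word}" '
            f'can be translated into {target_lang} as "')

# Different contexts should elicit different senses of "bank".
print(cwlt_prompt("bank", "She sat on the bank of the river."))
print(cwlt_prompt("bank", "He deposited the check at the bank."))
```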
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need [18.446969150062586]
Existing CAPT methods are not able to detect pronunciation errors with high accuracy.
We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion.
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
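The P2P technique generates training data by corrupting a canonical phoneme sequence, so that error labels come for free. A minimal sketch, with illustrative perturbation rates and phoneme set rather than the paper's actual configuration:

```python
import random

PHONEMES = ["P", "B", "T", "D", "K", "G", "S", "Z", "TH", "DH", "IY", "IH"]

def p2p_perturb(phones, p_sub=0.1, p_del=0.05, seed=0):
    """Corrupt a phoneme sequence; return (perturbed, per-input error flags)."""
    rng = random.Random(seed)
    out, errors = [], []
    for ph in phones:
        r = rng.random()
        if r < p_del:                       # simulate a dropped phoneme
            errors.append(True)
        elif r < p_del + p_sub:             # simulate a substituted phoneme
            out.append(rng.choice([q for q in PHONEMES if q != ph]))
            errors.append(True)
        else:                               # keep the canonical phoneme
            out.append(ph)
            errors.append(False)
    return out, errors

perturbed, errs = p2p_perturb(["DH", "IH", "S"], p_sub=0.5, seed=3)
print(perturbed, errs)  # synthetic mispronunciation plus ground-truth labels
```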
arXiv Detail & Related papers (2022-07-02T08:33:33Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
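The paper's LM operates over sub-word linguistic units; as a much smaller stand-in for its LSTM, the sketch below fits a bigram model over phoneme units from a toy corpus and samples "babbled" sequences from it.

```python
import random
from collections import defaultdict

# Toy corpus of phoneme-unit utterances; real training data would be far larger.
corpus = [["HH", "AH", "L", "OW"], ["HH", "AW", "AA", "R", "Y", "UW"]]

# Count unit-to-unit transitions, with start/end markers.
counts = defaultdict(lambda: defaultdict(int))
for utt in corpus:
    units = ["<s>"] + utt + ["</s>"]
    for a, b in zip(units, units[1:]):
        counts[a][b] += 1

def babble(rng, max_len=10):
    """Sample a phoneme sequence from the bigram LM."""
    seq, cur = [], "<s>"
    while len(seq) < max_len:
        nxt = rng.choices(list(counts[cur]),
                          weights=list(counts[cur].values()))[0]
        if nxt == "</s>":
            break
        seq.append(nxt)
        cur = nxt
    return seq

print(babble(random.Random(1)))  # a short babbled phoneme sequence
```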
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
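The adversarial mapping can be caricatured with a tiny GAN: a generator turns speech representations into phoneme distributions, and a discriminator compares them with distributions derived from unpaired phonemized text. Dimensions, data, and architecture below are stand-ins, not the released wav2vec-U recipe.

```python
import torch
import torch.nn as nn

n_phones, dim = 40, 512

# Generator: speech features -> phoneme distributions.
G = nn.Linear(dim, n_phones)
# Discriminator: real-text vs. generated phoneme-distribution sequences.
D = nn.Sequential(nn.Conv1d(n_phones, 64, 3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(3):  # a few illustrative steps on random stand-in data
    feats = torch.randn(8, 50, dim)                 # unlabeled speech segments
    real_ids = torch.randint(0, n_phones, (8, 50))  # unpaired phonemized text
    real = nn.functional.one_hot(real_ids, n_phones).float()
    fake = torch.softmax(G(feats), dim=-1)

    # Discriminator update: real text -> 1, generated -> 0.
    d_loss = bce(D(real.transpose(1, 2)), torch.ones(8, 1)) + \
             bce(D(fake.detach().transpose(1, 2)), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: fool the discriminator.
    g_loss = bce(D(fake.transpose(1, 2)), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```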
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Experiments of ASR-based mispronunciation detection for children and adult English learners [7.083737676329174]
We develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers.
We present an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora.
arXiv Detail & Related papers (2021-04-13T07:24:05Z)
- Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling [13.451106880540326]
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
We propose a novel approach to overcome this problem based on two principles.
We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18%.
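The recognize-and-compare scheme described above can be made concrete with a phoneme-level edit distance: a word is flagged when the recognized phonemes drift too far from the expected ones. The threshold and example phoneme strings below are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def flag_word(recognized, expected, max_dist=1):
    """Flag a word as mispronounced if recognized phonemes drift too far."""
    return edit_distance(recognized, expected) > max_dist

# "this" expected as DH IH S; a learner may realize it as D IH S.
print(flag_word(["D", "IH", "S"], ["DH", "IH", "S"]))  # False (distance 1)
print(flag_word(["D", "I", "Z"], ["DH", "IH", "S"]))   # True  (distance 3)
```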
arXiv Detail & Related papers (2021-01-16T08:03:51Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)