Weakly-supervised word-level pronunciation error detection in non-native
English speech
- URL: http://arxiv.org/abs/2106.03494v1
- Date: Mon, 7 Jun 2021 10:31:53 GMT
- Title: Weakly-supervised word-level pronunciation error detection in non-native
English speech
- Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro,
Bozena Kostek
- Abstract summary: A weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech.
Phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.
Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors, measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
- Score: 14.430965595136149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a weakly-supervised model for word-level mispronunciation
detection in non-native (L2) English speech. Training this model does not require
phonetically transcribed L2 speech; we only need to mark mispronounced words. The
lack of phonetic transcriptions for L2 speech means that the model
has to learn only from a weak signal of word-level mispronunciations. Because
of that and due to the limited amount of mispronounced L2 speech, the model is
more likely to overfit. To limit this risk, we train it in a multi-task setup.
In the first task, we estimate the probabilities of word-level
mispronunciation. For the second task, we use a phoneme recognizer trained on
phonetically transcribed L1 speech that is easily accessible and can be
automatically annotated. Compared to state-of-the-art approaches, we improve the
accuracy of detecting word-level pronunciation errors, measured by AUC, by 30%
on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus
of L2 German and Italian speakers.
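The abstract describes a shared model trained on two tasks: a word-level head that estimates mispronunciation probabilities from weak word labels, and a phoneme-recognition head supervised by phonetically transcribed L1 speech. The following is a minimal PyTorch-style sketch of that multi-task idea; the module choices, pooling scheme, and dimensions are illustrative assumptions, not the authors' published architecture.

```python
# Illustrative sketch only: the encoder, pooling, and head definitions are
# assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn

class WeaklySupervisedMDD(nn.Module):
    def __init__(self, n_mels: int = 80, n_phonemes: int = 40, hidden: int = 256):
        super().__init__()
        # Shared acoustic encoder over mel-spectrogram frames.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        # Task 1: word-level mispronunciation probability (weak word labels, L2 data).
        self.word_head = nn.Linear(2 * hidden, 1)
        # Task 2: frame-level phoneme recognition (transcribed L1 data), e.g. via CTC.
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for the CTC blank

    def forward(self, mels: torch.Tensor, word_frame_mask: torch.Tensor):
        # mels: (batch, frames, n_mels)
        # word_frame_mask: (batch, words, frames), 1.0 where a frame belongs to a word.
        frame_feats, _ = self.encoder(mels)                  # (batch, frames, 2*hidden)
        # Average-pool frames into per-word features, then score each word.
        denom = word_frame_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)
        word_feats = word_frame_mask @ frame_feats / denom   # (batch, words, 2*hidden)
        p_mispronounced = torch.sigmoid(self.word_head(word_feats)).squeeze(-1)
        phoneme_logits = self.phoneme_head(frame_feats)      # feeds a CTC loss on L1 data
        return p_mispronounced, phoneme_logits
```

In such a setup the two heads would be optimized jointly, for example a binary cross-entropy loss on the weak word-level labels plus a CTC loss on the L1 phoneme transcriptions, so the shared encoder benefits from the richer L1 supervision and is less prone to overfitting the small amount of mispronounced L2 speech.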
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- BiPhone: Modeling Inter Language Phonetic Influences in Text [12.405907573933378]
Due to technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy.
Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1).
We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2.
These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text.
arXiv Detail & Related papers (2023-07-06T22:31:55Z)
- Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition [2.8360662552057323]
This study focuses on the efficient incorporation of L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech while training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z)
- On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation [104.85258654917297]
We find that failing to encode a discriminative target-language signal leads to off-target translation and a closer lexical distance.
We propose Language Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need [18.446969150062586]
Existing CAPT methods are not able to detect pronunciation errors with high accuracy.
We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion.
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
arXiv Detail & Related papers (2022-07-02T08:33:33Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Experiments of ASR-based mispronunciation detection for children and adult English learners [7.083737676329174]
We develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers.
We present an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora.
arXiv Detail & Related papers (2021-04-13T07:24:05Z)
- Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling [13.451106880540326]
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker (a minimal sketch of this comparison appears after this list).
We propose a novel approach to overcome this problem based on two principles.
We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18%.
arXiv Detail & Related papers (2021-01-16T08:03:51Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
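As background for the phoneme-comparison baseline mentioned in the "Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling" entry, the sketch below aligns recognized phonemes against a canonical pronunciation and flags mismatches as candidate errors. The function name, alignment policy, and phoneme symbols are hypothetical illustrations, not that paper's method.

```python
# Hypothetical illustration of a comparison-based mispronunciation check:
# align recognized phonemes with the canonical pronunciation and report
# substitutions, deletions, and insertions.
from difflib import SequenceMatcher

def detect_mispronunciations(recognized: list[str], canonical: list[str]) -> list[tuple]:
    """Return (op, canonical_span, recognized_span) for every non-matching region."""
    errors = []
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # 'replace', 'delete', or 'insert'
            errors.append((op, canonical[i1:i2], recognized[j1:j2]))
    return errors

# Example: a learner says /t/ instead of /th/ in "think" and drops the final /k/.
print(detect_mispronunciations(["t", "ih", "ng"], ["th", "ih", "ng", "k"]))
# [('replace', ['th'], ['t']), ('delete', ['k'], [])]
```

In practice such a comparison inherits any errors made by the phoneme recognizer, which is what motivates the alternative formulations surveyed above, including the weakly-supervised word-level model of the main paper.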