L1-aware Multilingual Mispronunciation Detection Framework
- URL: http://arxiv.org/abs/2309.07719v2
- Date: Thu, 21 Sep 2023 15:26:33 GMT
- Title: L1-aware Multilingual Mispronunciation Detection Framework
- Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
- Abstract summary: This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation.
An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence.
Experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets.
- Score: 10.15106073866792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The phonological discrepancies between a speaker's native language (L1) and the
non-native language (L2) serve as a major source of mispronunciation. This
paper introduces a novel multilingual mispronunciation detection (MDD) architecture, L1-MultiMDD, enriched
with L1-aware speech representation. An end-to-end speech encoder is trained on
the input signal and its corresponding reference phoneme sequence. First, an
attention mechanism is deployed to align the input audio with the reference
phoneme sequence. Afterwards, L1-L2 speech embeddings are extracted from an
auxiliary model, pretrained in a multi-task setup to identify the L1 and L2
languages, and are infused into the primary network. Finally, L1-MultiMDD is
optimized for a unified multilingual phoneme recognition task using
connectionist temporal classification (CTC) loss for the target languages:
English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of
the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and
AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets. The consistent
gains in phoneme error rate (PER) and false rejection rate (FRR) across all target languages
confirm our approach's robustness, efficacy, and generalizability.
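As a concrete picture of the pipeline above, here is a minimal PyTorch sketch of the three stages: attention alignment of audio with the reference phonemes, infusion of the auxiliary L1-L2 embedding, and CTC optimization. All module choices, dimensions, and names such as `L1MultiMDDSketch` are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class L1MultiMDDSketch(nn.Module):
    def __init__(self, n_phonemes=100, d=256):
        super().__init__()
        self.speech_enc = nn.GRU(80, d, batch_first=True)   # stand-in for the end-to-end speech encoder
        self.phone_emb = nn.Embedding(n_phonemes, d)        # reference phoneme sequence embedding
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.l1_proj = nn.Linear(64, d)                     # projects the auxiliary L1-L2 embedding
        self.head = nn.Linear(d, n_phonemes + 1)            # +1 output for the CTC blank

    def forward(self, feats, ref_phones, l1_emb):
        h, _ = self.speech_enc(feats)                       # (B, T, d) frame encodings
        ref = self.phone_emb(ref_phones)                    # (B, L, d)
        aligned, _ = self.attn(h, ref, ref)                 # align audio with reference phonemes
        h = aligned + self.l1_proj(l1_emb).unsqueeze(1)     # infuse the L1-aware embedding
        return self.head(h).log_softmax(dim=-1)             # log-probs for CTC

B, T, L = 2, 50, 12
model = L1MultiMDDSketch()
log_probs = model(torch.randn(B, T, 80),            # acoustic features
                  torch.randint(0, 100, (B, L)),    # reference phoneme ids
                  torch.randn(B, 64))               # embedding from the auxiliary L1/L2 model
loss = nn.CTCLoss(blank=100)(log_probs.transpose(0, 1),   # CTC expects (T, B, C)
                             torch.randint(0, 100, (B, L)),
                             torch.full((B,), T),
                             torch.full((B,), L))
```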
Related papers
- Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision [16.992058149317753]
This paper explores the approach of pre-training with weak phonetic supervision towards data-efficient multilingual and crosslingual automatic speech recognition (MCL-ASR).
We relax the requirement of gold-standard human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models.
Experiments demonstrate the advantages of phoneme-based models for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency.
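As a rough illustration of this weak-supervision recipe, the sketch below derives IPA-style phoneme targets from a G2P lookup instead of human-validated transcripts. `languagenet_g2p` and its toy table are hypothetical stand-ins for the real LanguageNet G2P models.

```python
def languagenet_g2p(word, lang):
    """Placeholder: a real setup applies the trained G2P model for `lang`."""
    table = {("hello", "eng"): ["h", "e", "l", "o"]}   # toy lookup, not real G2P output
    return table.get((word, lang), list(word))

def weak_ipa_targets(transcript, lang):
    """Turn a word transcript into weak IPA-style phoneme labels for CTC training."""
    phones = []
    for word in transcript.lower().split():
        phones.extend(languagenet_g2p(word, lang))
    return phones

print(weak_ipa_targets("hello world", "eng"))
```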
arXiv Detail & Related papers (2024-06-04T09:56:05Z) - BiPhone: Modeling Inter Language Phonetic Influences in Text [12.405907573933378]
Owing to technology asymmetries, a large number of people are forced to use the Web in a language in which they have low literacy.
Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1).
We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2.
These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text.
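A minimal sketch of the mining-and-corruption idea: given phoneme confusion pairs for an L1-L2 combination, an L2 phoneme sequence is stochastically corrupted to synthesize errorful text. The confusion table below is toy data, not the paper's mined confusions.

```python
import random

# Toy confusion table: sounds in L2 that an L1 speaker is likely to conflate.
CONFUSIONS = {"v": ["w"], "z": ["s"]}

def corrupt_phones(phones, p=0.5, rng=random.Random(0)):
    """Stochastically swap confusable phonemes to synthesize corrupted L2 output."""
    return [rng.choice(CONFUSIONS[ph]) if ph in CONFUSIONS and rng.random() < p else ph
            for ph in phones]

print(corrupt_phones(["v", "e", "r", "z"]))   # e.g. ['w', 'e', 'r', 's']
```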
arXiv Detail & Related papers (2023-07-06T22:31:55Z) - Incorporating L2 Phonemes Using Articulatory Features for Robust Speech
Recognition [2.8360662552057323]
This study addresses the efficient incorporation of L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner to train the acoustic model to align with and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
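As a loose illustration of articulatory-feature-driven candidates: an L2 phoneme is mapped to the L1 phonemes sharing the most articulatory features, giving the acoustic model multiple candidates to align to. The paper does this inside an LF-MMI framework; the feature table below is a toy assumption.

```python
# Toy articulatory feature sets (illustrative values, not a real inventory).
FEATURES = {
    "p":  {"bilabial", "stop", "voiceless"},
    "b":  {"bilabial", "stop", "voiced"},
    "pʰ": {"bilabial", "stop", "voiceless", "aspirated"},
}

def candidates(l2_phone, l1_inventory, k=2):
    """Rank L1 phonemes by articulatory feature overlap with an L2 phoneme."""
    target = FEATURES[l2_phone]
    ranked = sorted(l1_inventory,
                    key=lambda p: len(FEATURES[p] & target), reverse=True)
    return ranked[:k]

print(candidates("pʰ", ["p", "b"]))   # pronunciation candidates for alignment
```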
arXiv Detail & Related papers (2023-06-05T01:55:33Z) - LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both monolingual and multilingual ASR by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
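A minimal sketch of a language-aware encoder in this spirit: a shared encoder feeds two language-specific branches, and a frame-level language classifier mixes them. The layer sizes and gating rule are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LAESketch(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.shared = nn.GRU(80, d, batch_first=True)   # shared acoustic encoder
        self.mandarin = nn.Linear(d, d)                 # language-specific branches
        self.english = nn.Linear(d, d)
        self.lang_clf = nn.Linear(d, 2)                 # frame-level language posterior

    def forward(self, feats):
        h, _ = self.shared(feats)                       # (B, T, d)
        w = self.lang_clf(h).softmax(dim=-1)            # (B, T, 2) per-frame language weights
        # mix the two language-specific views frame by frame
        return w[..., :1] * self.mandarin(h) + w[..., 1:] * self.english(h)

out = LAESketch()(torch.randn(2, 50, 80))               # (2, 50, 256)
```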
arXiv Detail & Related papers (2022-06-05T04:03:12Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech
Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - Unsupervised Acoustic Unit Discovery by Leveraging a
Language-Independent Subword Discriminative Feature Representation [31.87235700253597]
This paper tackles acoustic unit discovery (AUD): automatically discovering phone-like units from unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
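The two-stage recipe can be pictured as below: stage one would learn a subword-discriminative representation (replaced here by random features for brevity), and stage two clusters frames so that cluster ids act as discovered phone-like units. The feature dimension and cluster count are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 64)     # stand-in for learned subword-discriminative features
units = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(frames)
print(units[:20])                      # frame-level ids of discovered acoustic units
```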
arXiv Detail & Related papers (2021-04-02T11:43:07Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics-based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
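A rough sketch of the pipeline this implies: utterances are first decoded into discovered phonetic units, and a classifier is then trained on unit counts. The unit strings, intent labels, and bag-of-units classifier below are toy assumptions, not the paper's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Each string is a toy sequence of discovered phonetic unit ids for one utterance.
utts = ["u3 u7 u7 u12", "u3 u7 u12 u12", "u9 u1 u4", "u9 u4 u4"]
labels = ["play_music", "play_music", "set_alarm", "set_alarm"]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(utts)  # bag of acoustic units
clf = LogisticRegression().fit(X, labels)                      # intent classifier
```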
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well when trained on IPA transcriptions of the languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a phonotactic model can hurt zero-shot transfer.
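The strength finding can be pictured as a log-linear combination of separate AM and phonotactic LM scores, where a weight controls how strongly phonotactics is imposed. This scoring rule and the toy model scores are assumptions for illustration, not the paper's exact decoder.

```python
import math

def score(phone_seq, am_logp, lm_logp, lam=0.3):
    """Combine acoustic and phonotactic scores; large lam imposes phonotactics strongly."""
    return am_logp(phone_seq) + lam * lm_logp(phone_seq)

# Toy stand-ins for real model scores.
am = lambda seq: -1.0 * len(seq)
lm = lambda seq: sum(math.log(0.5) for _ in seq)
print(score(["a", "t", "a"], am, lm, lam=0.3))
```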
arXiv Detail & Related papers (2020-10-22T23:07:24Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
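A hedged sketch of alignment objectives at two granularities: a sentence-level loss pulls pooled embeddings of a translation pair together, and a word-level loss aligns token embeddings through soft attention. AMBER's actual objectives differ in detail; the tensors here are random stand-ins for encoder outputs on parallel data.

```python
import torch
import torch.nn.functional as F

src = torch.randn(8, 12, 256)   # token embeddings, source sentences
tgt = torch.randn(8, 10, 256)   # token embeddings, parallel translations

# Sentence-level alignment: pooled representations of a pair should match.
sent_loss = 1 - F.cosine_similarity(src.mean(1), tgt.mean(1)).mean()

# Word-level alignment: soft attention maps each source token onto the target.
attn = torch.softmax(src @ tgt.transpose(1, 2) / 16, dim=-1)   # (8, 12, 10)
aligned = attn @ tgt                                           # (8, 12, 256)
word_loss = F.mse_loss(aligned, src)

loss = sent_loss + word_loss
```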
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
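A minimal sketch of the joint pre-training idea: separately built speech and language modules are aligned on a small paired set so they share a representation space. Both modules and the alignment loss below are illustrative stand-ins, not SPLAT's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_mod = nn.GRU(80, 256, batch_first=True)   # stand-in speech module
text_mod = nn.Embedding(1000, 256)               # stand-in language module

audio = torch.randn(4, 100, 80)                  # paired utterances
tokens = torch.randint(0, 1000, (4, 20))         # their transcripts

s, _ = speech_mod(audio)
t = text_mod(tokens)
# Align utterance-level representations of the two modules on paired data.
align_loss = F.mse_loss(s.mean(1), t.mean(1))
```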
arXiv Detail & Related papers (2020-10-05T19:29:49Z)