An Approach to Mispronunciation Detection and Diagnosis with Acoustic,
Phonetic and Linguistic (APL) Embeddings
- URL: http://arxiv.org/abs/2110.07274v1
- Date: Thu, 14 Oct 2021 11:25:02 GMT
- Title: An Approach to Mispronunciation Detection and Diagnosis with Acoustic,
Phonetic and Linguistic (APL) Embeddings
- Authors: Wenxuan Ye, Shaoguang Mao, Frank Soong, Wenshan Wu, Yan Xia, Jonathan
Tien, Zhiyong Wu
- Abstract summary: Phonetic embeddings, extracted from ASR models trained with large amounts of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
- Score: 18.282632348274756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many mispronunciation detection and diagnosis (MD&D) research
approaches try to exploit both acoustic and linguistic features as input. Yet
the performance improvement is limited, partially due to the shortage of large
amounts of annotated training data at the phoneme level. Phonetic embeddings,
extracted from ASR models trained with large amounts of word-level
annotations, can serve as a good representation of the content of input
speech, in a noise-robust and speaker-independent manner. These embeddings,
when used as implicit phonetic supplementary information, can alleviate the
shortage of explicit phoneme annotations. We propose to utilize Acoustic,
Phonetic and Linguistic (APL) embedding features jointly to build a more
powerful MD&D system. Experimental results obtained on the L2-ARCTIC database
show that the proposed approach outperforms the baseline by 9.93%, 10.13% and
6.17% in detection accuracy, diagnosis error rate and F-measure, respectively.
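As a rough illustration of the proposed fusion, the sketch below concatenates
the three embedding streams frame by frame and classifies each frame into a
phoneme inventory; module names, dimensions and the toy usage are illustrative
assumptions, not the authors' implementation.

    # Minimal sketch of APL feature fusion for MD&D (illustrative only).
    import torch
    import torch.nn as nn

    class APLFusionMDD(nn.Module):
        def __init__(self, acoustic_dim=80, phonetic_dim=256,
                     linguistic_dim=128, hidden_dim=256, num_phonemes=40):
            super().__init__()
            # Frame-level concatenation of the three embedding streams,
            # followed by a sequence encoder and a per-frame phoneme
            # classifier; predictions that disagree with the canonical
            # phoneme sequence indicate mispronunciations.
            self.encoder = nn.LSTM(
                acoustic_dim + phonetic_dim + linguistic_dim,
                hidden_dim, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_dim, num_phonemes)

        def forward(self, acoustic, phonetic, linguistic):
            apl = torch.cat([acoustic, phonetic, linguistic], dim=-1)
            hidden, _ = self.encoder(apl)
            return self.classifier(hidden)  # (batch, frames, num_phonemes)

    # Toy usage: 2 utterances, 100 time-aligned frames per stream.
    model = APLFusionMDD()
    logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 256),
                   torch.randn(2, 100, 128))
    print(logits.shape)  # torch.Size([2, 100, 40])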
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality than traditional metrics do (a toy judge prompt is sketched below).
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
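A hypothetical sketch of the LLM-as-judge idea; the prompt wording, scale and
output format are assumptions, not CLAIR-A's actual template.

    # Build a caption-judging prompt for a zero-shot LLM (illustrative).
    def build_caption_judge_prompt(candidate: str, reference: str) -> str:
        return (
            "Rate how well a machine-generated audio caption matches a "
            "human reference.\n"
            f"Reference: {reference}\n"
            f"Candidate: {candidate}\n"
            'Answer with JSON: {"score": <0-100 int>, "reason": "<short text>"}'
        )

    print(build_caption_judge_prompt("a dog barks twice",
                                     "two sharp dog barks, then silence"))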
- Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features [0.353122873734926]
Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity.
Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs).
Several deepfake detection algorithms have been shown to improve when the traditional audio features are augmented with EDLFs.
arXiv Detail & Related papers (2024-09-09T19:47:57Z)
- Large Language Models for Dysfluency Detection in Stuttered Speech [16.812800649507302]
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components.
Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, we approach the task of multi-label dysfluency detection as a language modeling problem.
We present hypothesis candidates generated with an automatic speech recognition system, together with acoustic representations extracted from an audio encoder model, to an LLM, and fine-tune the system to predict dysfluency labels on three datasets containing English and German stuttered speech (a rough sketch of this fusion follows below).
arXiv Detail & Related papers (2024-06-16T17:51:22Z)
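A purely illustrative sketch of this kind of fusion (module names and
dimensions are assumptions, not the paper's code): projected audio-encoder
features are prepended to the hypothesis-token embeddings as an LLM prefix.

    import torch
    import torch.nn as nn

    class AudioToLLMPrefix(nn.Module):
        """Project acoustic representations into the LLM embedding space."""
        def __init__(self, audio_dim=512, llm_dim=1024):
            super().__init__()
            self.proj = nn.Linear(audio_dim, llm_dim)

        def forward(self, audio_feats, text_embeds):
            # Prepend projected audio frames to the hypothesis-token
            # embeddings; the LLM is then fine-tuned to emit labels such as
            # "block, sound repetition" as ordinary text.
            return torch.cat([self.proj(audio_feats), text_embeds], dim=1)

    # 50 audio frames + 20 hypothesis tokens -> 70 prefix positions.
    prefix = AudioToLLMPrefix()(torch.randn(1, 50, 512),
                                torch.randn(1, 20, 1024))
    print(prefix.shape)  # torch.Size([1, 70, 1024])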
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER.
Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate (a toy GER prompt is sketched below).
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
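A hedged sketch of what a GER-style prompt might look like: the LLM sees the
ASR N-best list and rewrites it into a corrected transcript. The wording is an
assumption, not the paper's template.

    def build_ger_prompt(nbest: list[str]) -> str:
        # Number the hypotheses so the LLM can compare them.
        hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        return (
            "The following are noisy ASR hypotheses of the same utterance.\n"
            f"{hyps}\n"
            "Infer the true transcription and output it alone."
        )

    print(build_ger_prompt(["i red a book", "eye read a book",
                            "i read a book"]))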
- Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that systems incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information (a toy gating sketch follows below).
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
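A toy sketch of one plausible form of such a gate (an assumption, not the
paper's exact design): a sigmoid gate computed from both streams re-weights
the audio features against the text features.

    import torch
    import torch.nn as nn

    class AudioTextGate(nn.Module):
        def __init__(self, audio_dim=256, text_dim=256):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(audio_dim + text_dim, audio_dim), nn.Sigmoid())

        def forward(self, audio, text):
            # Gate values in (0, 1) decide, per dimension, how much of the
            # audio stream to keep versus the text stream.
            g = self.gate(torch.cat([audio, text], dim=-1))
            return g * audio + (1 - g) * text

    fused = AudioTextGate()(torch.randn(2, 100, 256),
                            torch.randn(2, 100, 256))
    print(fused.shape)  # torch.Size([2, 100, 256])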
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel, fully unsupervised learning method for discovering speech units.
The proposed method acquires words and phonemes directly from speech signals, without supervision.
arXiv Detail & Related papers (2022-01-18T07:31:59Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework (a conceptual sketch follows below).
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
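A conceptual stand-in sketch of the end-to-end coupling: simple modules
replace wav2vec 2.0 and BERT here, and the real adapters and training losses
are in the paper.

    import torch
    import torch.nn as nn

    class WavBertSketch(nn.Module):
        def __init__(self, acoustic_dim=768, text_dim=768, vocab_size=30000):
            super().__init__()
            # Stand-in for the pre-trained wav2vec 2.0 acoustic model.
            self.acoustic_model = nn.GRU(80, acoustic_dim, batch_first=True)
            # Bridge from the acoustic space into the linguistic space.
            self.bridge = nn.Linear(acoustic_dim, text_dim)
            # Stand-in for a BERT layer; both halves train jointly.
            self.language_model = nn.TransformerEncoderLayer(
                text_dim, nhead=8, batch_first=True)
            self.head = nn.Linear(text_dim, vocab_size)

        def forward(self, features):
            h, _ = self.acoustic_model(features)
            return self.head(self.language_model(self.bridge(h)))

    logits = WavBertSketch()(torch.randn(1, 200, 80))
    print(logits.shape)  # torch.Size([1, 200, 30000])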
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.