Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
- URL: http://arxiv.org/abs/2511.20107v1
- Date: Tue, 25 Nov 2025 09:26:34 GMT
- Title: Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach
- Authors: Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang
- Abstract summary: We propose a training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors.
- Score: 4.676867532704908
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.
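To make the detection-and-diagnosis formulation concrete, the sketch below aligns the phoneme sequence decoded from a learner's speech against the canonical pronunciation and reports substitution, deletion, and insertion errors. This is a minimal illustration of phoneme-level MDD via edit-distance alignment, not the paper's retrieval mechanism; the example phoneme sequences are hypothetical.

```python
# Minimal sketch of phoneme-level mispronunciation detection via
# edit-distance alignment. The phoneme sequences below are illustrative
# assumptions; the paper's actual retrieval over a pretrained ASR model
# is not reproduced here.

def align(canonical, observed):
    """Align two phoneme sequences and return per-phoneme edit operations."""
    n, m = len(canonical), len(observed)
    # dp[i][j] = minimum edits to align canonical[:i] with observed[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == observed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Backtrack to recover a diagnosis for each canonical phoneme.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if canonical[i - 1] == observed[j - 1] else 1):
            kind = "correct" if canonical[i - 1] == observed[j - 1] else "substitution"
            ops.append((kind, canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, observed[j - 1]))
            j -= 1
    return list(reversed(ops))

# Canonical phonemes for "think" vs. phonemes decoded from a learner
# who said "sink" (ARPAbet-style symbols, chosen for illustration).
canonical = ["TH", "IH", "NG", "K"]
observed = ["S", "IH", "NG", "K"]
errors = [op for op in align(canonical, observed) if op[0] != "correct"]
print(errors)  # [('substitution', 'TH', 'S')]
```

The alignment step stands in for whatever comparison the retrieval framework performs; the point is that once canonical and observed phoneme sequences are available, detection (an error occurred) and diagnosis (which phoneme was substituted, dropped, or inserted) fall out of the alignment without any task-specific training.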
Related papers
- Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation [12.39451124683428]
We propose a latent-variable-model-based method, with phonemes treated as discrete latent variables. Based on a multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish and Indonesian. With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions.
arXiv Detail & Related papers (2025-07-04T12:23:22Z) - Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM [9.84961079811343]
Speech has gained attention as a non-invasive and easily collectible biomarker for assessing cognitive decline. Traditional cognitive impairment detection methods rely on supervised models trained on acoustic and linguistic features extracted from speech. We propose the first zero-shot speech-based CI detection method using the Qwen2-Audio AudioLLM, a model capable of processing both audio and text inputs.
arXiv Detail & Related papers (2025-06-20T01:28:43Z) - The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks. We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities. Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - Seq2seq for Automatic Paraphasia Detection in Aphasic Speech [14.686874756530322]
Paraphasias are speech errors that are characteristic of aphasia and represent an important signal in assessing disease severity and subtype.
Traditionally, clinicians manually identify paraphasias by transcribing and analyzing speech-language samples.
We propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks.
arXiv Detail & Related papers (2023-12-16T18:22:37Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners from different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z) - Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z) - Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models [51.744357472072416]
We propose a method, which continually identifies the weak spots of a model to generate more valuable training instances.
Experimental results show that such an adversarial training method combined with the pretraining strategy can improve both the generalization and robustness of multiple CSC models.
arXiv Detail & Related papers (2021-05-31T09:17:33Z) - Multi-Modal Detection of Alzheimer's Disease from Speech and Text [3.702631194466718]
We propose a deep learning method that utilizes speech and the corresponding transcript simultaneously to detect Alzheimer's disease (AD).
The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the Dementiabank Pitt corpus.
arXiv Detail & Related papers (2020-11-30T21:18:17Z) - Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z) - Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.