Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling
- URL: http://arxiv.org/abs/2509.20396v1
- Date: Tue, 23 Sep 2025 12:54:30 GMT
- Title: Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling
- Authors: Niclas Pokel, Pehuén Moure, Roman Boehringer, Yingqiang Gao
- Abstract summary: This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
- Score: 2.0128859854921743
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
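The abstract's core recipe can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `stochastic_forward` interface, and the weighting formula are all hypothetical stand-ins. The idea shown is the one the abstract describes: keep dropout active at inference, treat the variance of repeated stochastic forward passes as per-phoneme difficulty, and oversample utterances rich in difficult phonemes during fine-tuning.

```python
import numpy as np

def mc_dropout_difficulty(stochastic_forward, features, n_passes=20, seed=0):
    """Per-phoneme difficulty via Monte Carlo Dropout.

    `stochastic_forward(features, rng)` stands in for a model forward pass
    with dropout left active, returning per-frame phoneme posteriors of
    shape (T, P). The variance across passes approximates epistemic
    uncertainty; averaging over frames yields one score per phoneme.
    """
    rng = np.random.default_rng(seed)
    passes = np.stack([stochastic_forward(features, rng) for _ in range(n_passes)])
    return passes.var(axis=0).mean(axis=0)  # shape (P,)

def oversample_weights(phoneme_counts, difficulty, alpha=1.0):
    """Utterance-level sampling weights for guided oversampling:
    utterances containing many difficult phonemes are drawn more often.

    phoneme_counts: (n_utterances, P) phoneme occurrence counts.
    """
    scores = phoneme_counts @ difficulty   # (n_utterances,)
    weights = 1.0 + alpha * scores
    return weights / weights.sum()         # a valid sampling distribution
```

In a real pipeline the weights would feed a weighted sampler over the fine-tuning set, so that gradient updates concentrate on the phonemes the model is least certain about.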
Related papers
- Training-Free Intelligibility-Guided Observation Addition for Noisy ASR [57.74127683005929]
This paper proposes an intelligibility-guided observation addition (OA) method to improve speech recognition in noisy environments. Experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines.
arXiv Detail & Related papers (2026-02-24T14:46:54Z) - Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels [6.254549196597175]
We propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency-based prompts. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy.
arXiv Detail & Related papers (2025-11-17T09:00:26Z) - Hallucination Benchmark for Speech Foundation Models [33.92968426403491]
Hallucinations in automatic speech recognition (ASR) systems refer to fluent and coherent transcriptions produced by neural ASR models that are completely unrelated to the underlying acoustic input (i.e., the speech signal). This apparent coherence can mislead subsequent processing stages and introduce serious risks, particularly in critical domains such as healthcare and law. We introduce SHALLOW, the first benchmark framework that systematically categorizes and quantifies hallucination phenomena in ASR along four complementary axes: lexical, phonetic, morphological, and semantic.
arXiv Detail & Related papers (2025-10-18T16:26:16Z) - Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition [8.838919369202525]
Speech impairments resulting from congenital disorders present major challenges to automatic speech recognition systems. State-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning.
arXiv Detail & Related papers (2025-09-23T13:44:58Z) - Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech [0.562479170374811]
Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition systems. We propose a practical and lightweight pipeline to personalize ASR models, formalizing the selection of words and enriching a small, speech-impaired dataset with semantic coherence. Our approach shows promising improvements in transcription quality, demonstrating the potential to reduce communication barriers for individuals with atypical speech patterns.
arXiv Detail & Related papers (2025-06-23T15:30:50Z) - FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL). FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios. We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z) - Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z) - Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline [12.197453599489963]
We propose the development of an Error Explainable Benchmark (EEB) dataset.
This dataset, which considers both the speech and text levels, enables a granular understanding of the model's shortcomings.
Our proposition provides a structured pathway for a more 'real-world-centric' evaluation, allowing for the detection and rectification of nuanced system weaknesses.
arXiv Detail & Related papers (2024-01-26T03:42:45Z) - Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z) - Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-round adaptation process that uses uncertainty to automate the annotation process. This novel method streamlines data annotation and strategically selects the data samples that contribute most to model uncertainty. Our results show that our approach yields a 27% relative average WER improvement while requiring on average 45% less data than established baselines.
arXiv Detail & Related papers (2023-06-03T13:11:37Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no additional correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z) - Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training [55.824641135682725]
Domain adaptation experiments using WSJ as the source domain, with TED-LIUM 3 and SWITCHBOARD as target domains, show that up to 80% of the performance of a system trained on ground-truth data can be recovered.
arXiv Detail & Related papers (2020-11-26T18:51:26Z) - A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
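The self-normalized importance sampling entry in the list above admits a compact illustration. The sketch below is a generic version of the estimator, not the authors' neural language-model implementation; the function name and interface are hypothetical. The key property is the one their summary highlights: because the weights are normalized by their own sum, only an unnormalized target density is needed and no separate correction step is required.

```python
import numpy as np

def snis_expectation(f_values, log_p_unnorm, log_q):
    """Self-normalized importance sampling estimate of E_p[f].

    Samples are drawn from a proposal q; only an *unnormalized*
    log-density of the target p is needed, because dividing by the
    sum of the weights cancels p's partition function.
    """
    log_w = log_p_unnorm - log_q
    log_w = log_w - log_w.max()     # stabilize the exponentiation
    w = np.exp(log_w)
    return float((w * f_values).sum() / w.sum())
```

For example, with a target proportional to [1, 2, 1] over {0, 1, 2} and a uniform proposal, the estimator recovers E_p[x] = 1.0 without ever computing the target's normalizing constant.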
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.