ASTER: Automatic Speech Recognition System Accessibility Testing for
Stutterers
- URL: http://arxiv.org/abs/2308.15742v1
- Date: Wed, 30 Aug 2023 03:46:52 GMT
- Title: ASTER: Automatic Speech Recognition System Accessibility Testing for
Stutterers
- Authors: Yi Liu, Yuekang Li, Gelei Deng, Felix Juefei-Xu, Yao Du, Cen Zhang,
Chengwei Liu, Yeting Li, Lei Ma and Yang Liu
- Abstract summary: We propose ASTER, a technique for automatically testing the accessibility of ASR systems.
ASTER generates valid test cases by injecting five different types of stuttering.
It significantly increases the word error rate, match error rate, and word information loss in the evaluated ASR systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing popularity of automatic speech recognition (ASR)
systems creates an increasing need to improve their accessibility. Handling
stuttering speech is an important feature of accessible ASR systems. To improve the
accessibility of ASR systems for stutterers, we need to expose and analyze the
failures of ASR systems on stuttering speech. The speech datasets recorded from
stutterers are not diverse enough to expose most of the failures. Furthermore,
these datasets lack ground truth information about the non-stuttered text,
rendering them unsuitable as comprehensive test suites. Therefore, a
methodology for generating stuttering speech as test inputs to test and analyze
the performance of ASR systems is needed. However, generating valid test inputs
in this scenario is challenging. The reason is that although the generated test
inputs should mimic how stutterers speak, they should also be diverse enough to
trigger more failures. To address the challenge, we propose ASTER, a technique
for automatically testing the accessibility of ASR systems. ASTER can generate
valid test cases by injecting five different types of stuttering. The generated
test cases can both simulate realistic stuttering speech and expose failures in
ASR systems. Moreover, ASTER can further enhance the quality of the test cases
with a multi-objective optimization-based seed updating algorithm. We
implemented ASTER as a framework and evaluated it on four open-source ASR
models and three commercial ASR systems. We conducted a comprehensive
evaluation of ASTER and found that it significantly increases the word error
rate, match error rate, and word information loss of the evaluated ASR systems.
Additionally, our user study demonstrates that the generated stuttering audio
is indistinguishable from real-world stuttering audio clips.
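The abstract names five types of stuttering that ASTER injects into seed inputs, but not the exact injection rules. The sketch below illustrates the general idea of text-level stutter injection with assumed, simplified rules (word repetition, sound repetition, prolongation, block, interjection); a real pipeline such as ASTER's would synthesize audio from the perturbed text and feed it to the ASR system under test.

```python
import random

def inject_stutter(words, kind, rng):
    """Return a copy of `words` with one stuttering event of `kind` injected.

    The five kinds follow the categories commonly used in the stuttering
    literature; the concrete rules here are illustrative assumptions, not
    ASTER's actual implementation.
    """
    words = list(words)
    i = rng.randrange(len(words))
    if kind == "word_repetition":        # "the the cat"
        words.insert(i, words[i])
    elif kind == "sound_repetition":     # "c-c-cat": repeat the onset sound
        onset = words[i][0]
        words[i] = f"{onset}-{onset}-{words[i]}"
    elif kind == "prolongation":         # "caaat": stretch the first vowel
        for j, ch in enumerate(words[i]):
            if ch.lower() in "aeiou":
                words[i] = words[i][:j] + ch * 3 + words[i][j + 1:]
                break
    elif kind == "block":                # silent pause marker before a word
        words.insert(i, "<pause>")
    elif kind == "interjection":         # filler word
        words.insert(i, "um")
    return words

# Demo: inject each stuttering type into the same seed sentence.
rng = random.Random(0)
seed = "the quick brown fox".split()
for kind in ("word_repetition", "sound_repetition", "prolongation",
             "block", "interjection"):
    print(kind, "->", " ".join(inject_stutter(seed, kind, rng)))
```

A seed-updating loop in the spirit of the paper's multi-objective optimization would then keep the perturbed sentences that maximize ASR error while remaining realistic to human listeners.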
Related papers
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs)
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech [0.0]
Speech recognition systems fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations.
This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark.
Results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions.
arXiv Detail & Related papers (2024-05-10T00:16:58Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass [0.0]
It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering.
We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible for People Who Stutter.
arXiv Detail & Related papers (2022-02-08T19:55:23Z)
- Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection [25.940199825317073]
We propose a cross-modal post-processing system for speech recognizers.
It fuses acoustic features and textual features from different modalities.
It jointly trains a confidence estimator and an error corrector in a multi-task learning fashion.
arXiv Detail & Related papers (2022-01-10T12:29:55Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously downgrade the performance of the NLP modules.
Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced by 14% relative with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
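Several of the papers above, including ASTER's evaluation, report results in terms of word error rate (WER), match error rate (MER), and word information loss (WIL). A minimal sketch of computing all three from a word-level Levenshtein alignment, using the standard metric definitions rather than any specific paper's code:

```python
def align_counts(ref, hyp):
    """Word-level Levenshtein DP over reference and hypothesis word lists.

    Returns (errors, hits), where errors = substitutions + deletions +
    insertions along a minimum-cost alignment. The DP state (cost, -hits)
    prefers, among minimum-cost alignments, the one with the most matches.
    """
    n, m = len(ref), len(hyp)
    dp = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0)          # delete all remaining reference words
    for j in range(1, m + 1):
        dp[0][j] = (j, 0)          # insert all remaining hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            miss = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag_cost, diag_hits = dp[i - 1][j - 1]
            dp[i][j] = min(
                (diag_cost + miss, diag_hits - (1 - miss)),  # match or sub
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1]),      # deletion
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1]),      # insertion
            )
    errors, neg_hits = dp[n][m]
    return errors, -neg_hits

def asr_metrics(reference, hypothesis):
    """Return (WER, MER, WIL) for two whitespace-tokenized transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    e, h = align_counts(ref, hyp)
    wer = e / len(ref)                           # word error rate
    mer = e / (h + e)                            # match error rate
    wil = 1 - (h * h) / (len(ref) * len(hyp))    # word information lost
    return wer, mer, wil
```

For example, `asr_metrics("the quick brown fox", "the quick brown box")` yields a WER and MER of 0.25 with one substitution; WIL penalizes the same error more heavily because it discounts hits against both transcript lengths.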
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.