Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass
- URL: http://arxiv.org/abs/2202.05396v1
- Date: Tue, 8 Feb 2022 19:55:23 GMT
- Title: Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass
- Authors: Olabanji Shonibare, Xiaosu Tong, Venkatesh Ravichandran
- Abstract summary: It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering.
We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible for People Who Stutter.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is estimated that around 70 million people worldwide are affected by a
speech disorder called stuttering. With recent advances in Automatic Speech
Recognition (ASR), voice assistants are increasingly useful in our everyday
lives. Many technologies in education, retail, telecommunication and healthcare
can now be operated through voice. Unfortunately, these benefits are not
accessible for People Who Stutter (PWS). We propose a simple but effective
method called 'Detect and Pass' to make modern ASR systems accessible for
People Who Stutter in a limited-data setting. The algorithm uses a context-aware
classifier, trained on a limited amount of data, to detect acoustic frames
that contain stutter. To improve robustness on stuttered speech, this extra
information is passed to the ASR model to be utilized during inference. Our
experiments show a reduction of 12.18% to 71.24% in Word Error Rate (WER)
across various state-of-the-art ASR systems. Upon varying the threshold of the
associated posterior probability of stutter for each stacked frame used in
determining low frame rate (LFR) acoustic features, we were able to determine
an optimal setting that reduced the WER by 23.93% to 71.67% across different
ASR systems.
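The abstract's pipeline can be illustrated with a minimal sketch: a frame-level detector emits a posterior probability of stutter per acoustic frame, frames are stacked into low-frame-rate (LFR) groups, and a stacked frame whose maximum posterior exceeds a threshold is flagged, with the flag appended to the features handed to the ASR model. All function and variable names below are illustrative assumptions, not taken from the paper.

```python
def stack_frames(frames, stack=3):
    """Group consecutive frames into LFR stacks (trailing partial stack dropped)."""
    return [frames[i:i + stack] for i in range(0, len(frames) - stack + 1, stack)]

def detect_and_pass(features, stutter_posteriors, stack=3, threshold=0.5):
    """Append a binary stutter flag to each stacked LFR feature vector.

    features:           list of per-frame feature vectors (lists of floats)
    stutter_posteriors: per-frame P(stutter) from a hypothetical detector
    """
    stacked_feats = stack_frames(features, stack)
    stacked_posts = stack_frames(stutter_posteriors, stack)
    augmented = []
    for feat_group, post_group in zip(stacked_feats, stacked_posts):
        # Flag the stacked frame if any constituent frame looks stuttered.
        flag = 1.0 if max(post_group) > threshold else 0.0
        # Flatten the stacked features and append the flag, which the ASR
        # model can consume as an extra input dimension during inference.
        flat = [x for frame in feat_group for x in frame]
        augmented.append(flat + [flag])
    return augmented

# Usage: six frames of 2-dim features; the last three frames are likely stutter.
feats = [[0.1, 0.2]] * 6
posts = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7]
lfr = detect_and_pass(feats, posts, stack=3, threshold=0.5)
print([v[-1] for v in lfr])  # stutter flag per stacked frame: [0.0, 1.0]
```

Varying `threshold` here corresponds to the posterior-threshold sweep the abstract describes for finding the optimal WER operating point.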
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
In healthcare settings, communication breakdowns reduce the quality of care.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
arXiv Detail & Related papers (2024-11-01T19:11:54Z) - Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
arXiv Detail & Related papers (2024-06-14T16:56:40Z) - BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z) - ASTER: Automatic Speech Recognition System Accessibility Testing for Stutterers [25.466850759460364]
We propose ASTER, a technique for automatically testing the accessibility of ASR systems.
ASTER generates valid test cases by injecting five different types of stuttering.
It significantly increases the word error rate, match error rate, and word information loss in the evaluated ASR systems.
arXiv Detail & Related papers (2023-08-30T03:46:52Z) - Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation [20.45373308116162]
This study focuses on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal).
For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system.
We find that using a self-training approach consistently yields improved performance (a relative WER reduction of up to 20.5% compared to using an ASR system trained on 24 minutes of transcribed data).
arXiv Detail & Related papers (2023-05-18T13:20:38Z) - Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech [20.2646788350211]
Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases.
We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances.
arXiv Detail & Related papers (2022-11-04T23:45:31Z) - Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA)
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English-speaking PWA, the system's accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.