Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation
- URL: http://arxiv.org/abs/2406.10177v2
- Date: Tue, 01 Oct 2024 21:13:39 GMT
- Authors: Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia Bin
- Abstract summary: A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
- Abstract: Automatic speech recognition (ASR) systems often falter while processing stuttering-related disfluencies -- such as involuntary blocks and word repetitions -- yielding inaccurate transcripts. A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets. Therefore, we present an inclusive ASR design approach, leveraging large-scale self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation on a smaller, curated dataset of disfluent speech. Our data augmentation technique enriches training datasets with various disfluencies, enhancing ASR processing of these speech patterns. Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech. Our approach not only advances ASR inclusivity for people who stutter, but also paves the way for ASRs that can accommodate wider speech variations.
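To make the augmentation concrete, here is a minimal, hypothetical sketch of text-level disfluency injection; the paper's actual pipeline operates on speech, and its disfluency types and rates are not given in the abstract, so the filler tokens and probabilities below are illustrative assumptions.

```python
import random

FILLERS = ["uh", "um"]  # illustrative interjection tokens, not the paper's inventory

def inject_disfluencies(transcript, p_repeat=0.1, p_filler=0.05, seed=None):
    """Enrich a clean transcript with word repetitions and filled pauses.

    A text-level stand-in for the paper's augmentation; a full pipeline
    would also modify the audio (e.g., repeating aligned segments).
    """
    rng = random.Random(seed)
    out = []
    for word in transcript.split():
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))  # insert a filled pause
        out.append(word)
        if rng.random() < p_repeat:
            out.append(word)  # whole-word repetition, e.g. "the the"
    return " ".join(out)

print(inject_disfluencies("please call stella and ask her to bring these things", seed=0))
```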
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
In healthcare settings, communication breakdowns reduce the quality of care.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
arXiv Detail & Related papers (2024-11-01T19:11:54Z)
- MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues [41.23757609484281]
Speech recognition errors can significantly degrade the performance of downstream tasks like summarization.
We propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models.
LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems.
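As a rough illustration of the idea (not MEDSAGE's actual prompts, which the abstract does not specify), an LLM can be asked to corrupt a clean dialogue turn with ASR-style errors:

```python
# Hypothetical prompt in the spirit of MEDSAGE; the paper's prompting strategy,
# model choice, and error taxonomy are assumptions here.
PROMPT_TEMPLATE = (
    "Rewrite the following medical dialogue turn as if it had been transcribed "
    "by an imperfect speech recognizer: substitute a few similar-sounding words, "
    "occasionally drop short function words, and remove punctuation.\n\n"
    "Clean turn: {turn}\n"
    "Noisy turn:"
)

def make_noisy_turn(turn, llm_call):
    """llm_call is any callable mapping a prompt string to generated text,
    so no particular LLM API is assumed."""
    return llm_call(PROMPT_TEMPLATE.format(turn=turn))
```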
arXiv Detail & Related papers (2024-08-26T17:04:00Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech [0.0]
Speech recognition systems fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations.
This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark.
Results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions.
arXiv Detail & Related papers (2024-05-10T00:16:58Z)
- BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute word error rate (WER) reduction.
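Speed perturbation itself is straightforward to sketch. The resampling-based version below (plain NumPy, linear interpolation) shows the core operation; production recipes typically use sox or torchaudio resampling, and the 0.9/1.0/1.1 factors are conventional choices rather than the paper's exact settings.

```python
import numpy as np

def speed_perturb(wave, factor):
    """Resample a waveform so it plays `factor` times faster, changing
    duration and pitch together, as in Kaldi-style speed perturbation."""
    n_out = int(round(len(wave) / factor))
    positions = np.linspace(0.0, len(wave) - 1, num=n_out)
    return np.interp(positions, np.arange(len(wave)), wave)

utterance = np.random.randn(16000)  # 1 s of dummy audio at 16 kHz
augmented = [speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)]
```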
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
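A minimal PyTorch sketch of the target-switching idea follows, assuming per-frame context vectors and quantized targets have already been computed; the real method's masking, negative sampling, and codebook diversity loss are omitted, so this is conceptual rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def switched_contrastive_loss(c_orig, c_noisy, q_orig, q_noisy, temp=0.1):
    """Context vectors (T, D) from each view must identify not only their own
    quantized targets but also the other view's, encouraging noise-invariant
    representations. Distractors are simply the other frames in the utterance."""
    def contrastive(c, q):
        sim = F.cosine_similarity(c.unsqueeze(1), q.unsqueeze(0), dim=-1) / temp
        labels = torch.arange(c.size(0))  # the time-aligned frame is the positive
        return F.cross_entropy(sim, labels)

    return (contrastive(c_orig, q_orig) + contrastive(c_noisy, q_noisy) +
            contrastive(c_orig, q_noisy) + contrastive(c_noisy, q_orig)) / 4
```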
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition [1.8702587873591643]
We investigate data selection and preparation choices aiming for improved robustness of RNN-T ASR to speech disfluencies.
We show that including a small amount of disfluent speech data in the training set improves recognition accuracy on test sets containing disfluencies and stuttering.
arXiv Detail & Related papers (2020-12-11T11:47:13Z)
- Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors [5.53506103787497]
Speech-based virtual assistants, such as Amazon Alexa, Google Assistant, and Apple Siri, typically convert users' audio signals to text data through automatic speech recognition (ASR).
The ASR output is error-prone; however, the downstream dialog models are often trained on error-free text data, making them sensitive to ASR errors during inference time.
We leverage an ASR error simulator to inject noise into the error-free text data, and subsequently train the dialog models with the augmented data.
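A toy version of such an error simulator is easy to write; the paper's actual simulator models real ASR error patterns, so the confusion table and error rates below are placeholders.

```python
import random

def simulate_asr_errors(text, confusions, p_sub=0.1, p_del=0.03, seed=None):
    """Inject ASR-style substitutions and deletions into clean text.
    `confusions` maps a word to acoustically confusable alternatives."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < p_del:
            continue  # deletion error
        if word in confusions and rng.random() < p_sub:
            word = rng.choice(confusions[word])  # substitution error
        out.append(word)
    return " ".join(out)

confusions = {"flights": ["fights"], "two": ["to", "too"]}
print(simulate_asr_errors("book two flights for tomorrow", confusions, seed=1))
```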
arXiv Detail & Related papers (2020-06-10T03:18:15Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)