Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale
Speech Recognition
- URL: http://arxiv.org/abs/2402.18923v1
- Date: Thu, 29 Feb 2024 07:29:42 GMT
- Title: Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale
Speech Recognition
- Authors: Jeehyun Lee, Yerin Choi, Tae-Jin Song, Myoung-Wan Koo
- Abstract summary: Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy.
We propose a large-scale speech recognition model for inappropriate pause detection in dysarthric speech.
Our experiments show that the proposed method detects inappropriate pauses in dysarthric speech better than baselines.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dysarthria, a common issue among stroke patients, severely impacts speech
intelligibility. Inappropriate pauses are crucial indicators in severity
assessment and speech-language therapy. We propose to extend a large-scale
speech recognition model for inappropriate pause detection in dysarthric
speech. To this end, we propose task design, labeling strategy, and a speech
recognition model with an inappropriate pause prediction layer. First, we treat
pause detection as speech recognition, using an automatic speech recognition
(ASR) model to convert speech into text with pause tags. According to the newly
designed task, we label pause locations at the text level and their
appropriateness. We collaborate with speech-language pathologists to establish
labeling criteria, ensuring high-quality annotated data. Finally, we extend the
ASR model with an inappropriate pause prediction layer for end-to-end
inappropriate pause detection. Moreover, we propose a task-tailored metric for
evaluating inappropriate pause detection independent of ASR performance. Our
experiments show that the proposed method detects inappropriate pauses in
dysarthric speech better than baselines (Inappropriate Pause Error Rate: 14.47%).
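The abstract mentions a task-tailored metric (Inappropriate Pause Error Rate) that scores pause detection independently of ASR word accuracy, but does not spell it out. As a minimal sketch of one way such a metric could work: strip the recognized words, keep only the pause tags, and compute a Levenshtein-style error rate over the tag sequences alone, so word-recognition mistakes do not affect the score. The tag names (`<p>` for an appropriate pause, `<ip>` for an inappropriate one) and the alignment scheme below are illustrative assumptions, not the paper's definitions.

```python
# Hedged sketch of a pause-error-rate metric that is independent of
# ASR word accuracy. Tag names "<p>" / "<ip>" are assumed for
# illustration; the paper defines its own labeling scheme and metric.

PAUSE_TAGS = {"<p>", "<ip>"}

def pause_sequence(tagged_text: str) -> list[str]:
    """Keep only the pause tags, discarding recognized words, so the
    score does not depend on how well the ASR transcribed the words."""
    return [tok for tok in tagged_text.split() if tok in PAUSE_TAGS]

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over tag sequences
    (substitutions, insertions, deletions all cost 1)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution / match
            prev = cur
    return d[-1]

def pause_error_rate(ref_text: str, hyp_text: str) -> float:
    """Edit distance between reference and hypothesis pause-tag
    sequences, normalized by the number of reference pauses."""
    ref, hyp = pause_sequence(ref_text), pause_sequence(hyp_text)
    if not ref:
        return 0.0
    return edit_distance(ref, hyp) / len(ref)
```

For example, a hypothesis that misrecognizes every word but places the pause tags correctly would still score 0.0, which is the point of decoupling the metric from ASR performance.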
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Self-supervised Speech Models for Word-Level Stuttered Speech Detection [66.46810024006712]
We introduce a word-level stuttering speech detection model leveraging self-supervised speech models.
Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection.
arXiv Detail & Related papers (2024-09-16T20:18:20Z)
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- Infusing Acoustic Pause Context into Text-Based Dementia Assessment [7.8642589679025034]
This work investigates the use of pause-enriched transcripts in language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment.
The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts.
arXiv Detail & Related papers (2024-08-27T16:44:41Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- End-to-end Speech-to-Punctuated-Text Recognition [23.44236710364419]
Punctuation marks are important for the readability of speech recognition results.
Conventional automatic speech recognition systems do not produce punctuation marks.
We propose an end-to-end model that takes speech as input and outputs punctuated texts.
arXiv Detail & Related papers (2022-07-07T08:58:01Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Towards Interpretability of Speech Pause in Dementia Detection using Adversarial Learning [4.19159477763309]
Speech pause is an effective biomarker in dementia detection.
Recent deep learning models have exploited speech pauses to achieve highly accurate dementia detection.
We study the positions and lengths of dementia-sensitive pauses using adversarial learning approaches.
arXiv Detail & Related papers (2021-11-14T21:26:18Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- End-to-End Speech Recognition and Disfluency Removal [15.910282983166024]
This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
arXiv Detail & Related papers (2020-09-22T03:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.