Related papers: Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech

URL: http://arxiv.org/abs/2506.01618v1
Date: Mon, 02 Jun 2025 12:57:36 GMT
Title: Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech
Authors: Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai.-Doss
Abstract summary: We explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions.
Score: 17.105048387175817
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic speech recognition (ASR) systems struggle with dysarthric speech due to high inter-speaker variability and slow speaking rates. To address this, we explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method suited for dysarthric speech. We assess its impact on ASR by training LF-MMI models and fine-tuning Whisper on converted speech. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions, especially for more severe cases of dysarthria, while fine-tuning Whisper on converted data has minimal effect on its performance. These results highlight the potential of unsupervised rhythm and voice conversion for dysarthric ASR. Code available at: https://github.com/idiap/RnV

Related papers

Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching [0.0]
Dysarthria is a neurological disorder that significantly impairs speech intelligibility. This necessitates the development of robust dysarthric-to-regular speech conversion techniques.
arXiv Detail & Related papers (2025-06-19T08:24:17Z)
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages [32.61962553268565]
We fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions. We then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS).
arXiv Detail & Related papers (2025-05-20T20:03:45Z)
Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR [18.701864254184308]
We combine rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria.
arXiv Detail & Related papers (2025-01-17T15:39:21Z)
Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns. We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data. Our work looks to leverage SOTA ASR followed by domain-specific error correction.
arXiv Detail & Related papers (2024-11-01T19:11:54Z)
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech. We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z)
Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech. A speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity. We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition. Both normal and disordered speech were exploited in the augmentation process. The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)