Cross-lingual Self-Supervised Speech Representations for Improved
Dysarthric Speech Recognition
- URL: http://arxiv.org/abs/2204.01670v1
- Date: Mon, 4 Apr 2022 17:36:01 GMT
- Title: Cross-lingual Self-Supervised Speech Representations for Improved
Dysarthric Speech Recognition
- Authors: Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael
Orozco-Arroyave, Andreas Maier, Seung Hee Yang
- Abstract summary: This study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech.
We train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model.
Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance.
- Score: 15.136348385992047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art automatic speech recognition (ASR) systems perform well on
healthy speech. However, performance on impaired speech remains an issue. The
current study explores the usefulness of Wav2Vec self-supervised speech
representations as features for training an ASR system
for dysarthric speech. Dysarthric speech recognition is particularly difficult
as several aspects of speech such as articulation, prosody and phonation can be
impaired. Specifically, we train an acoustic model with features extracted from
Wav2Vec, Hubert, and the cross-lingual XLSR model. Results suggest that speech
representations pretrained on large unlabelled data can improve word error rate
(WER) performance. In particular, features from the multilingual model led to
lower WERs than filterbanks (Fbank) or models trained on a single language.
Improvements were observed in English speakers with dysarthria caused by
cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria
(PC-GITA corpus), and Italian speakers with paralysis-based dysarthria (EasyCall
corpus). Compared to using Fbank features, XLSR-based features reduced WERs by
6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora,
respectively.
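
To make the feature-extraction step concrete, the following is a minimal sketch, assuming the publicly released Hugging Face checkpoints rather than the authors' exact setup, of how frame-level representations from a cross-lingual XLSR (wav2vec 2.0) model could replace filterbank features when training an acoustic model. The checkpoint name, 16 kHz resampling, and use of the final hidden layer are illustrative assumptions.

```python
# Minimal sketch: extract frame-level XLSR (wav2vec 2.0) representations to use
# as acoustic-model input features in place of filterbanks. The checkpoint name,
# 16 kHz resampling, and use of the final hidden layer are illustrative choices,
# not the paper's exact configuration.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53"  # cross-lingual XLSR checkpoint

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

def xlsr_features(wav_path: str) -> torch.Tensor:
    """Return a (num_frames, hidden_size) matrix of XLSR features for one utterance."""
    waveform, sample_rate = torchaudio.load(wav_path)
    waveform = waveform.mean(dim=0)  # collapse to mono
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000,
                               return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # one feature vector per ~20 ms frame

# Example: dump features for an utterance before training the acoustic model.
# feats = xlsr_features("utterance.wav")
# print(feats.shape)  # e.g. torch.Size([T, 1024])
```

Swapping the checkpoint for a HuBERT or monolingual wav2vec 2.0 model (e.g. via transformers' HubertModel) would yield the other feature sets compared in the paper.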
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0 [0.22940141855172028]
Fine-tuning wav2vec 2.0 for the classification of stuttering on a sizeable English corpus boosts the effectiveness of the general-purpose features.
We evaluate our method on Fluencybank and the German therapy-centric Kassel State of Fluency dataset.
arXiv Detail & Related papers (2022-04-07T13:02:12Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases [11.3463024120429]
We develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases.
We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases.
arXiv Detail & Related papers (2021-07-08T17:24:25Z)
- Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders.
arXiv Detail & Related papers (2021-06-18T20:58:34Z)
- Leveraging neural representations for facilitating access to untranscribed speech from endangered languages [10.61744395262441]
We use data selected from 7 Australian Aboriginal languages and a regional variety of Dutch.
We find that representations from the middle layers of the wav2vec 2.0 Transformer offer large gains in task performance (a layer-extraction sketch appears after this list).
While features extracted using the pre-trained English model yielded improved detection on all the evaluation languages, better detection performance was associated with the evaluation language's phonological similarity to English.
arXiv Detail & Related papers (2021-03-26T16:44:08Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
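
Several of the related papers above probe layer-wise wav2vec 2.0 representations rather than only the final layer (e.g. the endangered-languages study, which reports the largest gains from middle Transformer layers). Below is a minimal sketch of how features from a chosen intermediate layer could be extracted with the Hugging Face transformers library; the checkpoint name and layer index are illustrative assumptions, not settings reported by any of these papers.

```python
# Minimal sketch: pull representations from an intermediate wav2vec 2.0
# Transformer layer instead of the final one. The checkpoint and the choice
# of layer 6 are illustrative assumptions, not values from the papers above.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base"  # any wav2vec 2.0-style checkpoint
LAYER = 6  # a "middle" layer of the 12-layer base model; tune per task

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT, output_hidden_states=True).eval()

def layer_features(waveform: torch.Tensor, layer: int = LAYER) -> torch.Tensor:
    """Return (num_frames, hidden_size) features from one Transformer layer.

    `waveform` is assumed to be a mono 16 kHz signal as a 1-D tensor.
    """
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000,
                               return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the convolutional encoder / feature-projection output;
    # hidden_states[layer] is the output of Transformer layer `layer`.
    return outputs.hidden_states[layer].squeeze(0)
```

Sweeping `layer` over all Transformer layers and evaluating each feature set on the downstream task is the usual way to locate the most useful layer for a given model and dataset.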