UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization
- URL: http://arxiv.org/abs/2401.14664v1
- Date: Fri, 26 Jan 2024 06:08:47 GMT
- Title: UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization
- Authors: Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng
- Abstract summary: Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
- Score: 60.43992089087448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dysarthric speech reconstruction (DSR) systems aim to automatically convert
dysarthric speech into normal-sounding speech. The technology eases
communication with speakers affected by the neuromotor disorder and enhances
their social inclusion. NED-based (Neural Encoder-Decoder) systems have
significantly improved the intelligibility of the reconstructed speech as
compared with GAN-based (Generative Adversarial Network) approaches, but the
approach is still limited by training inefficiency caused by the cascaded
pipeline and auxiliary tasks of the content encoder, which may in turn affect
the quality of reconstruction. Inspired by self-supervised speech
representation learning and discrete speech units, we propose a Unit-DSR
system, which harnesses the powerful domain-adaptation capacity of HuBERT for
training efficiency improvement and utilizes speech units to constrain the
dysarthric content restoration in a discrete linguistic space. Compared with
NED approaches, the Unit-DSR system only consists of a speech unit normalizer
and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded
sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that
Unit-DSR outperforms competitive baselines in terms of content restoration,
reaching a 28.2% relative average word error rate reduction when compared to
original dysarthric speech, and shows robustness against speed perturbation and
noise.
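The headline metric above is a relative word error rate (WER) reduction. As a minimal sketch of how such a figure is computed (the 50.0%/35.9% WER values below are purely illustrative, not numbers reported in the paper):

```python
def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction, as a percentage of the original WER."""
    return (wer_before - wer_after) / wer_before * 100.0

# Illustrative only: if original dysarthric speech were recognized at
# 50.0% WER and the reconstructed speech at 35.9% WER, the relative
# reduction would be (50.0 - 35.9) / 50.0 * 100 = 28.2%.
print(round(relative_wer_reduction(50.0, 35.9), 1))  # → 28.2
```

Note the distinction from an absolute reduction, which would simply be the difference of the two WERs (14.1 percentage points in this hypothetical case).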
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
In healthcare settings, communication breakdowns reduce the quality of care.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
arXiv Detail & Related papers (2024-11-01T19:11:54Z) - CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model that leverages neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z) - Accurate synthesis of Dysarthric Speech for ASR data augmentation [5.223856537504927]
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility.
This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation.
arXiv Detail & Related papers (2023-08-16T15:42:24Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z) - Spectro-Temporal Deep Features for Disordered Speech Assessment and
Recognition [65.25325641528701]
Motivated by the spectro-temporal differences between disordered and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features, derived by SVD decomposition of the speech spectrum, are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to a 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.