Recent Progress in the CUHK Dysarthric Speech Recognition System
- URL: http://arxiv.org/abs/2201.05845v1
- Date: Sat, 15 Jan 2022 13:02:40 GMT
- Title: Recent Progress in the CUHK Dysarthric Speech Recognition System
- Authors: Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui,
Jianwei Yu, Xunying Liu and Helen Meng
- Abstract summary: Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
- Score: 66.69024814159447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies
in the past few decades, recognition of disordered speech remains a highly
challenging task to date. Disordered speech presents a wide spectrum of
challenges to current data-intensive deep neural network (DNN) based ASR
technologies that predominantly target normal speech. This paper presents
recent research efforts at the Chinese University of Hong Kong (CUHK) to
improve the performance of disordered speech recognition systems on the largest
publicly available UASpeech dysarthric speech corpus. A set of novel modelling
techniques including neural architectural search, data augmentation using
spectro-temporal perturbation, model-based speaker adaptation and cross-domain
generation of visual features within an audio-visual speech recognition (AVSR)
system framework were employed to address the above challenges. The combination
of these techniques produced the lowest published word error rate (WER) of
25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER
reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric
speech recognition system featuring a 6-way DNN system combination and cross
adaptation of out-of-domain normal speech data trained systems. Bayesian model
adaptation further allows rapid adaptation to individual dysarthric speakers to
be performed using as little as 3.06 seconds of speech. The efficacy of these
techniques was further demonstrated on the CUDYS Cantonese dysarthric speech
recognition task.
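The spectro-temporal perturbation above combines tempo, speed and spectral-level changes; the sketch below illustrates only the speed component via rational resampling, with the function name and factor sweep as illustrative choices rather than the paper's implementation.
```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Kaldi-style speed perturbation: resample so the audio plays `factor`
    times faster, shifting tempo and pitch/formants together."""
    frac = Fraction(factor).limit_denominator(1000)
    # Output length ~= len(waveform) / factor.
    return resample_poly(waveform, up=frac.denominator, down=frac.numerator)

# A typical augmentation sweep: slowed (0.9x), original (1.0x), sped up (1.1x).
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # 1 s of dummy audio at 16 kHz
augmented = {f: speed_perturb(audio, f) for f in (0.9, 1.0, 1.1)}
```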
Related papers
- UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT to improve training efficiency.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
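As a rough sketch of the speech-unit idea, the snippet below encodes audio with a pretrained HuBERT model and quantizes frame features with k-means; the checkpoint name, cluster count and single-utterance clustering are simplifying assumptions, not the Unit-DSR recipe.
```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

# Encode a waveform into frame-level features with a pretrained HuBERT model.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
wav = torch.randn(1, 16000 * 10)  # 10 s dummy waveform at 16 kHz
with torch.no_grad():
    feats = model(wav).last_hidden_state[0]  # (frames, 768)

# Quantize frames into discrete speech units; a real codebook would be
# fit on a large corpus rather than a single utterance.
units = KMeans(n_clusters=100, n_init=10).fit_predict(feats.numpy())
```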
arXiv Detail & Related papers (2024-01-26T06:08:47Z)
- Use of Speech Impairment Severity for Dysarthric Speech Recognition [37.93801885333925]
This paper proposes a novel set of techniques to use both severity and speaker identity in dysarthric speech recognition.
Experiments conducted on UASpeech suggest that incorporating speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer and pre-trained Wav2vec 2.0 ASR systems reduces word error rates.
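As one illustration of injecting severity into an acoustic model, the sketch below appends a learned severity embedding to every input frame; the four-level scale mirrors UASpeech intelligibility groups, while all dimensions and names are assumptions rather than the paper's design.
```python
import torch
import torch.nn as nn

N_SEVERITY_LEVELS = 4  # e.g. very low / low / mid / high intelligibility

class SeverityConditionedEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, sev_dim: int = 16, hidden: int = 256):
        super().__init__()
        self.sev_embed = nn.Embedding(N_SEVERITY_LEVELS, sev_dim)
        self.proj = nn.Linear(feat_dim + sev_dim, hidden)

    def forward(self, feats: torch.Tensor, severity: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim); severity: (batch,) integer level
        sev = self.sev_embed(severity)                        # (batch, sev_dim)
        sev = sev.unsqueeze(1).expand(-1, feats.size(1), -1)  # tile over frames
        return torch.relu(self.proj(torch.cat([feats, sev], dim=-1)))

enc = SeverityConditionedEncoder()
out = enc(torch.randn(2, 100, 80), torch.tensor([0, 3]))  # (2, 100, 256)
```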
arXiv Detail & Related papers (2023-05-18T02:42:59Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that systems incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
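A minimal sketch of A2A inversion as frame-level regression follows; the 80-dim acoustic input and 24-dim articulatory output are illustrative assumptions, while the paper trains on parallel audio and ultrasound tongue imaging data.
```python
import torch
import torch.nn as nn

# Acoustic-to-articulatory (A2A) inversion as a simple frame-level regressor:
# acoustic frames in, articulatory feature vectors out.
a2a = nn.Sequential(
    nn.Linear(80, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 24),             # e.g. 24-dim articulatory representation
)
mse = nn.MSELoss()

acoustic = torch.randn(32, 80)      # batch of acoustic frames
target = torch.randn(32, 24)        # parallel articulatory targets (e.g. UTI-derived)
loss = mse(a2a(acoustic), target)   # train on parallel data, then generate
loss.backward()                     # articulatory features for the ASR system
```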
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric
and Elderly Speech Recognition [48.33873602050463]
Speaker adaptation techniques play a key role in the personalization of ASR systems for dysarthric and elderly users.
Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech, novel spectro-temporal subspace basis deep embedding features are derived using SVD decomposition of the speech spectrum.
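Model-based speaker adaptation in this line of CUHK work typically follows the LHUC family; a minimal sketch, assuming a 256-unit hidden layer and the commonly published 2·sigmoid re-parameterization rather than details taken from this summary:
```python
import torch
import torch.nn as nn

class LHUC(nn.Module):
    """Learning Hidden Unit Contributions: a per-speaker scaling vector on
    hidden activations; during speaker adaptation only `r` is updated."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h * (2.0 * torch.sigmoid(self.r))  # scale in (0, 2) per unit

layer = LHUC()
scaled = layer(torch.randn(8, 100, 256))  # (batch, frames, hidden)
```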
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and
Recognition [65.25325641528701]
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation.
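As a rough illustration of deriving subspace bases from a spectrogram by SVD, the sketch below truncates to the top-k singular vectors; the 80-band input, the value of k and the function name are assumptions, not the paper's exact recipe.
```python
import numpy as np

def spectro_temporal_bases(spectrogram: np.ndarray, k: int = 10):
    """Illustrative SVD decomposition of a (freq x time) log-mel spectrogram:
    left singular vectors span a spectral subspace, right singular vectors a
    temporal subspace; the top-k of each give fixed-size utterance-level
    descriptors (a rough analogue of subspace basis embedding features)."""
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :k]   # (freq, k)
    temporal_basis = Vt[:k, :]  # (k, time)
    return spectral_basis, temporal_basis, s[:k]

logmel = np.random.rand(80, 200)  # dummy 80-band, 200-frame utterance
S, T, sv = spectro_temporal_bases(logmel, k=5)
```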
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
- Improving Dysarthric Speech Intelligibility Using Cycle-consistent
Adversarial Training [4.050982413149992]
The goal of our work is to develop a model for dysarthric to healthy speech conversion using Cycle-consistent GAN.
The generator is trained to transform dysarthric speech to healthy speech in the spectral domain, which is then converted back to a time-domain waveform.
arXiv Detail & Related papers (2020-01-10T01:40:27Z)
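A minimal sketch of the cycle-consistency term at the heart of such training, using toy convolutional generators over mel-spectrograms; all module shapes and names are illustrative assumptions, and real systems combine this term with adversarial and identity losses.
```python
import torch
import torch.nn as nn

# Toy 1-D conv generators over mel-spectrogram frames (illustrative only).
def make_generator(n_mels: int = 80) -> nn.Module:
    return nn.Sequential(
        nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
    )

G = make_generator()  # dysarthric -> healthy (spectral domain)
F = make_generator()  # healthy -> dysarthric
l1 = nn.L1Loss()

def cycle_loss(x_dys: torch.Tensor, x_healthy: torch.Tensor) -> torch.Tensor:
    """Cycle-consistency: mapping a spectrogram to the other domain and back
    should reconstruct the input."""
    return l1(F(G(x_dys)), x_dys) + l1(G(F(x_healthy)), x_healthy)

x_d = torch.randn(4, 80, 100)  # batch of dysarthric mel-spectrograms
x_h = torch.randn(4, 80, 100)  # batch of healthy mel-spectrograms
loss = cycle_loss(x_d, x_h)    # combined with adversarial losses in practice
```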