Personalized Adversarial Data Augmentation for Dysarthric and Elderly
Speech Recognition
- URL: http://arxiv.org/abs/2205.06445v2
- Date: Tue, 17 May 2022 01:22:39 GMT
- Title: Personalized Adversarial Data Augmentation for Dysarthric and Elderly
Speech Recognition
- Authors: Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan
Li, Xunying Liu
- Abstract summary: This paper presents a novel set of speaker dependent (SD) generative adversarial network (GAN) based data augmentation approaches for elderly and dysarthric speech recognition.
These GAN based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute WER reduction on the TORGO and DementiaBank data respectively.
Consistent performance improvements are retained after applying LHUC based speaker adaptation.
- Score: 30.885165674448352
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies
targeting normal speech, accurate recognition of dysarthric and elderly speech
remains a highly challenging task to date. It is difficult to collect large
quantities of such data for ASR system development due to the mobility issues
often found among these users. To this end, data augmentation techniques play a
vital role. In contrast to existing data augmentation techniques only modifying
the speaking rate or overall shape of spectral contour, fine-grained
spectro-temporal differences between dysarthric, elderly and normal speech are
modelled using a novel set of speaker dependent (SD) generative adversarial
networks (GAN) based data augmentation approaches in this paper. These flexibly
allow both: a) temporal or speed perturbed normal speech spectra to be modified
to be closer to those of an impaired speaker when parallel speech data is
available; and b) for non-parallel data, the SVD decomposed normal speech
spectral basis features to be transformed into those of a target elderly
speaker before being re-composed with the temporal bases to produce the
augmented data for state-of-the-art TDNN and Conformer ASR system training.
Experiments are conducted on four tasks: the English UASpeech and TORGO
dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC
MoCA elderly speech datasets. The proposed GAN based data augmentation
approaches consistently outperform the baseline speed perturbation method by up
to 0.91% and 3.0% absolute (9.61% and 6.4% relative) WER reduction on the TORGO
and DementiaBank data respectively. Consistent performance improvements are
retained after applying LHUC based speaker adaptation.
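The non-parallel case in (b) can be illustrated with a minimal sketch, assuming NumPy and a spectrogram stored as a (frequency bins x frames) matrix. The function names, the toy data, and the `stand_in_generator` are hypothetical: the abstract does not specify the GAN architecture or how the singular values are handled, so the generator here is only a placeholder for a trained speaker dependent model.

```python
import numpy as np

def svd_decompose(spectrogram):
    """Split a (freq_bins x frames) matrix into spectral and temporal bases."""
    # Columns of `spectral` span the frequency axis; rows of `temporal`
    # span the time axis; `singular` holds the singular values.
    spectral, singular, temporal = np.linalg.svd(spectrogram, full_matrices=False)
    return spectral, singular, temporal

def recompose(spectral, singular, temporal):
    """Rebuild a spectrogram from (possibly transformed) bases."""
    return (spectral * singular) @ temporal

def augment(spectrogram, generator):
    """Transform only the spectral basis, then re-compose with the
    original temporal basis (assumption: singular values are kept)."""
    spectral, singular, temporal = svd_decompose(spectrogram)
    return recompose(generator(spectral), singular, temporal)

rng = np.random.default_rng(0)
normal_spec = rng.random((40, 100))  # toy stand-in for 40-bin features, 100 frames

# Hypothetical placeholder for a trained GAN generator (NOT the paper's model).
def stand_in_generator(basis):
    return basis + 0.01 * rng.standard_normal(basis.shape)

identity_out = augment(normal_spec, lambda basis: basis)
augmented = augment(normal_spec, stand_in_generator)
print(identity_out.shape, augmented.shape)
```

With an identity generator the pipeline reproduces the input, which is a quick check that decomposition and re-composition are consistent before a real generator is plugged in.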
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features in adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features that are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and
Dysarthric Speech Recognition [64.9816313630768]
Fine-tuning is often used to exploit models pre-trained on large quantities of non-aged and healthy speech.
This paper investigates hyper-parameter adaptation for Conformer ASR systems that are pre-trained on the Librispeech corpus.
arXiv Detail & Related papers (2023-06-27T07:49:35Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and
Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric
and Elderly Speech Recognition [48.33873602050463]
Speaker adaptation techniques play a key role in personalization of ASR systems for such users.
Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech, this paper proposes novel spectro-temporal subspace basis deep embedding features derived using SVD decomposition of the speech spectrum.
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)