Related papers: Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

URL: http://arxiv.org/abs/2407.06310v1
Date: Mon, 8 Jul 2024 18:20:24 GMT
Title: Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation
Authors: Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu,
Abstract summary: This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features. Speaker-regularized spectral basis embedding-SBE features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation. Feature-based learning hidden unit contributions (f-LHUC) that are conditioned on VR-LH features that are shown to be insensitive to speaker-level data quantity in testtime adaptation.
Score: 71.31331402404662
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factors speeding up to 33.6 times against xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in testtime adaptation. T-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.

Related papers

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition [30.885165674448352]
This paper presents a novel set of speaker dependent (GAN) based data augmentation approaches for elderly and dysarthric speech recognition. GAN based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute. Consistent performance improvements are retained after applying LHUC based speaker adaptation.
arXiv Detail & Related papers (2022-05-13T04:29:49Z)
On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods. This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition [48.33873602050463]
Speaker adaptation techniques play a key role in personalization of ASR systems for such users. Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech. Novel spectrotemporal subspace basis deep embedding features derived using SVD speech spectrum.
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition. Both normal and disordered speech were exploited in the augmentation process. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute word error rate (WER)
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i- adaptation by up to 263% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.