Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition
- URL: http://arxiv.org/abs/2206.07327v3
- Date: Thu, 22 Jun 2023 06:31:15 GMT
- Title: Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition
- Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan
Li, Tianzi Wang, Xunying Liu, Helen Meng
- Abstract summary: Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggest that systems incorporating the generated articulatory features consistently outperform the baseline TDNN and Conformer ASR systems.
- Score: 55.25565305101314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Articulatory features are inherently invariant to acoustic signal distortion
and have been successfully incorporated into automatic speech recognition (ASR)
systems designed for normal speech. Their practical application to atypical
task domains such as elderly and disordered speech across languages is often
limited by the difficulty in collecting such specialist data from target
speakers. This paper presents a cross-domain and cross-lingual A2A inversion
approach that utilizes the parallel audio and ultrasound tongue imaging (UTI)
data of the 24-hour TaL corpus in A2A model pre-training before being
cross-domain and cross-lingual adapted to three datasets across two languages:
the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora;
and the English TORGO dysarthric speech data, to produce UTI based articulatory
features. Experiments conducted on three tasks suggested incorporating the
generated articulatory features consistently outperformed the baseline TDNN and
Conformer ASR systems constructed using acoustic features only by statistically
significant word or character error rate reductions up to 4.75%, 2.59% and
2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation,
speaker adaptation and cross system multi-pass decoding were applied.
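The pipeline described in the abstract — pre-train an A2A inversion model on parallel audio/UTI data, adapt it, generate articulatory features, and fuse them with acoustic features as ASR input — can be sketched as follows. All layer sizes, feature dimensions, and function names here are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of A2A inversion + feature fusion; dimensions are assumed.
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """Random weights for a small MLP: dims = [in, hidden..., out]."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def a2a_invert(acoustic, params):
    """Map acoustic frames (T x 40 fbank, assumed) to articulatory
    features (T x 6, e.g. UTI-derived tongue-shape descriptors, assumed)."""
    h = acoustic
    for k, (W, b) in enumerate(params):
        h = h @ W + b
        if k < len(params) - 1:
            h = np.tanh(h)              # hidden-layer non-linearity
    return h

# In the paper this model would be pre-trained on TaL audio/UTI pairs and
# then cross-domain/cross-lingually adapted; here we just build one.
params = init_mlp([40, 64, 6])          # 40-d fbank -> 6-d articulatory

T = 100
fbank = rng.standard_normal((T, 40))    # stand-in acoustic features
artic = a2a_invert(fbank, params)       # generated articulatory features

# Fuse: concatenate acoustic and generated articulatory features
# frame-wise, as input to the TDNN/Conformer acoustic model.
fused = np.concatenate([fbank, artic], axis=1)
print(fused.shape)                      # (100, 46)
```

The key design point is that, at test time, only audio is required: the articulatory stream is generated by the inversion model rather than measured from the speaker.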
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features in adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features and shown to be insensitive to speaker-level data quantity in test-time adaptation.
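LHUC-style adaptation, referenced above, rescales each hidden unit's output with a speaker-dependent amplitude, typically parameterized as 2·sigmoid(r) so that r = 0 leaves the network unchanged. The sketch below shows only this generic scaling mechanism; the dimensions and the feature-conditioned (f-LHUC) variant's regression network are omitted and all names are illustrative.

```python
# Hedged sketch of LHUC speaker adaptation: per-speaker amplitudes
# rescale hidden activations while shared weights stay frozen.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(h, r):
    """Scale hidden activations h (T x D) by speaker amplitudes 2*sigmoid(r)."""
    return h * (2.0 * sigmoid(r))

rng = np.random.default_rng(1)
h = rng.standard_normal((10, 8))   # hidden activations for one utterance
r = np.zeros(8)                    # unadapted speaker: 2*sigmoid(0) = 1
adapted = lhuc_layer(h, r)         # identity when r is all zeros
```

In adaptation, only r is estimated per speaker (by gradient descent on that speaker's data), which is what makes the method data-efficient.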
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z)
- Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition [30.885165674448352]
This paper presents a novel set of speaker-dependent GAN-based data augmentation approaches for elderly and dysarthric speech recognition.
GAN-based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute.
Consistent performance improvements are retained after applying LHUC based speaker adaptation.
arXiv Detail & Related papers (2022-05-13T04:29:49Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced an absolute word error rate (WER) reduction of up to 2.92%.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
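Speed perturbation, the baseline augmentation method the entries above compare against, amounts to resampling the waveform so playback speed (and effective speaking rate) changes. The sketch below uses linear interpolation for brevity; the perturbation factors are the values commonly used in Kaldi-style recipes, assumed here rather than taken from these papers.

```python
# Hedged sketch of speed perturbation via linear-interpolation resampling.
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Time-stretch by resampling: factor > 1 speeds up (shorter signal),
    factor < 1 slows down (longer signal)."""
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

wave = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))  # toy 1 s signal
fast = speed_perturb(wave, 1.1)   # ~10% faster, ~14545 samples
slow = speed_perturb(wave, 0.9)   # ~10% slower, ~17778 samples
```

Each perturbed copy is then treated as an additional training utterance, tripling the data when factors 0.9, 1.0 and 1.1 are all used.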
This list is automatically generated from the titles and abstracts of the papers in this site.