Use of Speech Impairment Severity for Dysarthric Speech Recognition
- URL: http://arxiv.org/abs/2305.10659v1
- Date: Thu, 18 May 2023 02:42:59 GMT
- Title: Use of Speech Impairment Severity for Dysarthric Speech Recognition
- Authors: Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu
Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu
- Abstract summary: This paper proposes a novel set of techniques that use both severity and speaker identity in dysarthric speech recognition.
Experiments conducted on UASpeech suggest that incorporating speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer and pre-trained Wav2vec 2.0 ASR systems produces statistically significant WER reductions.
- Score: 37.93801885333925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in dysarthric speech recognition is the speaker-level
diversity attributed to both speaker-identity associated factors such as
gender, and speech impairment severity. Most prior research on this issue has
focused on speaker identity alone. To address this, this paper proposes a novel
set of techniques that use both severity and speaker identity in
dysarthric speech recognition: a) multitask training incorporating severity
prediction error; b) speaker-severity aware auxiliary feature adaptation; and
c) structured LHUC transforms separately conditioned on speaker-identity and
severity. Experiments conducted on UASpeech suggest that incorporating speech
impairment severity into state-of-the-art hybrid DNN, E2E Conformer and
pre-trained Wav2vec 2.0 ASR systems produced statistically significant WER
reductions of up to 4.78% absolute (14.03% relative). Using the best system,
the lowest published WER of 17.82% (51.25% on very low intelligibility speech)
was obtained on UASpeech.
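Technique (a), multitask training incorporating severity prediction error, amounts to interpolating the ASR loss with a severity classification loss. The sketch below is a minimal illustration of that idea; the function names, the four-band severity targets and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[target_idx])

def multitask_loss(asr_probs, asr_target, sev_probs, sev_target, lam=0.1):
    """Interpolate the ASR loss with a speech-impairment-severity
    prediction loss; `lam` is an assumed interpolation weight."""
    l_asr = cross_entropy(asr_probs, asr_target)
    l_sev = cross_entropy(sev_probs, sev_target)
    return l_asr + lam * l_sev

# Toy example: 4 ASR output classes and 4 severity bands
# (e.g. very low / low / mid / high intelligibility on UASpeech).
asr_probs = np.array([0.7, 0.1, 0.1, 0.1])
sev_probs = np.array([0.6, 0.2, 0.1, 0.1])
print(round(multitask_loss(asr_probs, 0, sev_probs, 0), 4))  # → 0.4078
```

In practice both terms would be computed over a shared encoder, so gradients from the severity branch regularize the acoustic representation toward severity-aware features.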
Related papers
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and
Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition [48.33873602050463]
Speaker adaptation techniques play a key role in personalization of ASR systems for such users.
Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech, novel spectro-temporal subspace basis deep embedding features are derived using SVD decomposition of the speech spectrum.
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to 2.92% absolute word error rate (WER) reduction.
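Speed perturbation, the augmentation the entry above reports as most effective, can be sketched as simple waveform resampling: stretching or compressing the time axis changes tempo (and pitch). The perturbation factors below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def speed_perturb(wav, factor):
    """Resample a 1-D waveform by `factor` (>1 = faster, hence shorter)."""
    n_out = int(len(wav) / factor)
    old_idx = np.arange(len(wav))
    new_idx = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(new_idx, old_idx, wav)

# Toy 1-second, 16 kHz tone as a stand-in for real speech.
wav = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
fast = speed_perturb(wav, 1.1)   # ~10% faster
slow = speed_perturb(wav, 0.9)   # ~10% slower
print(len(fast), len(slow))  # → 14545 17777
```

Toolkits such as Kaldi and SoX implement speed perturbation with proper resampling filters; linear interpolation is used here only to keep the sketch dependency-free.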
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
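The SVD-based spectro-temporal subspace features described in the entry above can be illustrated by decomposing a (frequency x time) log-spectrogram and keeping the leading singular vectors as spectral and temporal basis embeddings. The subspace dimension `k` and the toy spectrogram are assumptions for illustration, not the papers' exact feature pipeline.

```python
import numpy as np

def svd_subspace_features(spectrogram, k=2):
    """Return the top-k spectral (U) and temporal (V) basis vectors
    of a (freq, time) spectrogram, plus their singular values."""
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :k]    # (freq, k): dominant spectral shapes
    temporal_basis = Vt[:k, :]   # (k, time): dominant temporal envelopes
    return spectral_basis, temporal_basis, s[:k]

# Toy spectrogram: 40 mel bins x 100 frames of random energies.
rng = np.random.default_rng(0)
spec = rng.random((40, 100))
spectral, temporal, sv = svd_subspace_features(spec, k=2)
print(spectral.shape, temporal.shape)  # → (40, 2) (2, 100)
```

The intuition is that articulatory imprecision and slower speaking rates shift these dominant subspaces, so the basis vectors carry information about how disordered a speaker's spectro-temporal patterns are.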