Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition
- URL: http://arxiv.org/abs/2203.10274v1
- Date: Sat, 19 Mar 2022 08:47:18 GMT
- Title: Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition
- Authors: Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang,
Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng
- Abstract summary: Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
- Score: 57.15942628305797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Articulatory features are inherently invariant to acoustic signal distortion
and have been successfully incorporated into automatic speech recognition (ASR)
systems for normal speech. Their practical application to disordered speech
recognition is often limited by the difficulty in collecting such specialist
data from impaired speakers. This paper presents a cross-domain
acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel
acoustic-articulatory data of the 15-hour TORGO corpus in model training before
being cross-domain adapted to the 102.7-hour UASpeech corpus to produce
articulatory features. Mixture density network (MDN) based neural A2A inversion
models were used. A cross-domain feature adaptation network was also used to
reduce the acoustic mismatch between the TORGO and UASpeech data. On both
tasks, incorporating the A2A generated articulatory features consistently
outperformed the baseline hybrid DNN/TDNN, CTC and Conformer based end-to-end
systems constructed using acoustic features only. The best multi-modal system
incorporating video modality and the cross-domain articulatory features as well
as data augmentation and learning hidden unit contributions (LHUC) speaker
adaptation produced the lowest published word error rate (WER) of 24.82% on the
16 dysarthric speakers of the benchmark UASpeech task.
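The MDN-based A2A inversion models mentioned in the abstract predict a Gaussian mixture over articulatory targets for each acoustic frame. A minimal sketch of the MDN output layer and its negative log-likelihood loss is given below; the parameter layout (mixture logits, then means, then per-component log-variances of spherical Gaussians) and the function names are illustrative assumptions, not the paper's implementation.

```python
import math

def mdn_split(params, n_mix, dim):
    """Split a flat network output vector into mixture weights,
    component means and log-variances (spherical Gaussians).
    Assumed layout: [n_mix logits | n_mix * dim means | n_mix log-vars]."""
    logits = params[:n_mix]
    means = [params[n_mix + k * dim : n_mix + (k + 1) * dim]
             for k in range(n_mix)]
    log_vars = params[n_mix + n_mix * dim:]
    # Softmax over mixture logits (max-shifted for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    weights = [e / s for e in exps]
    return weights, means, log_vars

def mdn_nll(params, target, n_mix, dim):
    """Negative log-likelihood of an articulatory target vector
    under the predicted Gaussian mixture."""
    w, mu, lv = mdn_split(params, n_mix, dim)
    log_probs = []
    for k in range(n_mix):
        var = math.exp(lv[k])
        sq = sum((t - m) ** 2 for t, m in zip(target, mu[k]))
        log_n = -0.5 * (dim * math.log(2 * math.pi * var) + sq / var)
        log_probs.append(math.log(w[k]) + log_n)
    # Log-sum-exp over mixture components.
    mx = max(log_probs)
    return -(mx + math.log(sum(math.exp(lp - mx) for lp in log_probs)))
```

At inference time, a single articulatory feature vector is typically taken from the mixture, e.g. the mean of the most probable component.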
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features and are shown to be insensitive to speaker-level data quantity in test-time adaptation.
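LHUC adaptation, referenced both here and in the abstract above, re-scales each hidden unit by a learned speaker-dependent amplitude. A minimal sketch of the standard LHUC scaling (not the f-LHUC conditioning network) is shown below, assuming the common parameterization r_i = 2·sigmoid(alpha_i):

```python
import math

def lhuc_scale(hidden, alphas):
    """Apply LHUC speaker-dependent amplitude scaling: each hidden unit h_i
    is multiplied by r_i = 2 * sigmoid(alpha_i), so r_i lies in (0, 2) and
    alpha_i = 0 leaves the unit unchanged (r_i = 1)."""
    return [2.0 / (1.0 + math.exp(-a)) * h
            for h, a in zip(hidden, alphas)]
```

Only the per-speaker alpha vector is updated during adaptation, which keeps the number of speaker-specific parameters small relative to the full network.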
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit
Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z) - Exploring Self-supervised Pre-trained ASR Models For Dysarthric and
Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z) - Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging
Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z) - Acoustic-to-articulatory Inversion based on Speech Decomposition and
Auxiliary Feature [7.363994037183394]
We pre-train a speech decomposition network to decompose audio speech into speaker embedding and content embedding.
We then propose a novel auxiliary feature network to estimate the lip auxiliary features from the personalized speech features.
Experimental results show that, compared with the state-of-the-art only using the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0%.
arXiv Detail & Related papers (2022-04-02T14:47:19Z) - Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number
of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z) - Raw Waveform Encoder with Multi-Scale Globally Attentive Locally
Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated list (including all information) and is not responsible for any consequences.