Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?
- URL: http://arxiv.org/abs/2309.01108v4
- Date: Fri, 9 Feb 2024 23:01:51 GMT
- Title: Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?
- Authors: Sarthak Kumar Maharana, Krishna Kamal Adidam, Shoumik Nandi, Ajitesh
Srivastava
- Abstract summary: Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic to the articulatory space.
In this work, we perform AAI for dysarthric speech using representations from pre-trained self-supervised learning (SSL) models.
- Score: 3.43759997215733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic
to the articulatory space. Signal-processing features, such as MFCCs, have been
widely used for the AAI task. For subjects with dysarthric speech, AAI is
challenging because of imprecise and indistinct pronunciation. In this work,
we perform AAI for dysarthric speech using representations from pre-trained
self-supervised learning (SSL) models. We demonstrate the impact of different
pre-trained features on this challenging AAI task under low-resource conditions.
In addition, we condition the extracted SSL features on x-vectors to train
a BLSTM network. In the seen case, we experiment with three AAI training
schemes (subject-specific, pooled, and fine-tuned). The results, consistent
across training schemes, reveal that DeCoAR, in the fine-tuned scheme, achieves
a relative improvement of the Pearson Correlation Coefficient (CC) by ~1.81%
and ~4.56% for healthy controls and patients, respectively, over MFCCs. We
observe similar average trends for different SSL features in the unseen case.
Overall, SSL networks like wav2vec, APC, and DeCoAR, trained with feature
reconstruction or future timestep prediction tasks, perform well in predicting
dysarthric articulatory trajectories.
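For concreteness, here is a minimal sketch of the kind of model the abstract describes: a BLSTM that maps frame-level SSL features, conditioned on an utterance-level x-vector, to articulatory trajectories, scored with the Pearson Correlation Coefficient. All dimensions and layer sizes are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class BLSTMInverter(nn.Module):
    """BLSTM AAI model: SSL features + x-vector -> articulatory trajectories."""
    def __init__(self, ssl_dim=768, xvec_dim=512, hidden=256, n_articulators=12):
        super().__init__()
        self.blstm = nn.LSTM(ssl_dim + xvec_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_articulators)

    def forward(self, ssl_feats, xvec):
        # ssl_feats: (batch, frames, ssl_dim); xvec: (batch, xvec_dim)
        xvec = xvec.unsqueeze(1).expand(-1, ssl_feats.size(1), -1)
        out, _ = self.blstm(torch.cat([ssl_feats, xvec], dim=-1))
        return self.head(out)  # (batch, frames, n_articulators)

def pearson_cc(pred, target, eps=1e-8):
    """Per-articulator Pearson CC over time, averaged (the paper's metric)."""
    p = pred - pred.mean(dim=1, keepdim=True)
    t = target - target.mean(dim=1, keepdim=True)
    cc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return cc.mean()
```

Assuming the standard definition of relative improvement, (CC_DeCoAR - CC_MFCC) / CC_MFCC ≈ 0.0181 for healthy controls; e.g., a baseline CC of 0.80 would rise to roughly 0.814.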
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
In healthcare settings, communication breakdowns reduce the quality of care.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
arXiv Detail & Related papers (2024-11-01T19:11:54Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
The first, speaker-regularized spectral basis embedding (SBE) features, exploits a special regularization term to enforce homogeneity of speaker features during adaptation.
The second, feature-based learning hidden unit contributions (f-LHUC), is conditioned on the regularized SBE features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
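As a rough illustration of the second method (a sketch based on the general LHUC literature, not this paper's code), an f-LHUC-style layer rescales hidden activations element-wise by amplitudes predicted from a speaker-level feature vector:

```python
import torch
import torch.nn as nn

class FLHUC(nn.Module):
    """Feature-based LHUC: speaker features -> per-unit scaling amplitudes."""
    def __init__(self, spk_feat_dim, hidden_dim):
        super().__init__()
        self.predict = nn.Linear(spk_feat_dim, hidden_dim)

    def forward(self, hidden, spk_feat):
        # hidden: (batch, frames, hidden_dim); spk_feat: (batch, spk_feat_dim)
        # 2*sigmoid keeps scales in (0, 2), the usual LHUC parameterization.
        scale = 2.0 * torch.sigmoid(self.predict(spk_feat))
        return hidden * scale.unsqueeze(1)  # broadcast over frames
```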
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer [56.17737749551133]
We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression.
By taking advantage of high-quality pretrained speech features and longitudinal information in the recordings, our best model achieves 91.0% AUC.
ALST is capable of fine-grained and interpretable predictions of ALS progression, especially for distinguishing between rarer and more severe cases.
arXiv Detail & Related papers (2024-06-26T13:28:24Z)
- Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring [13.817385516193445]
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features.
Deep neural networks are commonly trained to map fluency-related features to human scores.
We introduce a self-supervised learning (SSL) approach that takes into account phonetic and prosody awareness for fluency scoring.
arXiv Detail & Related papers (2023-05-19T05:39:41Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
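A minimal probing sketch of this kind of analysis (an assumption about the setup, not the paper's exact protocol): fit a linear map from frame-level SSL representations to EMA trajectories and score it with Pearson correlation, as in AAI evaluation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def probe_ema(ssl_train, ema_train, ssl_test, ema_test):
    # ssl_*: (frames, ssl_dim); ema_*: (frames, n_articulators)
    probe = Ridge(alpha=1.0).fit(ssl_train, ema_train)
    pred = probe.predict(ssl_test)
    p = pred - pred.mean(axis=0)
    t = ema_test - ema_test.mean(axis=0)
    cc = (p * t).sum(axis=0) / (
        np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0) + 1e-8)
    return cc.mean()  # average Pearson CC across articulators
```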
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments on three tasks suggest that systems incorporating the generated articulatory features consistently outperform the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
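The fusion step both of these A2A papers rely on can be sketched as follows (a hypothetical interface; `a2a_model` stands in for any pre-trained inversion network): generated articulatory features are concatenated with the acoustic front-end features before being fed to the TDNN/Conformer ASR system.

```python
import torch

def fuse_features(acoustic, a2a_model):
    # acoustic: (frames, feat_dim) acoustic front-end features
    # a2a_model: pre-trained acoustic-to-articulatory inversion network
    with torch.no_grad():  # inversion model is frozen at ASR training time
        articulatory = a2a_model(acoustic.unsqueeze(0)).squeeze(0)
    # ASR input = acoustic features augmented with articulatory features
    return torch.cat([acoustic, articulatory], dim=-1)
```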
arXiv Detail & Related papers (2022-03-19T08:47:18Z)