Acoustic-to-articulatory inversion for dysarthric speech: Are
pre-trained self-supervised representations favorable?
- URL: http://arxiv.org/abs/2309.01108v4
- Date: Fri, 9 Feb 2024 23:01:51 GMT
- Title: Acoustic-to-articulatory inversion for dysarthric speech: Are
pre-trained self-supervised representations favorable?
- Authors: Sarthak Kumar Maharana, Krishna Kamal Adidam, Shoumik Nandi, Ajitesh
Srivastava
- Abstract summary: Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic to the articulatory space.
In this work, we perform AAI for dysarthric speech using representations from pre-trained self-supervised learning (SSL) models.
- Score: 3.43759997215733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acoustic-to-articulatory inversion (AAI) involves mapping from the acoustic
to the articulatory space. Signal-processing features such as MFCCs have been
widely used for the AAI task. For subjects with dysarthric speech, AAI is
challenging because of their imprecise and indistinct pronunciation. In this work,
we perform AAI for dysarthric speech using representations from pre-trained
self-supervised learning (SSL) models. We demonstrate the impact of different
pre-trained features on this challenging AAI task under low-resource conditions.
In addition, we condition the extracted SSL features on x-vectors to train
a BLSTM network. In the seen case, we experiment with three AAI training
schemes (subject-specific, pooled, and fine-tuned). The results, consistent
across training schemes, reveal that DeCoAR, in the fine-tuned scheme, achieves
a relative improvement of the Pearson Correlation Coefficient (CC) by ~1.81%
and ~4.56% for healthy controls and patients, respectively, over MFCCs. We
observe similar average trends for different SSL features in the unseen case.
Overall, SSL networks like wav2vec, APC, and DeCoAR, trained with feature
reconstruction or future timestep prediction tasks, perform well in predicting
dysarthric articulatory trajectories.
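As a rough illustration of the evaluation described above, here is a minimal sketch of the per-trajectory Pearson CC, the relative-improvement computation, and x-vector conditioning by frame-wise concatenation. All dimensions and data below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

def pearson_cc(pred, true):
    """Pearson correlation coefficient between a predicted and a
    ground-truth articulatory trajectory (1-D arrays of equal length)."""
    p = np.asarray(pred, dtype=float)
    t = np.asarray(true, dtype=float)
    p = p - p.mean()
    t = t - t.mean()
    return float((p @ t) / (np.linalg.norm(p) * np.linalg.norm(t)))

def relative_improvement(cc_new, cc_baseline):
    """Relative CC improvement (in %) of one feature set over a baseline,
    e.g. DeCoAR features over MFCCs."""
    return 100.0 * (cc_new - cc_baseline) / cc_baseline

# Toy trajectory: a sinusoid plus noise standing in for an EMA channel.
rng = np.random.default_rng(0)
time = np.linspace(0.0, 1.0, 200)
true_traj = np.sin(2 * np.pi * 3 * time)
pred_traj = true_traj + 0.1 * rng.normal(size=time.size)
print(f"CC = {pearson_cc(pred_traj, true_traj):.3f}")

# Conditioning on a speaker x-vector can be sketched by tiling the
# utterance-level vector across frames and concatenating it with the
# frame-level SSL features (feature dims here are assumptions).
ssl_feats = rng.normal(size=(200, 768))   # T x D frame-level features
xvec = rng.normal(size=512)               # utterance-level speaker embedding
blstm_input = np.concatenate([ssl_feats, np.tile(xvec, (200, 1))], axis=1)
print(blstm_input.shape)                  # -> (200, 1280)
```

The concatenated matrix would then be fed frame-by-frame into a sequence model such as a BLSTM.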
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
1) Speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features during adaptation; and
2) feature-based learning hidden unit contributions (f-LHUC) transforms conditioned on VR-SBE features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer [56.17737749551133]
We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression.
By taking advantage of high-quality pretrained speech features and longitudinal information in the recordings, our best model achieves 91.0% AUC.
ALST is capable of fine-grained and interpretable predictions of ALS progression, especially for distinguishing between rarer and more severe cases.
arXiv Detail & Related papers (2024-06-26T13:28:24Z)
- Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring [13.817385516193445]
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features.
Deep neural networks are commonly trained to map fluency-related features to human scores.
We introduce a self-supervised learning (SSL) approach that takes into account phonetic and prosody awareness for fluency scoring.
arXiv Detail & Related papers (2023-05-19T05:39:41Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
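The kind of representation-to-articulation analysis described above can be sketched as a simple linear probe from frame-level features to EMA channels. The sizes, synthetic data, and least-squares probe below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, A = 300, 64, 12  # frames, feature dim, EMA channels (hypothetical sizes)

# Synthetic stand-ins: frame-level SSL features, and EMA trajectories that
# are a noisy linear function of them (for illustration only).
W_true = rng.normal(size=(D, A))
feats = rng.normal(size=(T, D))
ema = feats @ W_true + 0.1 * rng.normal(size=(T, A))

# Linear probe: least-squares map from representations to articulator positions.
W, *_ = np.linalg.lstsq(feats, ema, rcond=None)
pred = feats @ W

# Per-channel Pearson CC between predicted and "measured" trajectories.
cc = [np.corrcoef(pred[:, a], ema[:, a])[0, 1] for a in range(A)]
print(f"mean CC = {np.mean(cc):.3f}")
```

A high mean CC for such a probe would indicate that the representations encode articulatory information in a (close to) linearly decodable form.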
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that systems incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition [1.0705399532413615]
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition systems.
We study the effect of domain, language, dataset size, and other aspects of our upstream pre-training SSL data on the final performance of a low-resource downstream ASR task.
arXiv Detail & Related papers (2022-03-31T11:48:24Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The A2A model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.