Speaker-Independent Dysarthria Severity Classification using
Self-Supervised Transformers and Multi-Task Learning
- URL: http://arxiv.org/abs/2403.00854v1
- Date: Thu, 29 Feb 2024 18:30:52 GMT
- Title: Speaker-Independent Dysarthria Severity Classification using
Self-Supervised Transformers and Multi-Task Learning
- Authors: Lauren Stumpf and Balasundaram Kadirvelu and Sigourney Waibel and A.
Aldo Faisal
- Abstract summary: This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data.
We develop a framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification.
Our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of $70.48\%$ and an F1 score of $59.23\%$.
- Score: 2.7706924578324665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dysarthria, a condition resulting from impaired control of the speech muscles
due to neurological disorders, significantly impacts the communication and
quality of life of patients. The condition's complexity, its reliance on
subjective human scoring, and its varied presentations make assessment and
management challenging. This study presents a transformer-based framework for
automatically assessing dysarthria severity from raw speech data. It can offer
an objective, repeatable, accessible, standardised and cost-effective
alternative to traditional methods requiring human expert assessors. We develop
a transformer framework, called
Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task
learning objective and contrastive learning for speaker-independent multi-class
dysarthria severity classification. The multi-task framework is designed to
reduce reliance on speaker-specific characteristics and address the intrinsic
intra-class variability of dysarthric speech. Evaluated on the Universal
Access Speech dataset using leave-one-speaker-out cross-validation, our model
demonstrated superior performance over traditional machine learning approaches,
with an accuracy of $70.48\%$ and an F1 score of $59.23\%$. Our SALR model also
exceeded the previous benchmark for AI-based classification, which used support
vector machines, by $16.58\%$. We open the black box of our model by
visualising the latent space, where we observe that the model substantially
reduces speaker-specific cues and amplifies task-specific ones, demonstrating
its robustness. In conclusion, SALR establishes a new benchmark in
speaker-independent multi-class dysarthria severity classification using
generative AI. We discuss the potential implications of our findings for
broader clinical applications in automated dysarthria severity assessment.
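The listing carries no reference code, so the following is only a minimal PyTorch sketch of how a multi-task objective of this kind could pair severity cross-entropy with a contrastive term that clusters latents by severity rather than by speaker. Every name and hyperparameter here (SALRStyleLoss, temperature, alpha) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class SALRStyleLoss(torch.nn.Module):
    """Illustrative combination of (1) severity cross-entropy and (2) a
    SupCon-style contrastive term that pulls together latents sharing a
    severity label, across speakers, and pushes the rest apart.
    Hyperparameters are assumptions, not the paper's values."""

    def __init__(self, temperature: float = 0.1, alpha: float = 0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # weight of the contrastive term

    def forward(self, latents, logits, severity):
        # Task loss: standard multi-class cross-entropy on severity labels.
        ce = F.cross_entropy(logits, severity)

        # Contrastive loss over L2-normalised latents.
        z = F.normalize(latents, dim=1)
        sim = z @ z.t() / self.temperature                    # (B, B)
        self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
        pos_mask = (severity.unsqueeze(0) == severity.unsqueeze(1)) & ~self_mask
        log_prob = sim - torch.logsumexp(
            sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
        contrastive = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
        return ce + self.alpha * contrastive.mean()
```

Because positives are defined by severity label rather than speaker identity, gradient pressure organises the latent space around task-relevant cues, consistent with the latent-space behaviour the abstract describes.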
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-level variance-regularised spectral basis embedding (VR-SBE) features exploit a special regularisation term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-SBE features and are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
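As a rough illustration of augmentation in embedding space (the paper's exact formulation is not reproduced here), the hypothetical helper below perturbs speaker embeddings along class-covariance directions, scaled by a per-sample difficulty score; the names, signature, and scaling rule are all assumptions.

```python
import torch

def semantic_augment(emb, class_cov, difficulty, strength=0.5):
    """Hypothetical embedding-space augmentation: shift each speaker
    embedding along directions sampled from its class covariance, scaled
    by a per-sample difficulty score in [0, 1] (e.g., derived from the
    current loss). Cheap, since no extra audio is synthesised.

    emb:        (B, D) speaker embeddings
    class_cov:  (B, D, D) covariance matrix of each sample's class
    difficulty: (B,) difficulty scores; harder samples move further
    """
    noise = torch.randn_like(emb)                          # (B, D)
    direction = torch.einsum("bij,bj->bi", class_cov, noise)
    return emb + strength * difficulty.unsqueeze(1) * direction
```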
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - Detecting Speech Abnormalities with a Perceiver-based Sequence
Classifier that Leverages a Universal Speech Model [4.503292461488901]
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings.
Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%.
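A minimal sketch of the general pattern such a model follows: a fixed set of learned latents cross-attends to a long sequence of speech features, so classification cost does not grow with utterance length. Dimensions, head counts, and names are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PerceiverPoolClassifier(nn.Module):
    """A small set of learned latents cross-attends to a (possibly long)
    sequence of speech features; the pooled latents are then classified."""

    def __init__(self, feat_dim=1024, d_model=256, n_latents=32, n_classes=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.proj = nn.Linear(feat_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats):                              # (B, T, feat_dim)
        kv = self.proj(feats)                              # (B, T, d_model)
        q = self.latents.expand(feats.size(0), -1, -1)     # (B, n_latents, d)
        attended, _ = self.cross_attn(q, kv, kv)           # latents read input
        return self.head(attended.mean(dim=1))             # class logits
```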
arXiv Detail & Related papers (2023-10-16T21:07:12Z) - A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment [6.284142286798582]
We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, achieving respectively $+24.7\%$, $+61\%$, and $+7.2\%$ accuracy compared to classical acoustic features.
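A minimal sketch of using a pretrained HuBERT checkpoint from the transformers library as a frozen utterance-level feature extractor; the checkpoint name and mean pooling are assumptions for illustration, not necessarily the study's exact pipeline.

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Frozen HuBERT as an utterance-level feature extractor.
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def hubert_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: 1-D tensor of 16 kHz audio samples."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state        # (1, T', 768)
    return hidden.mean(dim=1)                              # (1, 768) feature
```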
arXiv Detail & Related papers (2023-06-07T11:04:02Z) - Automatic Severity Classification of Dysarthric speech by using
Self-supervised Model with Multi-task Learning [4.947423926765435]
We propose a novel automatic severity assessment method for dysarthric speech using the self-supervised model in conjunction with multi-task learning.
Wav2vec 2.0 XLS-R is trained for two different tasks: severity classification and auxiliary automatic speech recognition (ASR).
Our model outperforms the traditional baseline methods, with a relative percentage increase of 1.25% for F1-score.
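A sketch of the two-head setup the summary describes, using an XLS-R checkpoint from transformers: a pooled severity head trained with cross-entropy plus an auxiliary frame-level CTC head for ASR. The checkpoint, vocabulary size, pooling, and loss weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model

class MultiTaskSeverityASR(nn.Module):
    """Shared wav2vec 2.0 XLS-R encoder with a pooled severity head and an
    auxiliary frame-level CTC head; details are illustrative assumptions."""

    def __init__(self, n_severity=4, vocab_size=32, ctc_weight=0.3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-xls-r-300m")
        d = self.encoder.config.hidden_size
        self.severity_head = nn.Linear(d, n_severity)
        self.ctc_head = nn.Linear(d, vocab_size)
        self.ctc_weight = ctc_weight

    def forward(self, input_values, severity, transcripts, transcript_lens):
        h = self.encoder(input_values).last_hidden_state    # (B, T, d)
        sev_logits = self.severity_head(h.mean(dim=1))      # utterance-level
        log_probs = F.log_softmax(self.ctc_head(h), dim=-1) # frame-level
        in_lens = torch.full((h.size(0),), h.size(1), dtype=torch.long)
        loss = F.cross_entropy(sev_logits, severity) + self.ctc_weight * \
            F.ctc_loss(log_probs.transpose(0, 1), transcripts,
                       in_lens, transcript_lens)
        return loss, sev_logits
```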
arXiv Detail & Related papers (2022-10-27T12:48:10Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
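Claims of this kind are typically tested with an encoding model: regress measured brain responses onto a layer's activations and score held-out predictions. Below is a generic scikit-learn sketch with placeholder arrays, not the paper's data or features.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder arrays standing in for model activations and brain recordings.
layer_feats = np.random.randn(600, 768)   # (timepoints, model features)
brain_resp = np.random.randn(600, 100)    # (timepoints, voxels/electrodes)

encoder = RidgeCV(alphas=np.logspace(-2, 4, 13))
encoder.fit(layer_feats[:500], brain_resp[:500])           # train split
pred = encoder.predict(layer_feats[500:])                  # held-out split
# Per-channel Pearson r between predicted and measured responses:
r = [np.corrcoef(pred[:, i], brain_resp[500:, i])[0, 1] for i in range(100)]
```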
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
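A minimal sketch of the adjacent-segment idea: treat consecutive segments as same-speaker positive pairs in an InfoNCE-style loss. This is a stand-in objective under that assumption; the paper's exact loss and encoder are not reproduced here.

```python
import torch
import torch.nn.functional as F

def adjacent_pair_loss(encoder, segments, temperature=0.1):
    """segments: (B, 2, T) waveforms, where segments[:, 0] and segments[:, 1]
    are temporally adjacent chunks assumed to share a speaker. Adjacent pairs
    are positives; all other pairings in the batch act as negatives."""
    z1 = F.normalize(encoder(segments[:, 0]), dim=1)       # (B, D)
    z2 = F.normalize(encoder(segments[:, 1]), dim=1)       # (B, D)
    logits = z1 @ z2.t() / temperature                     # diagonal = positives
    targets = torch.arange(len(z1), device=logits.device)
    return F.cross_entropy(logits, targets)
```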
arXiv Detail & Related papers (2022-04-08T16:27:14Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no additional correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
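The core estimator behind this family of criteria can be sketched in a few lines: approximate the softmax log-partition from classes sampled under a proposal distribution q, so the full softmax is never computed and scores become approximately self-normalized. This is a generic illustration, not the paper's exact criterion.

```python
import math
import torch

def snis_loss(scores, target, sampled_idx, proposal_logprobs):
    """scores: (B, V) unnormalized class scores; target: (B,) gold indices;
    sampled_idx: (K,) classes drawn from a proposal q; proposal_logprobs:
    (K,) their log q values. The log-partition is estimated as
    log((1/K) * sum_k exp(s_k) / q_k)."""
    s_target = scores.gather(1, target.unsqueeze(1)).squeeze(1)   # (B,)
    s_sampled = scores[:, sampled_idx]                            # (B, K)
    log_z = torch.logsumexp(s_sampled - proposal_logprobs, dim=1) \
        - math.log(sampled_idx.numel())
    return (log_z - s_target).mean()    # mean negative log-likelihood
```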
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker
Identity in Dysarthric Voice Conversion [50.040466658605524]
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC).
The poor quality of dysarthric speech can be greatly improved by statistical VC.
However, since the normal speech utterances of a dysarthria patient are nearly impossible to collect, previous work failed to recover the patient's individuality.
arXiv Detail & Related papers (2021-06-02T18:41:03Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
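A minimal sketch of the momentum-contrast ingredient this builds on: a key encoder maintained as an exponential moving average of the query encoder, giving slowly drifting, consistent key embeddings for contrastive training. The prototypical and semi-supervised parts of the paper are omitted; names are illustrative.

```python
import copy
import torch

class MomentumSpeakerEncoder(torch.nn.Module):
    """Query encoder trained by backprop; key encoder updated as an
    exponential moving average of it."""

    def __init__(self, encoder: torch.nn.Module, momentum: float = 0.999):
        super().__init__()
        self.q_enc = encoder
        self.k_enc = copy.deepcopy(encoder)
        for p in self.k_enc.parameters():
            p.requires_grad = False      # keys never receive gradients
        self.m = momentum

    @torch.no_grad()
    def update_key_encoder(self):
        for pq, pk in zip(self.q_enc.parameters(), self.k_enc.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)
```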
arXiv Detail & Related papers (2020-12-13T23:23:39Z)