Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model
- URL: http://arxiv.org/abs/2310.13010v1
- Date: Mon, 16 Oct 2023 21:07:12 GMT
- Title: Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model
- Authors: Hagen Soltau, Izhak Shafran, Alex Ottenwess, Joseph R. JR Duffy, Rene L. Utianski, Leland R. Barnard, John L. Stricker, Daniela Wiepert, David T. Jones, Hugo Botha
- Abstract summary: We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
We combine this sequence with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings.
Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%.
- Score: 4.503292461488901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a Perceiver-based sequence classifier to detect abnormalities in
speech reflective of several neurological disorders. We combine this classifier
with a Universal Speech Model (USM) that is trained (unsupervised) on 12
million hours of diverse audio recordings. Our model compresses long sequences
into a small set of class-specific latent representations and a factorized
projection is used to predict different attributes of the disordered input
speech. The benefit of our approach is that it allows us to model different
regions of the input for different classes and is at the same time data
efficient. We evaluated the proposed model extensively on a curated corpus from
the Mayo Clinic. Our model outperforms standard transformer (80.9%) and
perceiver (81.8%) models and achieves an average accuracy of 83.1%. With
limited task-specific data, we find that pretraining is important and
surprisingly pretraining with the unrelated automatic speech recognition (ASR)
task is also beneficial. Encodings from the middle layers provide a mix of both
acoustic and phonetic information and achieve best prediction results compared
to just using the final layer encodings (83.1% vs. 79.6%). The results are
promising and with further refinements may help clinicians detect speech
abnormalities without needing access to highly specialized speech-language
pathologists.
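The two architectural ideas in the abstract lend themselves to a compact sketch. Below is a minimal, hypothetical PyTorch rendering (all names and layer sizes are assumptions, not the authors' code): per-class latent vectors cross-attend over a long sequence of encoder frames, and a factorized projection produces one logit per class.

```python
# Hedged sketch of a Perceiver-style classifier: class-specific latents
# compress a long frame sequence; a factorized projection scores classes.
import torch
import torch.nn as nn

class PerceiverClassifier(nn.Module):
    def __init__(self, d_model=512, n_classes=4, n_heads=8):
        super().__init__()
        # One learned latent per output class, so each class can attend to
        # different regions of the (long) input sequence.
        self.latents = nn.Parameter(torch.randn(n_classes, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Factorized projection: a shared transform followed by a per-class
        # score vector, instead of one monolithic output layer.
        self.shared = nn.Linear(d_model, d_model)
        self.score = nn.Parameter(torch.randn(n_classes, d_model))

    def forward(self, frames):                            # (batch, time, d_model)
        b = frames.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, n_classes, d)
        z, _ = self.cross_attn(q, frames, frames)         # compress the time axis
        h = torch.tanh(self.shared(z))
        return torch.einsum('bcd,cd->bc', h, self.score)  # one logit per class

model = PerceiverClassifier()
logits = model(torch.randn(2, 3000, 512))                 # 3000 encoder frames
print(logits.shape)                                       # torch.Size([2, 4])
```

Because each class owns its latent query, the attention pattern can differ per class, which is the "different regions of the input for different classes" property the abstract highlights.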
Related papers
- Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature [1.1455937444848385]
We propose a robust set of features derived from a thorough review of contemporary practices in voice pathology detection.
We combine this feature set, containing data from the publicly available Saarbrücken Voice Database (SVD), with preprocessing using the K-Means Synthetic Minority Over-Sampling Technique algorithm.
Our approach achieved state-of-the-art performance, measured by unweighted average recall, in voice pathology detection.
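The rebalancing step named above can be sketched with imbalanced-learn; the feature matrix below is synthetic and the classifier choice is an assumption, not the paper's pipeline.

```python
# Hedged sketch: K-Means SMOTE oversampling on a synthetic, imbalanced
# feature matrix standing in for pitch-difference and related features.
import numpy as np
from imblearn.over_sampling import KMeansSMOTE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))           # placeholder acoustic features
y = np.r_[np.zeros(240), np.ones(60)]    # imbalanced: few pathological samples

# cluster_balance_threshold is a tunable assumption; real feature
# distributions may need a different value.
X_res, y_res = KMeansSMOTE(cluster_balance_threshold=0.1,
                           random_state=0).fit_resample(X, y)
clf = SVC().fit(X_res, y_res)            # any downstream classifier
```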
arXiv Detail & Related papers (2024-10-14T14:17:52Z)
- Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning [2.7706924578324665]
This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data.
We develop a framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification.
Our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of 70.48% and an F1 score of 59.23%.
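The combined objective described in the summary can be made concrete with a hedged sketch (the loss form and weighting are assumptions, not the SALR code): a severity cross-entropy plus a supervised contrastive term that pulls same-severity embeddings together across speakers.

```python
import torch
import torch.nn.functional as F

def multitask_loss(embeddings, logits, severity, alpha=0.5, tau=0.1):
    """Cross-entropy on severity plus a supervised contrastive term.

    Assumes the batch holds at least two utterances of some severity class.
    """
    ce = F.cross_entropy(logits, severity)
    z = F.normalize(embeddings, dim=1)                    # (batch, dim)
    sim = z @ z.t() / tau                                 # cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = (severity.unsqueeze(0) == severity.unsqueeze(1)) & ~eye
    # row-wise log-softmax with self-similarity masked out
    logp = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')),
                                 1, keepdim=True)
    return ce + alpha * (-logp[pos]).mean()

emb, logits = torch.randn(16, 128), torch.randn(16, 4)
sev = torch.randint(0, 4, (16,))
loss = multitask_loss(emb, logits, sev)
```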
arXiv Detail & Related papers (2024-02-29T18:30:52Z)
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
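A generic cross-modal contrastive objective of the kind such self-supervised frameworks build on can be sketched as follows; this is an illustrative InfoNCE loss under assumed embedding shapes, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def av_infonce(audio_emb, video_emb, tau=0.07):
    # Paired audio/imaging embeddings from the same moment are positives;
    # every other pairing in the batch is a negative.
    a = F.normalize(audio_emb, dim=1)                 # (batch, dim)
    v = F.normalize(video_emb, dim=1)                 # (batch, dim)
    logits = a @ v.t() / tau                          # similarity matrix
    target = torch.arange(len(a))                     # matches on the diagonal
    # symmetric loss: audio->video and video->audio retrieval
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))

loss = av_infonce(torch.randn(32, 256), torch.randn(32, 256))
```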
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain considerable redundancy.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
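The de-duplication step mentioned above is simple enough to show directly; the unit IDs below are made up. Subword modeling then merges frequent unit n-grams, much like BPE on text.

```python
# Collapse consecutive repeats of discrete speech units -- the simplest of
# the sequence-compression methods the study compares.
def deduplicate(units):
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

units = [17, 17, 17, 4, 4, 92, 92, 92, 92, 4]
print(deduplicate(units))   # [17, 4, 92, 4] -- length 10 -> 4
```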
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data [55.84746218227712]
This study assesses the relevance of a signal processing algorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency.
arXiv Detail & Related papers (2023-08-09T07:51:40Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggest that incorporating the generated articulatory features consistently outperforms the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
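Note that the reported gains are relative, not absolute; the arithmetic below (with a hypothetical baseline WER) shows the distinction.

```python
# Illustrative numbers, not from the paper: an 8.11% *relative* reduction
# over a baseline WER of 10.0% shrinks it to 9.189%, i.e. only 0.811 points
# absolute.
baseline_wer = 10.0                       # hypothetical baseline, in percent
relative_reduction = 0.0811               # 8.11% relative, as reported
new_wer = baseline_wer * (1 - relative_reduction)
print(f"{new_wer:.3f}% WER, {baseline_wer - new_wer:.3f} points absolute")
```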
arXiv Detail & Related papers (2022-05-06T06:07:09Z)
- Continuous Speech for Improved Learning Pathological Voice Disorders [12.867900671251395]
This study proposes a novel approach, using continuous Mandarin speech instead of a single vowel, to classify four common voice disorders.
In the proposed framework, acoustic signals are transformed into mel-frequency cepstral coefficients, and a bidirectional long short-term memory (BiLSTM) network is adopted to model the sequential features.
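That pipeline is compact enough to sketch; the signal, feature settings, and layer sizes below are placeholders, not the paper's configuration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

# Stand-in signal; a real pipeline would load a recording, e.g. via librosa.load.
sr = 16000
wav = np.random.randn(3 * sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T    # (time, 13)

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_feats=13, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)       # four voice disorders

    def forward(self, x):                                 # (batch, time, feats)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])                         # final step, both dirs

model = BiLSTMClassifier()
logits = model(torch.tensor(mfcc[None], dtype=torch.float32))
```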
arXiv Detail & Related papers (2022-02-22T09:58:31Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no separate correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
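The estimator at the heart of the summary can be sketched in a few lines (the notation and the sketch itself are assumptions drawn from the summary, not the paper's code): sampled importance weights are normalized by their own sum, which is what removes the separate correction step.

```python
import torch

def snis_log_prob(scores, target, proposal, k=256):
    # scores: (vocab,) unnormalized logits; proposal: (vocab,) sampling probs
    samples = torch.multinomial(proposal, k, replacement=True)
    w = torch.exp(scores[samples]) / proposal[samples]  # importance weights
    log_z = torch.log(w.sum() / k)                      # self-normalized Z estimate
    return scores[target] - log_z                       # approximate log p(target)

vocab = 50000
scores = torch.randn(vocab)
proposal = torch.full((vocab,), 1.0 / vocab)            # e.g. a uniform proposal
print(snis_log_prob(scores, torch.tensor(3), proposal))
```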
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- Audio Interval Retrieval using Convolutional Neural Networks [0.0]
This article aims to investigate possible solutions to retrieve sound events based on a natural language query.
We specifically focus on pre-trained YamNet, AlexNet, and ResNet-50 models to automatically classify audio samples.
Results show that the benchmarked models are comparable in terms of performance, with YamNet slightly outperforming the other two models.
arXiv Detail & Related papers (2021-09-21T01:32:18Z)
- Multi-Modal Detection of Alzheimer's Disease from Speech and Text [3.702631194466718]
We propose a deep learning method that utilizes speech and the corresponding transcript simultaneously to detect Alzheimer's disease (AD).
The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the Dementiabank Pitt corpus.
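A minimal late-fusion head of the kind such multi-modal detectors use might look like the following; the dimensions and the concatenation strategy are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d_audio=512, d_text=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_audio + d_text, 256),
            nn.ReLU(),
            nn.Linear(256, 2))                  # AD vs. healthy control

    def forward(self, audio_emb, text_emb):
        # concatenate per-utterance speech and transcript embeddings
        return self.head(torch.cat([audio_emb, text_emb], dim=-1))

model = FusionClassifier()
logits = model(torch.randn(8, 512), torch.randn(8, 768))  # batch of 8
```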
arXiv Detail & Related papers (2020-11-30T21:18:17Z)