Comparing Supervised Models And Learned Speech Representations For
Classifying Intelligibility Of Disordered Speech On Selected Phrases
- URL: http://arxiv.org/abs/2107.03985v1
- Date: Thu, 8 Jul 2021 17:24:25 GMT
- Title: Comparing Supervised Models And Learned Speech Representations For
Classifying Intelligibility Of Disordered Speech On Selected Phrases
- Authors: Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin
Tomanek, Jordan R. Green, Michael P. Brenner
- Abstract summary: We develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases.
We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases.
- Score: 11.3463024120429
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic classification of disordered speech can provide an objective tool
for identifying the presence and severity of speech impairment. Classification
approaches can also help identify hard-to-recognize speech samples to teach ASR
systems about the variable manifestations of impaired speech. Here, we develop
and compare different deep learning techniques to classify the intelligibility
of disordered speech on selected phrases. We collected samples from a diverse
set of 661 speakers with a variety of self-reported disorders speaking 29 words
or phrases, which were rated by speech-language pathologists for their overall
intelligibility using a five-point Likert scale. We then evaluated classifiers
developed using 3 approaches: (1) a convolutional neural network (CNN) trained
for the task, (2) classifiers trained on non-semantic speech representations
from CNNs that used an unsupervised objective [1], and (3) classifiers trained
on the acoustic (encoder) embeddings from an ASR system trained on typical
speech [2]. We found that the ASR encoder's embeddings considerably outperform
the other two on detecting and classifying disordered speech. Further analysis
shows that the ASR embeddings cluster speech by the spoken phrase, while the
non-semantic embeddings cluster speech by speaker. Also, longer phrases are
more indicative of intelligibility deficits than single words.
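To make the setup in approaches (2) and (3) concrete, the shared pipeline amounts to extracting one fixed-length embedding per utterance from a pretrained speech model and fitting a small supervised classifier on the 1-5 intelligibility ratings. The sketch below is a minimal, hypothetical illustration: the `embed_utterance` function, the 128-dimensional feature size, the synthetic data, and the scikit-learn classifier are placeholders, not the paper's actual models or evaluation protocol.

```python
# Minimal sketch (assumptions noted): fit a small classifier on fixed
# per-utterance embeddings, mirroring approaches (2) and (3) above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

def embed_utterance(waveform: np.ndarray) -> np.ndarray:
    """Placeholder embedding. In the paper this would come from a
    non-semantic speech representation model [1] or the encoder of an
    ASR system trained on typical speech [2]; here it is a crude
    fixed-length summary (chunk means) so the sketch runs standalone."""
    chunks = np.array_split(waveform, 128)
    return np.array([c.mean() for c in chunks])

# Hypothetical data: one waveform per utterance plus a 1-5 Likert
# intelligibility rating (in the paper, rated by speech-language
# pathologists); random values stand in for real recordings here.
waveforms = [np.random.randn(16000) for _ in range(200)]
ratings = np.random.randint(1, 6, size=200)

X = np.stack([embed_utterance(w) for w in waveforms])
X_train, X_test, y_train, y_test = train_test_split(
    X, ratings, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("balanced accuracy:",
      balanced_accuracy_score(y_test, clf.predict(X_test)))
```

In practice the placeholder extractor would be replaced by the non-semantic representation model of [1] or the ASR encoder of [2], and evaluation would use the pathologist-rated samples rather than synthetic labels.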
Related papers
- Self-supervised Speech Models for Word-Level Stuttered Speech Detection [66.46810024006712]
We introduce a word-level stuttering speech detection model leveraging self-supervised speech models.
Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection.
arXiv Detail & Related papers (2024-09-16T20:18:20Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Latent Phrase Matching for Dysarthric Speech [23.23672790496787]
Many consumer speech recognition systems are not tuned for people with speech disabilities.
We propose a query-by-example-based personalized phrase recognition system that is trained using small amounts of speech.
Performance degrades as the number of phrases increases, but the system consistently outperforms ASR systems when trained with 50 unique phrases.
arXiv Detail & Related papers (2023-06-08T17:28:28Z)
- Decoding speech perception from non-invasive brain recordings [48.46819575538446]
We introduce a model trained with contrastive-learning to decode self-supervised representations of perceived speech from non-invasive recordings.
Our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
arXiv Detail & Related papers (2022-08-25T10:01:43Z)
- Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition [15.136348385992047]
This study explores the usefulness of using Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech.
We train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model.
Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance.
arXiv Detail & Related papers (2022-04-04T17:36:01Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z)
- Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
arXiv Detail & Related papers (2021-01-30T08:29:55Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)