DDKtor: Automatic Diadochokinetic Speech Analysis
- URL: http://arxiv.org/abs/2206.14639v1
- Date: Wed, 29 Jun 2022 13:34:03 GMT
- Title: DDKtor: Automatic Diadochokinetic Speech Analysis
- Authors: Yael Segal, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela
Roberts and Joseph Keshet
- Abstract summary: This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech.
Results on a dataset of young, healthy individuals show that our LSTM model outperforms current state-of-the-art systems.
The LSTM model also performs comparably to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
- Score: 13.68342426889044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diadochokinetic speech tasks (DDK), in which participants repeatedly produce
syllables, are commonly used as part of the assessment of speech motor
impairments. These studies rely on manual analyses that are time-intensive,
subjective, and provide only a coarse-grained picture of speech. This paper
presents two deep neural network models that automatically segment consonants
and vowels from unannotated, untranscribed speech. Both models work on the raw
waveform and use convolutional layers for feature extraction. The first model
is based on an LSTM classifier followed by fully connected layers, while the
second model adds more convolutional layers followed by fully connected layers.
The segmentations predicted by the models are used to obtain measures of
speech rate and sound duration. Results on a dataset of young, healthy
individuals show that our LSTM model outperforms current state-of-the-art
systems and performs comparably to trained human annotators. Moreover, the
LSTM model also performs comparably to trained human annotators when
evaluated on an unseen dataset of older individuals with Parkinson's Disease.
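The abstract describes two raw-waveform segmentation models (convolutional feature extraction, then either an LSTM plus fully connected classifier or further convolutional layers plus fully connected layers), whose per-frame predictions are turned into speech rate and sound duration measures. Below is a minimal PyTorch-style sketch of the LSTM variant and of deriving DDK measures from predicted frame labels; all class and function names, layer sizes, frame rates, and the three-class (silence/consonant/vowel) labeling are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's code): a raw-waveform conv + LSTM + FC
# frame classifier, plus a helper that turns per-frame labels into DDK measures.
import torch
import torch.nn as nn

class ConvLSTMSegmenter(nn.Module):
    """Assumed architecture: strided convs over the raw waveform, a BiLSTM,
    then fully connected layers predicting silence / consonant / vowel per frame."""
    def __init__(self, n_classes: int = 3, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(  # strided convs downsample the waveform into frames
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) raw waveform
        feats = self.encoder(wav.unsqueeze(1))   # (batch, channels, frames)
        feats = feats.transpose(1, 2)            # (batch, frames, channels)
        out, _ = self.lstm(feats)
        return self.classifier(out)              # (batch, frames, n_classes) logits


def ddk_measures(frame_labels, frame_sec, vowel=2, consonant=1):
    """Collapse per-frame class ids into segments and derive DDK measures.
    `frame_sec` is the assumed frame hop in seconds."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], (i - start) * frame_sec))
            start = i
    vowels = [d for lab, d in segments if lab == vowel]
    consonants = [d for lab, d in segments if lab == consonant]
    total_sec = len(frame_labels) * frame_sec
    return {
        # one vowel per syllable is an assumption used here as a syllable proxy
        "syllables_per_sec": len(vowels) / total_sec if total_sec else 0.0,
        "mean_vowel_sec": sum(vowels) / len(vowels) if vowels else 0.0,
        "mean_consonant_sec": sum(consonants) / len(consonants) if consonants else 0.0,
    }
```

The second model described in the abstract would replace the LSTM with additional convolutional layers before the fully connected classifier.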
Related papers
- Phonetic and Prosody-aware Self-supervised Learning Approach for
Non-native Fluency Scoring [13.817385516193445]
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features.
Deep neural networks are commonly trained to map fluency-related features into the human scores.
We introduce a self-supervised learning (SSL) approach that takes into account phonetic and prosody awareness for fluency scoring.
arXiv Detail & Related papers (2023-05-19T05:39:41Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition [11.489161072526677]
We investigate the robustness properties of pre-trained neural models for automatic speech recognition.
Specifically, we analyze the robustness of the pre-trained models wav2vec2, HuBERT and DistilHuBERT on the LibriSpeech and TIMIT datasets.
arXiv Detail & Related papers (2022-08-17T20:00:54Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers (a rough sketch of this idea appears after this list).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling [16.43844160498413]
Several recent papers have proposed deep-learning-based assessment models.
We propose three models using cluster-based modeling methods.
We show that the GQT layer helps to predict human assessment better by automatically learning the task.
arXiv Detail & Related papers (2020-08-09T11:14:19Z)
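The ILS-SSL entry above describes adding an additional SSL loss on intermediate encoder layers. As a rough, generic illustration (not the ILS-SSL authors' implementation), the sketch below applies the same pseudo-label prediction loss at chosen intermediate layers and at the top layer; the class name, layer indices, and loss choice are assumptions.

```python
# Generic sketch of intermediate-layer supervision: the SSL loss is applied
# both at the top layer and at selected intermediate layers of the encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLayerSSL(nn.Module):
    def __init__(self, encoder_layers: nn.ModuleList, dim: int, n_targets: int,
                 supervised_layers=(3, 6)):
        super().__init__()
        self.layers = encoder_layers  # e.g. a stack of transformer blocks
        # one prediction head per supervised depth, plus one at the top layer
        self.heads = nn.ModuleDict({
            str(i): nn.Linear(dim, n_targets)
            for i in (*supervised_layers, len(encoder_layers) - 1)
        })

    def forward(self, x: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame features; targets: (batch, frames) pseudo-label ids
        loss = x.new_zeros(())
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if str(i) in self.heads:  # add the SSL loss at this depth
                logits = self.heads[str(i)](x)
                loss = loss + F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        return loss
```

With, for example, a stack of `nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)` blocks as `encoder_layers`, the same pseudo-label loss would be computed at layers 3, 6, and the final layer.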