Advancing Stuttering Detection via Data Augmentation, Class-Balanced
Loss and Multi-Contextual Deep Learning
- URL: http://arxiv.org/abs/2302.11343v1
- Date: Tue, 21 Feb 2023 14:03:47 GMT
- Title: Advancing Stuttering Detection via Data Augmentation, Class-Balanced
Loss and Multi-Contextual Deep Learning
- Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
- Abstract summary: Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances and core behaviors.
In this paper, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme to tackle data scarcity.
In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech.
- Score: 7.42741711946564
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances (interjections) and core behaviors (blocks, repetitions, and prolongations), and is caused by a breakdown of speech sensorimotor control. Due to its complex nature, stuttering detection (SD) is a difficult task. If detected at an early stage, it could help speech therapists observe and rectify the speech patterns of persons who stutter (PWS). The stuttered speech of PWS is usually available in limited amounts and is highly imbalanced. To this end, we address the class imbalance problem in the SD domain via a multi-branched (MB) scheme and by weighting the contribution of classes in the overall loss function, resulting in a substantial improvement on the stuttering classes of the SEP-28k dataset over the baseline (StutterNet). To tackle data scarcity, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme. The augmented training outperforms the MB StutterNet (clean) by a relative margin of 4.18% in macro F1-score (F1). In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech, resulting in an overall improvement of 4.48% in F1 over the single-context MB StutterNet. Finally, we show that applying data augmentation in the cross-corpora scenario can improve overall SD performance by a relative margin of 13.23% in F1 over clean training.
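The class-balanced loss described in the abstract amounts to weighting each class's contribution to the cross-entropy objective. Below is a minimal PyTorch sketch assuming inverse-frequency weights; the class counts are hypothetical and the paper's exact weighting scheme may differ.

```python
import torch
import torch.nn as nn

# Hypothetical per-class sample counts for an imbalanced stuttering dataset
# (fluent, repetition, prolongation, block, interjection) -- illustrative only.
class_counts = torch.tensor([4000.0, 600.0, 350.0, 250.0, 500.0])

# Inverse-frequency weights, normalized so they average to 1.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)            # batch of 8 utterances, 5 classes
targets = torch.randint(0, 5, (8,))   # ground-truth class indices
loss = criterion(logits, targets)     # minority classes contribute more per sample
```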
Related papers
- MMSD-Net: Towards Multi-modal Stuttering Detection [9.257985820122999]
MMSD-Net is the first multi-modal neural framework for stuttering detection.
Our model yields an improvement of 2-17% in the F1-score over existing state-of-the-art uni-modal approaches.
arXiv Detail & Related papers (2024-07-16T08:26:59Z)
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
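As a rough illustration of that recipe, here is a minimal fine-tuning step, assuming a recent HuggingFace transformers version; the checkpoint name, audio, and transcript are placeholders.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # keep the convolutional front end fixed

# Placeholder batch: one second of silence at 16 kHz plus a dummy transcript.
audio = torch.zeros(16000).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()  # one gradient step; optimizer and data loop omitted
```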
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data [55.84746218227712]
This study aims at assessing the relevance of a signal-processing algorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency.
arXiv Detail & Related papers (2023-08-09T07:51:40Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well at removing stuttering disfluencies, introduced by speech impairments, from ASR transcripts.
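Such synthetic disfluent data can be produced with even a naive generator; the sketch below is illustrative only (not the paper's method) and injects word repetitions and filler words into clean token sequences.

```python
import random

def inject_disfluencies(tokens, p_repeat=0.1, p_filler=0.05,
                        fillers=("uh", "um")):
    # Randomly duplicate words and insert filler words to simulate
    # repetition and interjection disfluencies in clean text.
    out = []
    for tok in tokens:
        if random.random() < p_filler:
            out.append(random.choice(fillers))
        out.append(tok)
        if random.random() < p_repeat:
            out.append(tok)
    return out

print(inject_disfluencies("i want to book a ticket".split()))
```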
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one avenue for safeguarding user information used to train deep models, by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform a first experimental study on ASR to avoid acoustic data leakage.
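For context, the core of PATE is noisy-max aggregation of teacher votes; here is a minimal sketch with Laplace noise, as in the original PATE formulation (the paper's ASR extension is more involved).

```python
import numpy as np

def pate_aggregate(teacher_votes, num_classes, noise_scale=1.0, rng=None):
    # Count each teacher's predicted label, add Laplace noise to the
    # per-class counts, and release only the noisy argmax.
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(0.0, noise_scale, size=num_classes)
    return int(np.argmax(counts))

# e.g. 10 teachers voting over 4 classes
print(pate_aggregate(np.array([2, 2, 1, 2, 0, 2, 2, 3, 2, 1]), num_classes=4))
```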
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- Overlapping Word Removal is All You Need: Revisiting Data Imbalance in Hope Speech Detection [2.8341970739919433]
We introduce focal loss, data augmentation, and pre-processing strategies for hope speech detection.
We find that introducing focal loss mitigates the effect of class imbalance and improves overall F1-Macro by 0.11.
We also show that overlapping-word-removal pre-processing, though simple, improves F1-Macro by 0.28.
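For reference, the focal loss mentioned above down-weights examples the model already classifies confidently; a minimal multi-class PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Scale per-example cross-entropy by (1 - p_t)^gamma, where p_t is
    # the probability the model assigns to the true class.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)
    return ((1.0 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(4, 3), torch.tensor([0, 2, 1, 1]))
```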
arXiv Detail & Related papers (2022-04-12T02:38:54Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
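Speed perturbation itself is straightforward to reproduce: treat the waveform as if it were recorded at a scaled sample rate, then resample it back. A sketch with torchaudio (illustrative; not the paper's exact pipeline):

```python
import torch
import torchaudio.functional as AF

def speed_perturb(waveform, sample_rate, factor):
    # Pretend the signal was recorded at sample_rate * factor, then
    # resample to sample_rate; factor < 1 slows speech, > 1 speeds it up.
    return AF.resample(waveform,
                       orig_freq=int(round(sample_rate * factor)),
                       new_freq=sample_rate)

wav = torch.randn(1, 16000)               # placeholder 1 s mono clip at 16 kHz
slower = speed_perturb(wav, 16000, 0.9)   # ~11% longer
faster = speed_perturb(wav, 16000, 1.1)   # ~9% shorter
```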
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition [1.8702587873591643]
We investigate data selection and preparation choices aiming for improved robustness of RNN-T ASR to speech disfluencies.
We show that including a small amount of disfluent data in the training set improves recognition accuracy on tests with disfluencies and stuttering.
arXiv Detail & Related papers (2020-12-11T11:47:13Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe that capturing global information with self-attention is crucial for speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets (Adult, and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective yields improved micro-F1 scores, with absolute improvements of up to 8% over models trained with the cross-entropy loss function.
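One common differentiable relaxation of the F-measure replaces hard prediction counts with probability masses (the paper's exact approximation may differ); a minimal multi-label sketch:

```python
import torch

def soft_f1_loss(logits, targets, eps=1e-8):
    # Differentiable micro-F1: TP/FP/FN are computed from predicted
    # probabilities rather than thresholded decisions; `targets` is a
    # {0,1} tensor with the same shape as `logits`.
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()
    fp = (probs * (1.0 - targets)).sum()
    fn = ((1.0 - probs) * targets).sum()
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1  # minimizing 1 - F1 maximizes the soft F1

loss = soft_f1_loss(torch.randn(4, 3), torch.randint(0, 2, (4, 3)).float())
```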
arXiv Detail & Related papers (2020-08-08T03:02:27Z)