MMSD-Net: Towards Multi-modal Stuttering Detection
- URL: http://arxiv.org/abs/2407.11492v1
- Date: Tue, 16 Jul 2024 08:26:59 GMT
- Title: MMSD-Net: Towards Multi-modal Stuttering Detection
- Authors: Liangyu Nie, Sudarsana Reddy Kadiri, Ruchit Agrawal
- Abstract summary: MMSD-Net is the first multi-modal neural framework for stuttering detection.
Our model yields an improvement of 2-17% in the F1-score over existing state-of-the-art uni-modal approaches.
- Score: 9.257985820122999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stuttering is a common speech impediment that is caused by irregular disruptions in speech production, affecting over 70 million people across the world. Standard automatic speech processing tools do not take speech ailments into account and are thereby not able to generate meaningful results when presented with stuttered speech as input. The automatic detection of stuttering is an integral step towards building efficient, context-aware speech processing systems. While previous approaches explore both statistical and neural approaches for stuttering detection, all of these methods are uni-modal in nature. This paper presents MMSD-Net, the first multi-modal neural framework for stuttering detection. Experiments and results demonstrate that incorporating the visual signal significantly aids stuttering detection, and our model yields an improvement of 2-17% in the F1-score over existing state-of-the-art uni-modal approaches.
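The abstract does not spell out the fusion architecture; as a rough illustration of the multi-modal idea, the PyTorch sketch below encodes audio and visual feature streams separately and fuses them by concatenation before a shared classification head. All module choices, dimensions, and names here are assumptions for illustration, not MMSD-Net's published design.

```python
import torch
import torch.nn as nn

class LateFusionStutterClassifier(nn.Module):
    """Illustrative audio-visual late-fusion classifier (not MMSD-Net itself).

    Assumes pre-extracted feature sequences:
      audio: (batch, T_a, 80)  e.g. log-mel frames
      video: (batch, T_v, 512) e.g. lip-region CNN embeddings
    """

    def __init__(self, audio_dim=80, video_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, audio, video):
        # Encode each modality, then mean-pool over its own time axis.
        a, _ = self.audio_enc(audio)   # (B, T_a, 2*hidden)
        v, _ = self.video_enc(video)   # (B, T_v, 2*hidden)
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.head(fused)        # (B, n_classes) logits

model = LateFusionStutterClassifier()
logits = model(torch.randn(4, 300, 80), torch.randn(4, 75, 512))
```

Late fusion of pooled per-modality encodings is one of the simplest ways to let a visual lip-movement signal complement audio evidence of blocks and repetitions; attention-based fusion is a common alternative.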
Related papers
- Self-supervised Speech Models for Word-Level Stuttered Speech Detection [66.46810024006712]
We introduce a word-level stuttering speech detection model leveraging self-supervised speech models.
Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection.
arXiv Detail & Related papers (2024-09-16T20:18:20Z)
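The entry above describes word-level detection on top of self-supervised speech features. A common recipe, sketched below under the assumption that word boundaries are available (e.g. from forced alignment), is to mean-pool frame-level features within each word span and classify every pooled vector; the 768-dimensional feature size and the linear classifier are illustrative stand-ins, not the paper's model.

```python
import torch

def word_level_logits(frame_feats, word_spans, classifier):
    """Pool frame-level SSL features into word vectors, then classify each word.

    frame_feats: (T, D) features from a self-supervised speech model
    word_spans:  list of (start_frame, end_frame) per word, e.g. from forced alignment
    classifier:  any module mapping (D,) -> per-word stutter logits
    """
    word_vecs = torch.stack([frame_feats[s:e].mean(dim=0) for s, e in word_spans])
    return classifier(word_vecs)   # (num_words, n_classes)

clf = torch.nn.Linear(768, 2)      # 768 = wav2vec 2.0 base feature dim
feats = torch.randn(200, 768)      # stand-in for SSL frame features
print(word_level_logits(feats, [(0, 40), (40, 95), (95, 200)], clf).shape)  # (3, 2)
```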
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection [46.855958156126164]
This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset.
arXiv Detail & Related papers (2024-06-11T13:35:50Z)
- Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning [7.42741711946564]
Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances and core behaviors.
In this paper, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme to tackle data scarcity.
In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech.
arXiv Detail & Related papers (2023-02-21T14:03:47Z)
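For the class-balanced loss mentioned in the entry above, one widely used formulation re-weights cross-entropy by the effective number of samples per class (Cui et al., 2019); naming that specific weighting is an assumption here, as the paper may use a different scheme.

```python
import torch
import torch.nn.functional as F

def class_balanced_ce(logits, targets, samples_per_class, beta=0.999):
    """Cross-entropy weighted by the 'effective number of samples' per class:
    w_c = (1 - beta) / (1 - beta ** n_c), normalized to sum to n_classes.
    Rare dysfluency classes get larger weights than over-represented fluent speech.
    """
    n = torch.as_tensor(samples_per_class, dtype=torch.float32)
    weights = (1.0 - beta) / (1.0 - torch.pow(torch.tensor(beta), n))
    weights = weights * len(samples_per_class) / weights.sum()
    return F.cross_entropy(logits, targets, weight=weights)

# Toy usage: 5 classes with a heavy imbalance toward class 0 ("fluent").
loss = class_balanced_ce(torch.randn(8, 5), torch.randint(0, 5, (8,)),
                         samples_per_class=[9000, 400, 300, 200, 100])
```

With beta close to 1 the weights approach inverse class frequency, counteracting the dominance of fluent speech over rare stutter types.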
- Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech [20.2646788350211]
Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases.
We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances.
arXiv Detail & Related papers (2022-11-04T23:45:31Z)
- Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0 [0.22940141855172028]
Fine-tuning wav2vec 2.0 for the classification of stuttering on a sizeable English corpus boosts the effectiveness of the general-purpose features.
We evaluate our method on Fluencybank and the German therapy-centric Kassel State of Fluency dataset.
arXiv Detail & Related papers (2022-04-07T13:02:12Z)
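For the wav2vec 2.0 entry above, a minimal fine-tuning sketch with the Hugging Face transformers API; the checkpoint name, binary label set, and single gradient step are placeholder assumptions rather than the paper's training recipe.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Placeholder checkpoint; the paper fine-tunes wav2vec 2.0 on English stuttered speech.
ckpt = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForSequenceClassification.from_pretrained(ckpt, num_labels=2)

wav = torch.randn(16000 * 3)   # stand-in for a 3 s, 16 kHz clip
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([1])     # 1 = dysfluent, 0 = fluent (illustrative labels)

out = model(input_values=inputs.input_values, labels=labels)
out.loss.backward()            # an optimizer.step() would follow in a real loop
```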
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
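Speed perturbation, the best-performing augmentation in the entry above, is conventionally applied by changing playback speed and resampling back to the original rate. The torchaudio sketch below (which assumes a sox-enabled torchaudio build) uses the usual Kaldi-style factors 0.9/1.0/1.1, an assumption rather than the paper's reported settings.

```python
import torch
import torchaudio

def speed_perturb(wav: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Kaldi-style speed perturbation: change speed (and hence duration and
    pitch) by `factor`, then resample back to the original sample rate."""
    effects = [["speed", str(factor)], ["rate", str(sr)]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(wav, sr, effects)
    return out

wav, sr = torch.randn(1, 16000 * 2), 16000   # stand-in for a 2 s clip
augmented = [speed_perturb(wav, sr, f) for f in (0.9, 1.0, 1.1)]
```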
- StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
arXiv Detail & Related papers (2021-05-12T11:36:01Z)
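A TDNN of the kind StutterNet builds on amounts to stacked dilated 1-D convolutions over acoustic frames followed by statistics pooling. The sketch below shows that structure with made-up layer sizes and class count, not the published StutterNet configuration.

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    """Illustrative TDNN: dilated Conv1d layers capture widening temporal
    context; mean+std statistics pooling summarizes the whole utterance."""

    def __init__(self, feat_dim=40, n_classes=5):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(2 * 256, n_classes)

    def forward(self, x):   # x: (B, T, feat_dim) acoustic frames
        h = self.frame_layers(x.transpose(1, 2))        # (B, 256, T')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.classifier(stats)                   # (B, n_classes)

logits = TinyTDNN()(torch.randn(2, 300, 40))
```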
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.