StutterNet: Stuttering Detection Using Time Delay Neural Network
- URL: http://arxiv.org/abs/2105.05599v1
- Date: Wed, 12 May 2021 11:36:01 GMT
- Title: StutterNet: Stuttering Detection Using Time Delay Neural Network
- Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
- Abstract summary: This paper introduces StutterNet, a novel deep learning based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
- Score: 9.726119468893721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces StutterNet, a novel deep learning based stuttering
detection system capable of detecting and identifying various types of disfluencies.
Most of the existing work in this domain uses automatic speech recognition
(ASR) combined with language models for stuttering detection. Compared to the
existing work, which depends on the ASR module, our method relies solely on the
acoustic signal. We use a time-delay neural network (TDNN) suitable for
capturing contextual aspects of the disfluent utterances. We evaluate our
system on the UCLASS stuttering dataset consisting of more than 100 speakers.
Our method achieves promising results and outperforms the state-of-the-art
residual neural network based method. The number of trainable parameters of the
proposed method is also substantially smaller due to the parameter sharing scheme
of the TDNN.
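The parameter sharing mentioned in the abstract can be illustrated with a minimal sketch of a single time-delay layer. This is not the StutterNet architecture: the function names `tdnn_layer` and `count_parameters`, the context offsets, the feature dimension of 20, and the output size of 8 are all hypothetical values chosen for illustration. The sketch only shows that one weight matrix is reused at every time step, so the parameter count does not depend on the utterance length.

```python
import random


def tdnn_layer(frames, weights, bias, context):
    """Apply one time-delay layer (with ReLU) to a sequence of feature frames.

    frames:  list of T frames, each a list of in_dim floats
    weights: out_dim rows, each of length len(context) * in_dim
    bias:    list of out_dim floats
    context: frame offsets spliced together at each step, e.g. [-1, 0, 1]
    """
    lo, hi = min(context), max(context)
    outputs = []
    # Only visit time steps whose full context window lies inside the input.
    for t in range(-lo, len(frames) - hi):
        spliced = [x for off in context for x in frames[t + off]]
        # The same weights and bias are reused at every t (parameter sharing).
        out = [max(0.0, sum(w * x for w, x in zip(row, spliced)) + b)
               for row, b in zip(weights, bias)]
        outputs.append(out)
    return outputs


def count_parameters(in_dim, out_dim, context):
    """Weights plus biases of one layer; independent of the number of frames."""
    return out_dim * len(context) * in_dim + out_dim


# Hypothetical sizes, not taken from the paper.
random.seed(0)
in_dim, out_dim, context = 20, 8, [-2, -1, 0, 1, 2]
weights = [[random.uniform(-0.1, 0.1) for _ in range(len(context) * in_dim)]
           for _ in range(out_dim)]
bias = [0.0] * out_dim

short = [[random.random() for _ in range(in_dim)] for _ in range(50)]
long_utt = [[random.random() for _ in range(in_dim)] for _ in range(300)]

print(count_parameters(in_dim, out_dim, context))  # 808 for both utterances
print(len(tdnn_layer(short, weights, bias, context)),
      len(tdnn_layer(long_utt, weights, bias, context)))  # 46 296
```

Both the 50-frame and the 300-frame input are processed with the same 808 parameters; only the number of output frames changes.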
Related papers
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection [5.42845980208244]
YOLO-Stutter is a first end-to-end method that detects dysfluencies in a time-accurate manner.
VCTK-Stutter and VCTK-TTS simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation.
arXiv Detail & Related papers (2024-08-27T11:31:12Z)
- Histogram Layer Time Delay Neural Networks for Passive Sonar Classification [58.720142291102135]
A novel method combines a time delay neural network and histogram layer to incorporate statistical contexts for improved feature learning and underwater acoustic target classification.
The proposed method outperforms the baseline model, demonstrating the utility of incorporating statistical contexts for passive sonar target recognition.
arXiv Detail & Related papers (2023-07-25T19:47:26Z)
- Adaptive Axonal Delays in feedforward spiking neural networks for accurate spoken word recognition [4.018601183900039]
Spiking neural networks (SNNs) are a promising research avenue for building accurate and efficient automatic speech recognition systems.
Recent advances in audio-to-spike encoding and training algorithms enable SNNs to be applied to practical tasks.
Our work illustrates the potential of training axonal delays for tasks with complex temporal structures.
arXiv Detail & Related papers (2023-02-16T22:19:04Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique is able to reduce the WER by more than 6% relative and the average last token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Blind Speech Separation and Dereverberation using Neural Beamforming [28.7807578839021]
We present the Blind Speech Separation and Dereverberation (BSSD) network, which performs simultaneous speaker separation, dereverberation and speaker identification in a single neural network.
Speaker separation is guided by a set of predefined spatial cues. Dereverberation is performed by using neural beamforming, and speaker identification is aided by embedding vectors and triplet mining.
arXiv Detail & Related papers (2021-03-24T18:43:52Z)
- End to End ASR System with Automatic Punctuation Insertion [0.0]
We propose a method to generate punctuated transcripts for the TEDLIUM dataset using transcripts available from ted.com.
We also propose an end-to-end ASR system that outputs words and punctuation marks concurrently from speech signals.
arXiv Detail & Related papers (2020-12-03T15:46:43Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.