FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning
- URL: http://arxiv.org/abs/2009.11394v1
- Date: Wed, 23 Sep 2020 21:51:29 GMT
- Title: FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning
- Authors: Tedd Kourkounakis, Amirhossein Hajavi, Ali Etemad
- Abstract summary: We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network which facilitates the learning of strong spectral frame-level representations.
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
- Score: 23.13972240042859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Strong presentation skills are valuable and sought-after in workplace and
classroom environments alike. Among the possible improvements to vocal
presentations, disfluencies and stutters in particular remain some of the most
common and prominent impediments to a speaker's delivery. Millions of people are
affected by stuttering and other speech disfluencies, with the majority of the
world having experienced mild stutters while communicating under stressful
conditions. While there has been much research in the fields of automatic
speech recognition and language modeling, there is a lack of comparable work
on disfluency detection and recognition. To this end, we propose an
end-to-end deep neural network, FluentNet, capable of detecting a number of
different disfluency types. FluentNet consists of a Squeeze-and-Excitation
Residual convolutional neural network which facilitates the learning of strong
spectral frame-level representations, followed by a set of bidirectional long
short-term memory layers that aid in learning effective temporal relationships.
Lastly, FluentNet uses an attention mechanism to focus on the important parts
of speech and thereby obtain better performance. We perform a number of different
experiments, comparisons, and ablation studies to evaluate our model. Our model
achieves state-of-the-art results by outperforming other solutions in the field
on the publicly available UCLASS dataset. Additionally, we present
LibriStutter: a disfluency dataset based on the public LibriSpeech dataset with
synthesized stutters. We also evaluate FluentNet on this dataset, showing the
strong performance of our model versus a number of benchmark techniques.
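
The abstract describes a three-stage pipeline: an SE-ResNet encoder that extracts frame-level spectral representations, bidirectional LSTM layers for temporal modelling, and an attention mechanism that pools the salient time steps before classification. Below is a minimal PyTorch sketch of such a pipeline; the block counts, channel widths, spectrogram resolution, and two-class head are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a FluentNet-style pipeline, based only on the abstract:
# SE-ResNet blocks over spectral frames -> BiLSTM -> attention pooling.
# All layer sizes and block counts are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by global context."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                       # x: (B, C, F, T)
        s = x.mean(dim=(2, 3))                  # squeeze: global average pool
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * s[:, :, None, None]          # excite: per-channel scaling

class SEResBlock(nn.Module):
    """Residual conv block with an SE unit on the residual branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.se = SEBlock(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.se(self.bn2(self.conv2(h)))
        return F.relu(h + self.skip(x))

class FluentNetSketch(nn.Module):
    def __init__(self, n_mels=128, hidden=256, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            SEResBlock(1, 32), nn.MaxPool2d((2, 1)),   # pool frequency, keep time
            SEResBlock(32, 64), nn.MaxPool2d((2, 1)),
        )
        self.blstm = nn.LSTM(64 * (n_mels // 4), hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)    # additive attention scores
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                    # spec: (B, 1, n_mels, T)
        h = self.encoder(spec)                  # (B, C, F', T)
        B, C, Fr, T = h.shape
        h = h.permute(0, 3, 1, 2).reshape(B, T, C * Fr)  # frame-level features
        h, _ = self.blstm(h)                    # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attend over time steps
        ctx = (w * h).sum(dim=1)                # weighted temporal pooling
        return self.head(ctx)                   # disfluent vs. fluent logits

logits = FluentNetSketch()(torch.randn(2, 1, 128, 300))  # e.g. 300 spectral frames
```

The SE unit sits on the residual branch so that channel reweighting refines each block's features before the skip connection, which is the standard SE-ResNet arrangement.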
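The paper also introduces LibriStutter by adding synthesized stutters to LibriSpeech utterances. The generation procedure is not detailed in this abstract; as a loose illustration only, a repetition-type stutter can be approximated by duplicating a short word-initial segment of a clean waveform. The function name and all parameters below (onset, segment length, gap duration) are hypothetical.

```python
# Hypothetical illustration of synthesizing a repetition-type stutter by
# duplicating a short segment of a clean waveform. This is an assumption
# about the general idea, not LibriStutter's actual generation procedure.
import numpy as np

def synthesize_repetition(wave: np.ndarray, sr: int, onset_s: float,
                          seg_s: float = 0.15, n_repeats: int = 2) -> np.ndarray:
    """Repeat a short segment starting at `onset_s`, with brief pauses."""
    start = int(onset_s * sr)
    seg = wave[start:start + int(seg_s * sr)]          # e.g. a word-initial sound
    gap = np.zeros(int(0.08 * sr), dtype=wave.dtype)   # short pause between repeats
    stutter = np.concatenate([np.concatenate([seg, gap])] * n_repeats)
    return np.concatenate([wave[:start], stutter, wave[start:]])

# e.g.: stuttered = synthesize_repetition(clean_wave, sr=16000, onset_s=1.2)
```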
Related papers
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z) - Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - SAFL: A Self-Attention Scene Text Recognizer with Focal Loss [4.462730814123762]
Scene text recognition remains challenging due to inherent problems such as distortions or irregular layout.
Most existing approaches leverage recurrence- or convolution-based neural networks.
We introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition.
arXiv Detail & Related papers (2022-01-01T06:51:03Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - Knowing What to Listen to: Early Attention for Deep Speech
Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Frequency Attention (FEFA) mechanism for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z) - MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for
Voice Activity Detection [30.46050153776374]
Voice activity detection (VAD) makes a distinction between speech and non-speech.
Deep neural network (DNN)-based VADs have achieved better performance than conventional signal processing methods.
This paper proposes an adaptive multiple receptive-field attention neural network, called MLNET, to perform the VAD task.
arXiv Detail & Related papers (2020-08-13T02:24:28Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)