YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
- URL: http://arxiv.org/abs/2408.15297v3
- Date: Sun, 15 Sep 2024 06:20:19 GMT
- Title: YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
- Authors: Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Jiachen Lian, Gopala Krishna Anumanchipalli
- Abstract summary: YOLO-Stutter is the first end-to-end method that detects dysfluencies in a time-accurate manner.
VCTK-Stutter and VCTK-TTS simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation.
- Score: 5.42845980208244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose YOLO-Stutter: the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter
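The abstract describes a three-stage pipeline: imperfect speech-text alignment features feed a spatial feature aggregator, then a temporal dependency extractor, and finally a region-wise head emits (start, end, class) predictions. The following is a minimal toy sketch of that data flow only; the shapes, the averaging stand-ins for the aggregator/extractor, and the thresholded region grouping are illustrative placeholders, not the authors' actual layers.

```python
# Toy data-flow sketch: alignment features -> spatial aggregation ->
# temporal smoothing -> region-wise (start, end, class) predictions.
from statistics import mean

DYSFLUENCY_CLASSES = ["repetition", "block", "missing", "replacement", "prolongation"]

def spatial_aggregate(alignment_feats):
    """Collapse each per-frame feature vector to one scalar (placeholder op)."""
    return [mean(frame) for frame in alignment_feats]

def temporal_extract(frame_scores, context=1):
    """Smooth each frame with its neighbours, a stand-in for a recurrent/attention layer."""
    out = []
    for t in range(len(frame_scores)):
        lo, hi = max(0, t - context), min(len(frame_scores), t + context + 1)
        out.append(mean(frame_scores[lo:hi]))
    return out

def predict_regions(smoothed, threshold=0.5):
    """Group consecutive above-threshold frames into (start, end, class) regions."""
    regions, start = [], None
    for t, s in enumerate(smoothed):
        if s > threshold and start is None:
            start = t
        elif s <= threshold and start is not None:
            regions.append((start, t, "repetition"))  # class head is a placeholder
            start = None
    if start is not None:
        regions.append((start, len(smoothed), "repetition"))
    return regions

# Example: 6 frames of 2-d alignment features; frames 2-4 look anomalous.
feats = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.1, 0.1]]
regions = predict_regions(temporal_extract(spatial_aggregate(feats)))
```

Here the anomalous middle frames are grouped into a single predicted region with a start frame, an end frame, and a dysfluency class, which is the "region-wise" output format the abstract refers to.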
Related papers
- Augmenting Automatic Speech Recognition Models with Disfluency Detection [12.45703869323415]
Speech disfluency commonly occurs in conversational and spontaneous speech.
Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech.
We present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies.
arXiv Detail & Related papers (2024-09-16T11:13:14Z)
- Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection [4.126904442587873]
Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription.
It introduces three corpora, VCTK-Pro, VCTK-Art, and AISHELL3-Pro, that simulate natural spoken dysfluencies.
Our approach achieves state-of-the-art performance on all available dysfluency corpora.
arXiv Detail & Related papers (2024-09-15T06:11:00Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
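The pretext task summarised above trains the model to reconstruct de-warped mel-spectrograms from warped ones. The toy function below illustrates only the warping side of that setup, using frame duplication as the distortion; the paper's actual warping scheme and model are not reproduced here, and the frame values are made up.

```python
# Toy time-warp for the reconstruction pretext task: stretch a
# "mel-spectrogram" (here just a list of frames) by duplicating frames.
# A pre-training step would feed the warped sequence to the model and use
# the original frames as the reconstruction target.

def time_warp(frames, duplicate_at):
    """Stretch the sequence by duplicating the frames at the given indices."""
    warped = []
    for i, frame in enumerate(frames):
        warped.append(frame)
        if i in duplicate_at:
            warped.append(frame)  # local stretching
    return warped

mel = [[0.1], [0.4], [0.9], [0.3]]       # 4 single-bin frames (illustrative)
warped = time_warp(mel, duplicate_at={1, 2})
```

Because the warp only changes timing, the target (`mel`) is recoverable from the input (`warped`), which is what makes the task usable on untranscribed speech.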
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
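The "repeatedly masks and predicts unit choices" summary describes iterative mask-predict decoding over discrete units. The loop below is a minimal sketch of that schedule with a fixed toy predictor (it always proposes the correct unit with a made-up confidence); a real model would re-score all masked positions from context at every iteration.

```python
# Minimal mask-predict loop over discrete speech units: start fully masked,
# commit the most confident predictions each round, re-mask the rest.
MASK = -1

def dummy_predict(units, target):
    """Propose (unit, confidence) for every masked position (toy stand-in)."""
    return {i: (target[i], 1.0 - 0.1 * i) for i, u in enumerate(units) if u == MASK}

def mask_predict(target, iterations=2):
    units = [MASK] * len(target)
    for step in range(iterations, 0, -1):
        proposals = dummy_predict(units, target)
        if not proposals:
            break
        # Keep only the most confident fraction this round; the rest stay masked.
        keep = sorted(proposals, key=lambda i: -proposals[i][1])[: max(1, len(proposals) // step)]
        for i in keep:
            units[i] = proposals[i][0]
    # Final pass fills anything still masked.
    for i, (u, _) in dummy_predict(units, target).items():
        units[i] = u
    return units

decoded = mask_predict([7, 3, 9, 1])
```

The latency benefit comes from predicting all masked positions in parallel each round, so the number of model calls is the (small, fixed) iteration count rather than the sequence length.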
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
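The core operation behind the quantized latent embedding described above is nearest-codebook lookup: each continuous latent vector is snapped to its closest codebook entry. The sketch below shows only that lookup; the codebook contents, vector dimensionality, and squared-L2 metric are illustrative choices, not details from the paper.

```python
# Bare-bones vector quantization: map each continuous latent vector to the
# index of its nearest codebook entry (squared Euclidean distance).

def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # 3 illustrative entries
latents = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8]]
codes = [quantize(v, codebook) for v in latents]
```

Replacing continuous latents with a finite set of indices is what lets the training policies mentioned above shape the latent space into a discrete linguistic embedding.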
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- StutterNet: Stuttering Detection Using Time Delay Neural Network [9.726119468893721]
This paper introduces StutterNet, a novel deep-learning-based stuttering detection system.
We use a time-delay neural network (TDNN) suitable for capturing contextual aspects of the disfluent utterances.
Our method achieves promising results and outperforms the state-of-the-art residual neural network based method.
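A TDNN captures the contextual aspects mentioned above by letting each output frame depend on input frames at dilated time offsets, widening the receptive field without widening the kernel. The function below sketches that idea only; the uniform averaging stands in for learned weights, and the offsets are an illustrative choice rather than StutterNet's actual architecture.

```python
# Sketch of one TDNN-style layer: each output frame is a (here uniform)
# combination of input frames at dilated offsets, e.g. {-2, 0, +2}.

def tdnn_layer(x, offsets=(-2, 0, 2)):
    out = []
    for t in range(len(x)):
        taps = [x[t + o] for o in offsets if 0 <= t + o < len(x)]
        out.append(sum(taps) / len(taps))  # learned weights in a real TDNN
    return out

signal = [0.0, 0.0, 1.0, 0.0, 0.0]   # a single "event" at frame 2
context = tdnn_layer(signal)
```

Stacking such layers grows the temporal context multiplicatively, which is what makes the architecture suitable for disfluencies that span many frames.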
arXiv Detail & Related papers (2021-05-12T11:36:01Z)
- FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation residual convolutional neural network which facilitates the learning of strong spectral frame-level representations.
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
arXiv Detail & Related papers (2020-09-23T21:51:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.