Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection
- URL: http://arxiv.org/abs/2505.16351v2
- Date: Sun, 25 May 2025 01:02:29 GMT
- Title: Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection
- Authors: Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Hwi Joo Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli
- Abstract summary: Dysfluent-WFST is a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data.
- Score: 5.512072120303165
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
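The abstract's central claim is that explicitly modeling pronunciation behavior in the decoding graph, rather than in the model architecture, is what enables zero-shot dysfluency detection. The following is a minimal, hypothetical sketch of that idea, not the authors' implementation: a toy Viterbi search over a hand-built reference graph whose "stay" arcs absorb prolongations/repetitions and whose "skip" arcs absorb deletions, so dysfluencies fall out of the best path. All names and penalty values are invented for illustration, and no WFST library (e.g. OpenFst) is used.

```python
# Toy sketch of dysfluency-aware decoding (illustrative only; not the
# Dysfluent-WFST implementation). Given per-frame phoneme posteriors
# from an upstream encoder and a reference phoneme sequence, run a
# Viterbi search over a graph with three arc types:
#   "stay" -> remain on the current phoneme (prolongation/repetition)
#   "step" -> advance to the next phoneme (fluent path)
#   "skip" -> jump over a phoneme (deletion)
# The arc labels along the best path serve as dysfluency annotations.
import math

def decode(posteriors, reference, stay_penalty=0.5, skip_penalty=2.0):
    """posteriors: list of per-frame dicts {phoneme: probability}.
    reference: the expected phoneme sequence (text-dependent graph)."""
    n_states = len(reference)
    NEG = float("-inf")
    score = [NEG] * n_states            # best log-prob per reference position
    back = [[] for _ in range(n_states)]  # arc history per position
    score[0] = 0.0
    for frame in posteriors:
        new_score = [NEG] * n_states
        new_back = [None] * n_states
        for s in range(n_states):
            if score[s] == NEG:
                continue
            # stay on the same phoneme (penalized: possible prolongation)
            cand = score[s] + math.log(frame.get(reference[s], 1e-8)) - stay_penalty
            if cand > new_score[s]:
                new_score[s], new_back[s] = cand, back[s] + [("stay", s)]
            # advance to the next phoneme (unpenalized fluent path)
            if s + 1 < n_states:
                cand = score[s] + math.log(frame.get(reference[s + 1], 1e-8))
                if cand > new_score[s + 1]:
                    new_score[s + 1], new_back[s + 1] = cand, back[s] + [("step", s + 1)]
            # skip a phoneme (heavily penalized: possible deletion)
            if s + 2 < n_states:
                cand = score[s] + math.log(frame.get(reference[s + 2], 1e-8)) - skip_penalty
                if cand > new_score[s + 2]:
                    new_score[s + 2], new_back[s + 2] = cand, back[s] + [("skip", s + 2)]
        score = new_score
        back = [b if b is not None else [] for b in new_back]
    best = max(range(n_states), key=lambda s: score[s])
    return back[best]
```

A run of "stay" arcs on frames where the encoder keeps emitting the same phoneme would be read back as a prolongation or repetition; because the graph is built from the reference text alone and the search needs no training, the scheme is zero-shot in the same sense the abstract describes.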
Related papers
- Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis [8.5693791544413]
We propose Neural LCS, a novel approach for dysfluent text-text and speech-text alignment. We evaluate our method on a large-scale simulated dataset. Our results demonstrate the potential of Neural LCS to enhance automated systems for diagnosing and analyzing speech disorders.
arXiv Detail & Related papers (2025-06-05T03:06:37Z) - Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection [5.95376852691752]
Speech dysfluency detection is crucial for clinical diagnosis and language assessment. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework.
arXiv Detail & Related papers (2025-05-28T06:52:10Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - Towards Hierarchical Spoken Language Dysfluency Modeling [8.45042473491412]
Speech disfluency modeling is the bottleneck for both speech therapy and language learning.
We present Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, the hierarchical extension of UDM.
Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced.
arXiv Detail & Related papers (2024-01-18T14:33:01Z) - Automatic Disfluency Detection from Untranscribed Speech [25.534535098405602]
Stuttering is a speech disorder characterized by a high rate of disfluencies.
Automatic disfluency detection may help in treatment planning for individuals who stutter.
We investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization.
arXiv Detail & Related papers (2023-11-01T21:36:39Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Streaming Joint Speech Recognition and Disfluency Detection [30.018034246393725]
We propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection.
Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors.
We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency.
arXiv Detail & Related papers (2022-11-16T07:34:20Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.