Towards Hierarchical Spoken Language Dysfluency Modeling
- URL: http://arxiv.org/abs/2401.10015v2
- Date: Sun, 21 Jan 2024 06:51:25 GMT
- Title: Towards Hierarchical Spoken Language Dysfluency Modeling
- Authors: Jiachen Lian and Gopala Anumanchipalli
- Abstract summary: Speech disfluency modeling is the bottleneck for both speech therapy and language learning.
We present the Hierarchical Unconstrained Disfluency Modeling (H-UDM) approach, a hierarchical extension of UDM.
Our experimental findings serve as clear evidence of the effectiveness and reliability of the methods we have introduced.
- Score: 8.45042473491412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech disfluency modeling is the bottleneck for both speech therapy and
language learning. However, there is no effective AI solution to systematically
tackle this problem. We solidify the concept of disfluent speech and disfluent
speech modeling. We then present the Hierarchical Unconstrained Disfluency Modeling
(H-UDM) approach, a hierarchical extension of UDM that addresses both
disfluency transcription and detection to eliminate the need for extensive
manual annotation. Our experimental findings serve as clear evidence of the
effectiveness and reliability of the methods we have introduced, encompassing
both transcription and detection tasks.
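To make the detection task concrete, the following is a minimal, illustrative sketch of one common dysfluency type (sound/word repetition) being flagged in a tokenized transcript. This is a toy example for intuition only, not the H-UDM method; the function name and token format are assumptions.

```python
def detect_repetitions(tokens):
    """Flag positions where a token is immediately repeated,
    a common dysfluency type (e.g. "i i want to to go")."""
    events = []
    i = 0
    while i < len(tokens) - 1:
        if tokens[i] == tokens[i + 1]:
            events.append((i, tokens[i]))
            # skip past the whole repeated run
            while i < len(tokens) - 1 and tokens[i] == tokens[i + 1]:
                i += 1
        i += 1
    return events

print(detect_repetitions(["i", "i", "want", "to", "to", "to", "go"]))
# → [(0, 'i'), (3, 'to')]
```

Real systems must also handle prolongations, blocks, and insertions, and must work from audio rather than a clean transcript, which is exactly the annotation burden H-UDM aims to reduce.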
Related papers
- Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching [0.0]
Dysarthria is a neurological disorder that significantly impairs speech intelligibility. This necessitates the development of robust dysarthric-to-regular speech conversion techniques.
arXiv Detail & Related papers (2025-06-19T08:24:17Z) - Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis [8.5693791544413]
We propose Neural LCS, a novel approach for dysfluent text-text and speech-text alignment. We evaluate our method on a large-scale simulated dataset. Our results demonstrate the potential of Neural LCS to enhance automated systems for diagnosing and analyzing speech disorders.
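The classical longest common subsequence (LCS) that this work builds on can be used directly to line up a reference transcript with a dysfluent one: words left unmatched in the dysfluent side are insertion candidates (repetitions, false starts). Below is a standard dynamic-programming LCS with backtracking; the word-level framing and example transcripts are illustrative assumptions, not the paper's neural formulation.

```python
def lcs_align(ref, hyp):
    """Classic DP longest common subsequence, used here to pair up
    words of a reference transcript with a dysfluent hypothesis."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == hyp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # backtrack to recover the aligned (ref_index, hyp_index) pairs
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

ref = ["please", "call", "stella"]
hyp = ["p-", "please", "call", "c-", "call", "stella"]
print(lcs_align(ref, hyp))
# → [(0, 1), (1, 4), (2, 5)]
```

Hypothesis words at indices 0 and 3 ("p-", "c-") are left unaligned, marking them as likely dysfluencies.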
arXiv Detail & Related papers (2025-06-05T03:06:37Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection [5.512072120303165]
Dysfluent-WFST is a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data.
arXiv Detail & Related papers (2025-05-22T08:02:50Z) - Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease [52.46922921214341]
Alzheimer's disease (AD) has become one of the most significant health challenges in an aging society.
We devised an explainable and effective feature set that leverages the visual capabilities of a large language model (LLM) and the Term Frequency-Inverse Document Frequency (TF-IDF) model.
Our new features can be explained and interpreted step by step, which enhances the interpretability of automatic AD screening.
arXiv Detail & Related papers (2024-11-28T05:23:22Z) - Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z) - Automatic Disfluency Detection from Untranscribed Speech [25.534535098405602]
Stuttering is a speech disorder characterized by a high rate of disfluencies.
Automatic disfluency detection may help in treatment planning for individuals who stutter.
We investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization.
arXiv Detail & Related papers (2023-11-01T21:36:39Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Diffusion-based speech enhancement with a weighted generative-supervised
learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
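The augmented objective described above can be sketched as a weighted sum of the generative (noise-prediction) term and the supervised MSE term. This is a minimal illustration under assumed names and a made-up weight; the paper's exact formulation and weighting may differ.

```python
import numpy as np

def weighted_se_loss(eps_pred, eps_true, enhanced, clean, mse_weight=0.5):
    """Combined objective: standard diffusion noise-prediction loss
    plus a weighted MSE between the estimated enhanced speech and
    the ground-truth clean speech."""
    diffusion_loss = np.mean((eps_pred - eps_true) ** 2)  # generative term
    supervised_mse = np.mean((enhanced - clean) ** 2)     # supervised term
    return diffusion_loss + mse_weight * supervised_mse
```

Setting `mse_weight` to zero recovers the purely generative training objective, which makes the role of the added supervised term easy to ablate.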
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling [10.283092375534311]
We propose a simple and effective modification of alignment graph construction using weighted Finite State Transducers.
The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment.
Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements.
arXiv Detail & Related papers (2023-05-30T09:57:36Z) - DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z) - A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features in languages.
Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z) - Streaming Joint Speech Recognition and Disfluency Detection [30.018034246393725]
We propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection.
Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors.
We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency.
arXiv Detail & Related papers (2022-11-16T07:34:20Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - End-to-End Speech Recognition and Disfluency Removal [15.910282983166024]
This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
arXiv Detail & Related papers (2020-09-22T03:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.