Related papers: On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

URL: http://arxiv.org/abs/2512.02027v1
Date: Tue, 18 Nov 2025 19:33:29 GMT
Title: On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts
Authors: Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer,
Abstract summary: Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value.<n>We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions.<n>Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR.
Score: 21.253980895817634
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.

Related papers

HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation [19.997594859651233]
HENT-SRT is a novel framework that factorizes ASR and translation tasks to better handle reordering.<n>We improve computational efficiency by incorporating best practices from ASR transducers.<n>Our approach is evaluated on three conversational datasets Arabic, Spanish, and Mandarin.
arXiv Detail & Related papers (2025-06-02T18:37:50Z)
Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection [5.512072120303165]
Dysfluent-WFST is a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency.<n>It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data.
arXiv Detail & Related papers (2025-05-22T08:02:50Z)
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages [49.31519786009296]
We fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions.<n>We then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech.<n>The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingually Speech (MMS)
arXiv Detail & Related papers (2025-05-20T20:03:45Z)
LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation [43.26446958873554]
Large language models (LLMs) have shown promising results in multilingual translation even with limited bilingual supervision. Recent advancements in large language models (LLMs) have shown promising results in multilingual translation even with limited bilingual supervision. LandeRMT is a framework that selectively finetunes LLMs to textbfMachine textbfTranslation with diverse translation training data.
arXiv Detail & Related papers (2024-09-29T02:39:42Z)
Automatic Disfluency Detection from Untranscribed Speech [25.534535098405602]
Stuttering is a speech disorder characterized by a high rate of disfluencies. automatic disfluency detection may help in treatment planning for individuals who stutter. We investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization.
arXiv Detail & Related papers (2023-11-01T21:36:39Z)
Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages. Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi. Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
Phrase-level Adversarial Example Generation for Neural Machine Translation [75.01476479100569]
We propose a phrase-level adversarial example generation (PAEG) method to enhance the robustness of the model. We verify our method on three benchmarks, including LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks.
arXiv Detail & Related papers (2022-01-06T11:00:49Z)
Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction. It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition. We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
Sentence Boundary Augmentation For Neural Machine Translation Robustness [11.290581889247983]
We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness. We show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
arXiv Detail & Related papers (2020-10-21T16:44:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.