DRES: Benchmarking LLMs for Disfluency Removal
- URL: http://arxiv.org/abs/2509.20321v1
- Date: Wed, 24 Sep 2025 17:08:12 GMT
- Title: DRES: Benchmarking LLMs for Disfluency Removal
- Authors: Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee,
- Abstract summary: Disfluencies, such as "um," "uh," interjections, parentheticals, and edited statements, remain a persistent challenge for speech-driven systems.<n>Disfluency Removal Evaluation Suite, a controlled text-level benchmark, establishes a reproducible semantic upper bound for this task.<n>DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability.
- Score: 27.083825614818135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
Related papers
- ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models [12.527207210862151]
ReHear is a framework for iterative pseudo-label refinement in automatic speech recognition.<n>It integrates an instruction-tuned, audio-aware large language model into the self-training loop.<n>We show that ReHear effectively mitigates error propagation, consistently outperforming both supervised and pseudo-labeling baselines.
arXiv Detail & Related papers (2026-02-21T05:04:22Z) - Unlocking the Potentials of Retrieval-Augmented Generation for Diffusion Language Models [38.148737920360766]
Retrieval-Augmented Generation shows great successes for enhancing large language models (LLMs)<n>We show that DLMs coupled with RAG show promising potentials with stronger dependency on contextual information, but suffer from limited generation precision.<n>We propose Semantic-Preserving REtrieval-Augmented Diffusion (SPREAD), a novel framework that introduces a query-relevance-guided denoising strategy.
arXiv Detail & Related papers (2026-01-16T14:45:46Z) - SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference.<n>Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction.<n>We propose SUTA-LM, a simple yet effective extension of SUTA, with language model rescoring.<n> Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z) - Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI.<n>Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency.<n>We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence.<n> Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z) - Exploring Iterative Controllable Summarization with Large Language Models [22.80433394369022]
Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks.<n>Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes.<n>We propose a guide-to-explain framework (GTE) for controllable summarization.
arXiv Detail & Related papers (2024-11-19T12:36:02Z) - Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR)
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - End-to-End Speech Recognition and Disfluency Removal [15.910282983166024]
This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
arXiv Detail & Related papers (2020-09-22T03:11:37Z) - Robust Spoken Language Understanding with RL-based Value Error Recovery [35.82890898452309]
Spoken Language Understanding (SLU) aims to extract structured semantic representations (e.g., slot-value pairs) from speech recognized texts.
We propose a new robust SLU framework to guide the SLU input adaptation with a rule-based value error recovery module.
Experiments on the public CATSLU dataset show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-09-07T13:32:07Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.