Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
- URL: http://arxiv.org/abs/2506.18510v1
- Date: Mon, 23 Jun 2025 11:04:20 GMT
- Title: Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
- Authors: Duygu Altinok
- Abstract summary: Large language models (LLMs) are versatile learners capable of processing both lexical and non-lexical inputs. We propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts.
- Score: 5.439020425819001
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate detection of disfluencies in spoken language is crucial for enhancing the performance of automatic speech and language processing systems, as well as fostering the development of more inclusive speech and language technologies. Leveraging the growing trend of large language models (LLMs) as versatile learners capable of processing both lexical and non-lexical inputs (e.g., audio and video), we propose a novel approach to transcribing disfluencies as explicit tokens with timestamps, enabling the generation of fully annotated disfluency-rich transcripts. Our method integrates acoustic representations extracted from an audio encoder with textual inputs of varying quality: clean transcriptions without disfluencies, time-aligned transcriptions from aligners, or outputs from phoneme-based ASR models -- all of which may contain imperfections. Importantly, our experiments demonstrate that textual inputs do not need to be flawless. As long as they include timestamp-related cues, LLMs can effectively smooth the input and produce fully disfluency-annotated transcripts, underscoring their robustness in handling imperfect hints.
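The abstract describes an input/output contract rather than code; the following is a minimal sketch of that contract, assuming an illustrative timestamp and disfluency-tag notation (the actual token inventory and prompt layout are not given in the abstract):

```python
# Minimal sketch of the described input/output contract (assumptions,
# not the authors' code): the LLM receives an imperfect, timestamp-
# bearing hint alongside audio-encoder features and is trained to emit
# a transcript with explicit, time-stamped disfluency tokens.

# A noisy hint, e.g. from a forced aligner or a phoneme-based ASR model.
hint = [
    (0.12, 0.31, "i"),
    (0.35, 0.52, "i"),      # repetition; the hint itself does not mark it
    (0.60, 0.95, "wan"),    # truncated / misrecognized partial word
    (1.02, 1.40, "want"),
    (1.55, 1.90, "coffee"),
]

def hint_to_text(hint):
    """Serialize (start, end, word) triples into a prompt string."""
    return " ".join(f"<{s:.2f}> {w} </{e:.2f}>" for s, e, w in hint)

prompt = ("Transcribe the audio, tagging disfluencies explicitly. "
          "Imperfect hint: " + hint_to_text(hint))

# A hypothetical fully annotated target the model learns to produce:
target = ("<0.12> [REP i] </0.31> <0.35> i </0.52> "
          "<0.60> [PW wan-] </0.95> <1.02> want </1.40> "
          "<1.55> coffee </1.90>")

print(prompt)
print(target)
```

In the full system, embeddings from the audio encoder would accompany this textual hint; the abstract's point is that the hint may be noisy as long as it carries timestamp cues.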
Related papers
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
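The summary does not spell out the tokenization scheme; one plausible shape for such a scheme, offered purely as an assumption, is to quantize word-level prosodic measurements into a few discrete tokens interleaved with the text:

```python
import numpy as np

# Hypothetical prosody tokenization (an assumption; the summary does not
# specify ProsodyLM's scheme): bin per-word mean pitch and duration and
# interleave the resulting bin tokens with the words.

PITCH_BINS = np.array([120.0, 180.0, 240.0])  # Hz boundaries (illustrative)
DUR_BINS = np.array([0.15, 0.30, 0.50])       # seconds (illustrative)

def prosody_tokens(word, mean_f0, duration):
    p = int(np.digitize(mean_f0, PITCH_BINS))  # pitch bin 0..3
    d = int(np.digitize(duration, DUR_BINS))   # duration bin 0..3
    return f"{word} <P{p}> <D{d}>"

print(prosody_tokens("really", 230.0, 0.42))  # -> really <P2> <D2>
```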
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
We introduce LISTEN, a contrastive-like training method designed to improve the ability of audio-aware large language models (ALLMs) to distinguish between present and absent sounds. We also extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection [5.512072120303165]
Dysfluent-WFST is a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data.
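A full WFST decoder is beyond a short sketch; as a toy stand-in, a plain edit alignment against the intended phoneme sequence can play the same joint role of transcribing phonemes and flagging dysfluencies. Everything below (label rules, phoneme strings) is illustrative, not the paper's transducer:

```python
from difflib import SequenceMatcher

# Toy stand-in (an assumption, not the paper's decoder): align observed
# phonemes to the intended sequence and flag inserted runs of a
# neighboring phoneme as repetitions.

reference = ["p", "l", "i", "z"]                 # intended word: "please"
observed  = ["p", "p", "p", "l", "i", "i", "z"]  # stuttered realization

def detect_dysfluencies(reference, observed):
    """Return (event, phoneme(s), ...) tuples from an edit alignment."""
    events = []
    sm = SequenceMatcher(a=reference, b=observed, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "insert":
            run = observed[j1:j2]
            neighbors = set(reference[max(i1 - 1, 0):i1 + 1])
            if len(set(run)) == 1 and run[0] in neighbors:
                events.append(("repetition", run[0], j1, j2))
        elif op == "delete":
            events.append(("deletion", reference[i1:i2]))
    return events

print(detect_dysfluencies(reference, observed))
# [('repetition', 'p', 0, 2), ('repetition', 'i', 5, 6)]
```

The real system presumably encodes comparable rules as weighted transducer arcs over phoneme posteriors rather than as a hard alignment.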
arXiv Detail & Related papers (2025-05-22T08:02:50Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
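The summary names the objective only as "contrastive-like"; a minimal sketch of one such pairwise objective (the loss form and scores below are assumptions, not the paper's) could be:

```python
import torch
import torch.nn.functional as F

# Sketch of a pairwise, contrastive-like objective in the spirit of
# LISTEN (an assumption; the exact loss is not given in this summary):
# for each clip, pair a question about a sound that IS present with one
# about a synthesized absent sound, and push the model's "yes" score
# apart on the two.

def listen_style_loss(score_present, score_absent):
    """Pairwise logistic loss: 'yes' should score higher when the
    queried sound actually occurs in the audio."""
    return -F.logsigmoid(score_present - score_absent).mean()

# Toy "yes" logits for a batch of three clips.
s_pos = torch.tensor([2.1, 0.3, 1.5])    # queried sound is present
s_neg = torch.tensor([-0.5, 0.8, -1.0])  # queried sound is absent
print(listen_style_loss(s_pos, s_neg))   # scalar training loss
```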
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach to enable streamable DC-Spin without retraining or degradation.
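A rough sketch of what chunk-wise, streamable tokenization can look like; the chunk size, context length, and stand-in tokenizer below are placeholders rather than DC-Spin's actual configuration:

```python
import numpy as np

# Sketch of chunk-wise streaming tokenization (assumptions throughout).

def fake_tokenizer(frames):
    """Stand-in for a trained speech tokenizer: one token per frame."""
    return (frames.mean(axis=1) > 0).astype(int).tolist()

def stream_tokenize(features, chunk=50, left_context=10):
    """Tokenize fixed-size chunks, re-feeding a little left context so
    chunk boundaries see the same neighborhood as offline decoding."""
    tokens = []
    for start in range(0, len(features), chunk):
        ctx_start = max(0, start - left_context)
        window = features[ctx_start:start + chunk]
        out = fake_tokenizer(window)
        tokens.extend(out[start - ctx_start:])  # drop the context tokens
    return tokens

feats = np.random.randn(230, 80)    # 230 frames of 80-dim features
print(len(stream_tokenize(feats)))  # 230 tokens, emitted chunk by chunk
```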
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
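A hedged sketch of the kind of automatic pairing process the summary hints at (the prompt template and the text_llm() helper are placeholders, not the released pipeline): describe the audio entirely in text, then let a text-only LLM author the target response.

```python
# Sketch of automatic speech-text pair construction (assumptions only).

def text_llm(prompt):
    """Placeholder for any instruction-tuned text LLM."""
    return f"[response conditioned on: {prompt[:60]}...]"

def make_pair(audio_path, transcript, metadata):
    # Serialize what the audio "sounds like" purely as text, so a
    # text-only LLM can author the target response.
    desc = (f'Transcript: "{transcript}". '
            f"Speaker: {metadata['gender']}, emotion: {metadata['emotion']}.")
    target = text_llm("Respond to this spoken input. " + desc)
    return {"audio": audio_path, "prompt": desc, "target": target}

pair = make_pair("clip_001.wav", "turn the lights off",
                 {"gender": "female", "emotion": "neutral"})
print(pair["target"])
```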
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Large Language Models for Dysfluency Detection in Stuttered Speech [16.812800649507302]
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components.
Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, we approach the task of multi-label dysfluency detection as a language modeling problem.
We present candidate hypotheses generated with an automatic speech recognition system, together with acoustic representations extracted from an audio encoder model, to an LLM, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech.
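A minimal sketch of one way to serialize such a training example (the label set and prompt template are illustrative assumptions, not the authors' exact format; the audio-encoder embeddings are indicated only by a placeholder):

```python
# Framing multi-label dysfluency detection as language modeling.

LABELS = ["block", "prolongation", "sound_repetition",
          "word_repetition", "interjection"]

def build_example(nbest, gold_labels):
    """One finetuning example: ASR hypothesis candidates in, labels out."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    prompt = ("<audio_embeddings>\n"
              "ASR hypotheses:\n" + hyps + "\nDysfluency labels:")
    target = ", ".join(sorted(gold_labels)) or "none"
    return {"prompt": prompt, "target": target}

ex = build_example(
    nbest=["i i want um coffee", "i want um coffee"],
    gold_labels={"word_repetition", "interjection"},
)
print(ex["prompt"])
print(ex["target"])  # interjection, word_repetition
```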
arXiv Detail & Related papers (2024-06-16T17:51:22Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome the limitation of relying on text alone by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
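A minimal sketch of the idea, assuming normalized entropy of the LLM's next-token distribution as the token-level uncertainty signal (the paper's calibration and weighting details may differ):

```python
import torch

# Uncertainty-aware late fusion sketch (assumptions throughout): the
# more uncertain the LLM is about the next token, the more weight the
# acoustic model's distribution receives.

def uadf_step(llm_logits, am_logits):
    llm_p = torch.softmax(llm_logits, dim=-1)
    am_p = torch.softmax(am_logits, dim=-1)
    # Token-level uncertainty: entropy, normalized to [0, 1].
    ent = -(llm_p * llm_p.clamp_min(1e-9).log()).sum(-1)
    alpha = ent / torch.log(torch.tensor(float(llm_p.shape[-1])))
    # Uncertain LLM -> lean on the acoustic model, and vice versa.
    return (1 - alpha) * llm_p + alpha * am_p

vocab = 8
fused = uadf_step(torch.randn(vocab), torch.randn(vocab))
print(fused.sum())  # ~1.0: the fusion is still a valid distribution
```

The design intuition is that the LLM should dominate when it is confident and defer to the acoustic evidence when it is not.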
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
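The pipeline shape is straightforward to sketch: ASR, then text-level disfluency removal, then TTS. All three components below are crude stand-ins (a rule-based cleaner instead of the tool's actual models):

```python
import re

# Speech-to-speech disfluency correction pipeline sketch (placeholders).

def asr(audio_path):
    return "i uh i want to to book a ticket"   # stand-in transcript

def remove_disfluencies(text):
    text = re.sub(r"\b(uh|um|erm)\b\s*", "", text)   # filled pauses
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)   # word repetitions
    return text.strip()

def tts(text):
    return f"<synthesized: {text}>".encode()   # stand-in waveform

def disfluency_fixer(audio_path):
    return tts(remove_disfluencies(asr(audio_path)))

print(disfluency_fixer("input.wav"))
# b'<synthesized: i want to book a ticket>'
```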
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation [27.19320167337675]
We propose a technique to learn a robust speech encoder in a self-supervised fashion only on the speech side.
This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution to improving E2E-ST, but can also perform pre-training on any acoustic signals.
In the setting without using any transcriptions, our technique achieves an average improvement of +1.1 BLEU, and +2.3 BLEU with MAM pre-training.
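A minimal sketch of span masking over speech features, assuming zero-masking and an MSE reconstruction loss (span length, masking rate, and loss are not given in this summary); no transcriptions are involved, matching the self-supervised setting above:

```python
import torch

# Masked acoustic modeling sketch: zero out random spans of speech
# frames and train a context-using encoder to reconstruct them.

class ConvEncoder(torch.nn.Module):
    """Toy encoder with temporal context, so masked frames can be
    filled in from their neighbors."""
    def __init__(self, dim=80):
        super().__init__()
        self.conv = torch.nn.Conv1d(dim, dim, kernel_size=9, padding=4)

    def forward(self, x):                      # x: (T, dim)
        return self.conv(x.T.unsqueeze(0)).squeeze(0).T

def mam_loss(frames, encoder, span=5, mask_rate=0.15):
    T = frames.shape[0]
    masked = frames.clone()
    is_masked = torch.zeros(T, dtype=torch.bool)
    n_spans = int(T * mask_rate / span)
    for start in torch.randperm(T - span)[:n_spans].tolist():
        masked[start:start + span] = 0.0       # hide a span of frames
        is_masked[start:start + span] = True
    recon = encoder(masked)                    # reconstruct from context
    return torch.nn.functional.mse_loss(recon[is_masked],
                                        frames[is_masked])

print(mam_loss(torch.randn(200, 80), ConvEncoder()))
```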
arXiv Detail & Related papers (2020-10-22T05:02:06Z)