STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter
- URL: http://arxiv.org/abs/2601.10223v1
- Date: Thu, 15 Jan 2026 09:36:18 GMT
- Title: STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter
- Authors: Ziqi Xu, Yi Liu, Yuekang Li, Ling Shi, Kailong Wang, Yongxin Zhao,
- Abstract summary: We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication.
- Score: 17.65146066814439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People who stutter (PWS) face systemic exclusion in today's voice-driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication. STEAMROLLER employs a three-stage architecture comprising ASR transcription, multi-agent text repair, and speech synthesis, where our core innovation lies in a collaborative multi-agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrate a clear word error rate (WER) reduction and strong user satisfaction. Beyond immediate accessibility benefits, fine-tuning ASR on STEAMROLLER-repaired speech yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.
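The three-stage design described above is concrete enough to sketch. The toy pipeline below is our hypothetical reading of it, not the paper's implementation: every function is a stand-in, and the drafter/critic split, the stopping rule, and the example repairs are all assumptions.

```python
# Hypothetical sketch of the three-stage pipeline (ASR -> multi-agent text
# repair -> TTS). All components are toy stand-ins, not STEAMROLLER's models.

def asr_transcribe(audio: bytes) -> str:
    # Stage 1 stand-in: a real ASR model would run here; stuttered input
    # typically yields repetitions and part-word fragments in the transcript.
    return "I I want to to b-book a a flight"

def repair_agent(transcript: str) -> str:
    # Stand-in "drafter" agent: strips whole-word and part-word repetitions.
    out = []
    for word in transcript.split():
        clean = word.split("-")[-1]  # "b-book" -> "book"
        if not out or out[-1].lower() != clean.lower():
            out.append(clean)
    return " ".join(out)

def critic_agent(original: str, repaired: str) -> bool:
    # Stand-in "critic" agent: accept the repair only if every original
    # content word survives, approximating "preserving semantic intent".
    kept = set(repaired.lower().split())
    return all(w.split("-")[-1].lower() in kept for w in original.split())

def synthesize(text: str) -> bytes:
    # Stage 3 stand-in: a TTS engine would return fluent audio here.
    return text.encode()

def steamroller(audio: bytes, max_rounds: int = 3) -> bytes:
    transcript = asr_transcribe(audio)            # stage 1: ASR
    for _ in range(max_rounds):                   # stage 2: iterative repair
        candidate = repair_agent(transcript)
        if critic_agent(transcript, candidate):
            transcript = candidate
            break
    return synthesize(transcript)                 # stage 3: speech synthesis

print(steamroller(b"\x00").decode())  # -> "I want to book a flight"
```

The property the abstract emphasizes, iterative refinement that preserves semantic intent, appears here as the critic vetoing any draft that loses a content word.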
Related papers
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines. We introduce the first approach to extend tool use directly into speech-in speech-out systems. We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech (a toy sketch of this latency-hiding idea follows this entry).
arXiv Detail & Related papers (2025-10-02T14:18:20Z)
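As a rough illustration of that latency-hiding idea as we read the abstract, the sketch below speculatively launches a retrieval for each partial ASR hypothesis and cancels stale ones, so a result is already in flight when the user stops speaking. All names, texts, and timings are invented for the demo.

```python
import asyncio

async def partial_transcripts():
    # Stand-in for a streaming ASR: emits growing hypotheses mid-utterance.
    for text in ["what's the", "what's the weather in", "what's the weather in Paris"]:
        await asyncio.sleep(0.3)  # the user is still speaking
        yield text

async def retrieve(query: str) -> str:
    await asyncio.sleep(0.5)      # simulated tool/search latency
    return f"results for {query!r}"

async def main():
    pending = None
    async for hypothesis in partial_transcripts():
        if pending is not None:
            pending.cancel()      # supersede the stale speculative query
        pending = asyncio.create_task(retrieve(hypothesis))
    # By end-of-turn the final retrieval has been running in parallel with
    # the speech, so the user-perceived wait is only the remaining fraction.
    print(await pending)

asyncio.run(main())
```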
- KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI [14.667102744113295]
Real-time speech-to-speech (S2S) models excel at generating low-latency conversational responses but often lack deep knowledge and semantic understanding. Cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms.
arXiv Detail & Related papers (2025-09-26T00:46:34Z)
- From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach [0.0]
We propose an automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces (the two figures are unpacked in the short check after this entry).
arXiv Detail & Related papers (2025-09-02T16:13:29Z)
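The relative and absolute figures in the previous entry describe the same result from two angles; a quick consistency check, assuming the baseline is 36% WER:

```python
baseline_wer = 0.36
absolute_drop = 0.06                        # reported absolute reduction
new_wer = baseline_wer - absolute_drop      # 0.36 -> 0.30
relative_drop = absolute_drop / baseline_wer
print(f"new WER: {new_wer:.0%}, relative reduction: {relative_drop:.1%}")
# -> new WER: 30%, relative reduction: 16.7% (the reported ~16%)
```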
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment. We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent? [3.7311680121118345]
We propose a new decoder biasing technique to incorporate the visual context while ensuring the ASR output does not degrade for incorrect context (a minimal biasing sketch follows this entry).
We achieve a 59% relative reduction in WER from an unmodified ASR system.
arXiv Detail & Related papers (2022-10-21T11:16:05Z)
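The abstract above does not spell out the biasing mechanism, so the sketch below shows one plausible shallow-fusion-style reading: tokens present in the visual context receive a log-probability boost, gated so that acoustically implausible tokens are never boosted and incorrect context cannot hijack the output. All values and names are illustrative.

```python
import math

def bias_logprobs(logprobs: dict[str, float], visual_context: set[str],
                  boost: float = 2.0, floor: float = math.log(0.01)) -> dict[str, float]:
    # Boost context tokens, but only if the acoustic model already assigns
    # them non-negligible probability (the guard against wrong context).
    biased = dict(logprobs)
    for token in visual_context:
        score = biased.get(token)
        if score is not None and score > floor:
            biased[token] = score + boost
    return biased

# The acoustics are ambiguous between "pair" and "pear"; a pear is visible.
step = {"pair": math.log(0.40), "pear": math.log(0.35), "bear": math.log(0.25)}
biased = bias_logprobs(step, visual_context={"pear", "table"})
print(max(biased, key=biased.get))  # -> "pear"
```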
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is thus possible to obtain promising ASR training results with this data augmentation method using only a single real speaker in the target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep (an openSMILE extraction sketch follows this entry).
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
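Of the four toolkits named in the previous entry, openSMILE has an official Python wrapper; a minimal extraction sketch is below. The feature set and the file name are our choices, not necessarily the paper's configuration.

```python
# pip install opensmile
import opensmile

# eGeMAPS functionals: one fixed-length acoustic feature vector per file.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")  # hypothetical input file
print(features.shape)  # (1, 88) -- the 88 eGeMAPSv02 functionals
```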
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique (sketched after this entry).
Our monolingual Turkish Conformer achieved a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
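BPE-dropout, the technique the previous entry builds on, is simple to sketch: run ordinary BPE merges over a word but skip each candidate merge with probability p, so the same word yields varied, finer-grained segmentations across training epochs. The toy merge table and the early-stop simplification below are ours.

```python
import random

def bpe_dropout_encode(word: str, merges: dict, p: float = 0.1, rng=random) -> list:
    """Segment `word` with BPE, randomly skipping merges with probability p."""
    symbols = list(word)
    while len(symbols) > 1:
        # Collect adjacent pairs that have a merge rule, dropping each
        # candidate with probability p (the "dropout" step).
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= p
        ]
        if not candidates:
            break  # simplification: stop once every applicable merge is dropped
        _, i = min(candidates)                   # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (lower rank = higher priority), invented for the demo.
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}
print(bpe_dropout_encode("lower", merges, p=0.0))  # ['lower'] -- plain BPE
print(bpe_dropout_encode("lower", merges, p=0.5))  # e.g. ['lo', 'w', 'e', 'r']
```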
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for low-resource languages with low data collection cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)