Stuttering-Aware Automatic Speech Recognition for Indonesian Language
- URL: http://arxiv.org/abs/2601.03727v2
- Date: Wed, 14 Jan 2026 07:30:02 GMT
- Title: Stuttering-Aware Automatic Speech Recognition for Indonesian Language
- Authors: Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah,
- Abstract summary: We propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
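The rule-based half of the augmentation pipeline can be illustrated with a small sketch. The function below is a hypothetical simplification (the function name, probabilities, and heuristics are illustrative assumptions, not the paper's implementation, which also uses LLM-based transformations followed by TTS): it injects word-initial syllable repetitions ("makan" -> "ma- ma- makan") and vowel prolongations ("saya" -> "saaaya") into fluent Indonesian text.

```python
import random

def inject_stutter(text, p_rep=0.15, p_prolong=0.1, seed=None):
    """Rule-based stutter injection (illustrative sketch, not the paper's code).

    p_rep: per-word probability of a syllable-onset repetition.
    p_prolong: per-word probability of stretching the first vowel.
    """
    rng = random.Random(seed)
    vowels = "aiueo"
    out = []
    for word in text.split():
        r = rng.random()
        if r < p_rep and len(word) > 2:
            # repeat the word's first two characters as a dysfluent onset
            onset = word[:2]
            word = f"{onset}- {onset}- {word}"
        elif r < p_rep + p_prolong:
            # stretch the first vowel to mimic a prolongation
            for i, ch in enumerate(word):
                if ch.lower() in vowels:
                    word = word[:i] + ch * 3 + word[i + 1:]
                    break
        out.append(word)
    return " ".join(out)

# Example: deterministic injection on a fluent Indonesian sentence
print(inject_stutter("saya mau makan nasi goreng", seed=7))
```

The transformed text would then be passed to a TTS system to produce synthetic stuttered audio for fine-tuning.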
Related papers
- ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis [3.1848820580333737]
We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. We propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech.
arXiv Detail & Related papers (2026-03-04T16:04:02Z) - Improving Code-Switching Speech Recognition with TTS Data Augmentation [58.34842693152991]
This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. We fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech.
arXiv Detail & Related papers (2026-01-02T10:11:51Z) - Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition [8.948233216872211]
Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. We introduce a novel data augmentation technique for speech corpora designed to mitigate this gap.
arXiv Detail & Related papers (2025-11-25T17:35:57Z) - MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z) - Speechless: Speech Instruction Training Without Speech for Low Resource Languages [14.223895501862811]
Speech instruction data, essential for fine-tuning large language models to understand and execute spoken commands, is scarce. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference.
arXiv Detail & Related papers (2025-05-23T03:05:47Z) - Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages [49.31519786009296]
We fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions. We then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS).
arXiv Detail & Related papers (2025-05-20T20:03:45Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
We find that, for data generation, auto-regressive decoding performs better than non-autoregressive decoding, and we propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z) - Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - On decoder-only architecture for speech-to-text and large language model
integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.