ASR Under Noise: Exploring Robustness for Sundanese and Javanese
- URL: http://arxiv.org/abs/2509.25878v1
- Date: Tue, 30 Sep 2025 07:20:25 GMT
- Title: ASR Under Noise: Exploring Robustness for Sundanese and Javanese
- Authors: Salsabila Zahirah Pranida, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Shady Shehata,
- Abstract summary: We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese.<n>We experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs)<n>Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models.
- Score: 2.839588958814753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, their effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements
Related papers
- When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper [0.0]
We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper.<n> Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance.<n>These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition.
arXiv Detail & Related papers (2026-03-05T01:20:11Z) - Training-Free Intelligibility-Guided Observation Addition for Noisy ASR [57.74127683005929]
This paper proposes an intelligibility-guided observation addition (OA) method to improve speech recognition in noisy environments.<n>Experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines.
arXiv Detail & Related papers (2026-02-24T14:46:54Z) - Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining [21.26555178371168]
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame.<n>Deep neural network-based models have shown good performance in this task.<n>We propose a causal, Self-Supervised Learning (SSL) pretraining framework to enhance TS-VAD performance in noisy conditions.
arXiv Detail & Related papers (2025-01-06T18:00:14Z) - Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z) - Reassessing Noise Augmentation Methods in the Context of Adversarial Speech [12.488332326259469]
We investigate if noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition systems.
The results demonstrate that noise augmentation not only improves model performance on noisy speech but also the model's robustness to adversarial attacks.
arXiv Detail & Related papers (2024-09-03T11:51:10Z) - Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training [39.21885486667879]
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges, including hallucination, outdated knowledge, and untraceable reasoning processes.
Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges.
We propose a novel RAG approach known as Retrieval-augmented Adaptive Adrial Training (RAAT)
arXiv Detail & Related papers (2024-05-31T16:24:53Z) - Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR)
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Continuous Modeling of the Denoising Process for Speech Enhancement
Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z) - On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion
and Automatic Speech Recognition [6.006652562747009]
We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3% respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
arXiv Detail & Related papers (2023-05-21T18:52:21Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Addressing the Vulnerability of NMT in Input Perturbations [10.103375853643547]
We improve robustness of NMT models by reducing the effect of noisy words through a Context-Enhanced Reconstruction approach.
CER trains the model to resist noise in two steps: (1) step that breaks the naturalness of input sequence with made-up words; (2) reconstruction step that defends the noise propagation by generating better and more robust contextual representation.
arXiv Detail & Related papers (2021-04-20T07:52:58Z) - Improving noise robust automatic speech recognition with single-channel
time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance.
We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.