Related papers: When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems

URL: http://arxiv.org/abs/2512.17562v1
Date: Fri, 19 Dec 2025 13:32:19 GMT
Title: When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems
Authors: Sujal Chondhekar, Vasanth Murukuri, Rushabh Vasani, Sanika Goyal, Rajshree Badami, Anushree Rana, Sanjana SN, Karthik Pandia, Sulabh Katiyar, Neha Jagadeesh, Sankalp Gulati
Abstract summary: Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models.
Score: 0.6158894274166716
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted in the case of modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems: OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, Parrotlet-a using 500 medical speech recordings under nine noise conditions. ASR performance is measured using semantic WER (semWER), a normalized word error rate (WER) metric accounting for domain-specific normalizations. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models x 10 conditions), with degradations ranging from 1.1% to 46.6% absolute semWER increase. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction techniques might not just be computationally wasteful but also be potentially harmful to the transcription accuracy.
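The abstract reports results as absolute increases in a normalized word error rate. As a rough illustration only, the sketch below computes plain WER (word-level edit distance divided by reference length) after a simple text normalization. The `normalize` function here is a hypothetical stand-in (lowercasing and punctuation removal); it is not the paper's semWER normalization, whose domain-specific rules are not given on this page.

```python
import re


def normalize(text: str) -> list:
    """Hypothetical normalization (NOT the paper's semWER rules):
    lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def absolute_increase(reference: str, noisy_hyp: str, enhanced_hyp: str) -> float:
    """Absolute WER change when the enhanced-audio transcript replaces the noisy one."""
    return wer(reference, enhanced_hyp) - wer(reference, noisy_hyp)
```

Under this sketch, a positive `absolute_increase` means the transcript of the enhanced audio has more errors than the transcript of the original noisy audio, which is the direction of the degradation the abstract describes.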

Related papers

When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper [0.0]
We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition.
arXiv Detail & Related papers (2026-03-05T01:20:11Z)
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR [57.74127683005929]
This paper proposes an intelligibility-guided observation addition (OA) method to improve speech recognition in noisy environments. Experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines.
arXiv Detail & Related papers (2026-02-24T14:46:54Z)
Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion [1.376408511310322]
Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem.
arXiv Detail & Related papers (2025-11-14T19:27:42Z)
Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding [26.98755758066905]
We train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems. We propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system.
arXiv Detail & Related papers (2024-10-21T03:13:22Z)
Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models. We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z)
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech. We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z)
On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition [23.812838405442953]
We propose an efficient attempt to noisy speech emotion recognition (NSER). We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
arXiv Detail & Related papers (2023-11-13T05:45:55Z)
Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies. This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition. Both normal and disordered speech were exploited in the augmentation process. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute word error rate (WER)
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments. We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance. We show that single-channel noise reduction can still improve ASR performance.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.