Comparative Study on Noise-Augmented Training and its Effect on Adversarial Robustness in ASR Systems
- URL: http://arxiv.org/abs/2409.01813v4
- Date: Fri, 07 Nov 2025 17:07:19 GMT
- Title: Comparative Study on Noise-Augmented Training and its Effect on Adversarial Robustness in ASR Systems
- Authors: Karla Pizzi, Matías Pizarro, Asja Fischer
- Abstract summary: We conduct a comparative analysis of the adversarial robustness of four different ASR architectures. We then evaluate the robustness of all resulting models against attacks with white-box or black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model's robustness to adversarial attacks.
- Score: 11.345811825870504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we investigate whether noise-augmented training can concurrently improve adversarial robustness in automatic speech recognition (ASR) systems. We conduct a comparative analysis of the adversarial robustness of four different ASR architectures, each trained under three different augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. We then evaluate the robustness of all resulting models against attacks with white-box or black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model's robustness to adversarial attacks.
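The white-box evaluation described in the abstract can be illustrated with a minimal FGSM-style sketch. This is not the paper's implementation: the toy linear "model", the function name `fgsm_perturb`, and all parameters are hypothetical, and a real ASR attack would backpropagate through the full network and a CTC or attention loss rather than a linear softmax.

```python
import numpy as np

def fgsm_perturb(x, W, label, eps):
    """One FGSM step: move the input along the sign of the loss gradient.

    x: input feature vector; W: weights of a toy linear softmax classifier;
    label: true class index; eps: L-infinity perturbation budget.
    """
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    # Cross-entropy gradient w.r.t. x for a linear model: W^T (softmax - one_hot)
    grad = W.T @ (probs - one_hot)
    # Step along the gradient sign and keep the signal in a valid amplitude range
    return np.clip(x + eps * np.sign(grad), -1.0, 1.0)
```

Robustness is then measured by how much the model's error rate degrades on such perturbed inputs compared to clean ones.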
Related papers
- Training-Free Intelligibility-Guided Observation Addition for Noisy ASR [57.74127683005929]
This paper proposes an intelligibility-guided observation addition (OA) method to improve speech recognition in noisy environments. Experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines.
arXiv Detail & Related papers (2026-02-24T14:46:54Z)
- Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts [0.0]
We investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under out-of-distribution (OOD) conditions. Our results suggest that noise-aware training improves robustness in some configurations.
arXiv Detail & Related papers (2025-07-30T22:14:16Z)
- Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization [4.720552406377147]
We propose a technique that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models.
Our method is plug-and-play and can be integrated with any existing attack methods.
arXiv Detail & Related papers (2025-03-25T12:14:10Z)
- Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
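The SNR-controlled noise injection this entry describes can be sketched generically. This is an illustration, not any paper's actual code; the function name and signature are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise waveform into a speech waveform at a target SNR (in dB)."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` over a range of values (e.g. 0 to 30 dB) yields augmented training data at graded noise levels.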
arXiv Detail & Related papers (2024-10-18T02:31:36Z)
- Robust VAEs via Generating Process of Noise Augmented Data [9.366139389037489]
This paper introduces a novel framework that enhances robustness by regularizing the latent space divergence between original and noise-augmented data.
Our empirical evaluations demonstrate that this approach, termed Robust Augmented Variational Auto-ENcoder (RAVEN), yields superior performance in resisting adversarial inputs.
arXiv Detail & Related papers (2024-07-26T09:55:34Z)
- An Integrated Algorithm for Robust and Imperceptible Audio Adversarial Examples [2.2866551516539726]
We present an integrated algorithm that uses psychoacoustic models and room impulse responses (RIRs) in the generation step. First, a viable adversarial audio file is produced; it is then fine-tuned with respect to perceptibility and robustness.
arXiv Detail & Related papers (2023-10-05T06:59:09Z)
- On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition [6.006652562747009]
We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
arXiv Detail & Related papers (2023-05-21T18:52:21Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora [70.46867541361982]
We consider a general non-semantic speech representation, TRILL, trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
arXiv Detail & Related papers (2021-09-23T00:43:32Z)
- Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments [13.558688470594676]
State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data.
We simulate an augmented training set that contains nearly five million utterances.
We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set.
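Reverberation augmentation of the kind this entry relies on amounts to convolving dry speech with a room impulse response. The sketch below is a generic illustration under that assumption; the function name and the normalization choice are hypothetical, not taken from the paper.

```python
import numpy as np

def apply_rir(speech, rir):
    """Simulate reverberation by convolving dry speech with a room impulse response."""
    wet = np.convolve(speech, rir)[: len(speech)]
    # Rescale to the dry signal's peak so the augmented sample does not clip.
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))
```

Varying the RIR (measured or simulated, e.g. by the image-source method) produces training utterances for many different rooms.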
arXiv Detail & Related papers (2021-06-25T09:05:38Z)
- An Investigation of End-to-End Models for Robust Speech Recognition [20.998349142078805]
We present a comparison of speech enhancement-based techniques and three different model-based adaptation techniques for robust automatic speech recognition.
While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degrading clean speech WER.
On other relatively stationary noise types, a new speech enhancement technique outperformed all the model-based adaptation techniques.
arXiv Detail & Related papers (2021-02-11T19:47:13Z)
- From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance, and that single-channel noise reduction remains beneficial in this setting.
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.