Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts
- URL: http://arxiv.org/abs/2507.23128v1
- Date: Wed, 30 Jul 2025 22:14:16 GMT
- Title: Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts
- Authors: Anaïs Baranger, Lucas Maison,
- Abstract summary: We investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under OOD conditions.<n>Our results suggest that noise-aware training improves in some configurations.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although prior work in computer vision has shown strong correlations between in-distribution (ID) and out-of-distribution (OOD) accuracies, such relationships remain underexplored in audio-based models. In this study, we investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under OOD conditions. We benchmark several neural architectures across a variety of evaluation sets. To quantify the impact of noise on generalization, we make use of two metrics: Fairness (F), which measures overall accuracy gains compared to a baseline model, and Robustness (R), which assesses the convergence between ID and OOD performance. Our results suggest that noise-aware training improves robustness in some configurations. These findings shed new light on the benefits and limitations of noise-based augmentation for generalization in speech models.
Related papers
- Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments [5.8166742412657895]
Deep neural networks (DNNs) have achieved remarkable success across diverse domains, but their performance can be severely degraded by noisy or corrupted training data.<n>We propose a novel framework that integrates attribution-guided data partitioning, discriminative neuron pruning, and targeted fine-tuning to mitigate the impact of noisy samples.<n>Our framework achieves approximately a 10% absolute accuracy improvement over standard retraining on CIFAR-10 with injected label noise.
arXiv Detail & Related papers (2025-06-13T09:37:11Z) - Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations [39.31175048498422]
This study proposes Voice Quality Assessment Network (VOQANet), a deep learning-based framework with an attention mechanism to extract high-level acoustic and prosodic information from raw speech.<n>To enhance interpretability, we also introduce VOQANet+, which integrates low-level speech descriptors such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with SFM embeddings into a hybrid representation.<n>Results show that sentence-based input outperforms vowel-based input, especially at the patient level, underscoring the value of longer utterances for capturing voice attributes.
arXiv Detail & Related papers (2025-05-27T15:48:17Z) - $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)<n>MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.<n>To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Enhance Vision-Language Alignment with Noise [59.2608298578913]
We investigate whether the frozen model can be fine-tuned by customized noise.<n>We propose Positive-incentive Noise (PiNI) which can fine-tune CLIP via injecting noise into both visual and text encoders.
arXiv Detail & Related papers (2024-12-14T12:58:15Z) - Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward [10.914414815406275]
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities.<n>Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions.<n>We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture.
arXiv Detail & Related papers (2024-11-06T12:22:04Z) - Assessing the Generalization Gap of Learning-Based Speech Enhancement
Systems in Noisy and Reverberant Environments [0.7366405857677227]
Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or room impulse response database.
The present study introduces a generalization assessment framework that uses a reference model trained on the test condition.
The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), ConvTasNet, DCCRN and MANNER.
arXiv Detail & Related papers (2023-09-12T12:51:12Z) - Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z) - Certified Adversarial Defenses Meet Out-of-Distribution Corruptions:
Benchmarking Robustness and Simple Baselines [65.0803400763215]
This work critically examines how adversarial robustness guarantees change when state-of-the-art certifiably robust models encounter out-of-distribution data.
We propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data.
We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks.
arXiv Detail & Related papers (2021-12-01T17:11:22Z) - A Frequency Perspective of Adversarial Robustness [72.48178241090149]
We present a frequency-based understanding of adversarial examples, supported by theoretical and empirical findings.
Our analysis shows that adversarial examples are neither in high-frequency nor in low-frequency components, but are simply dataset dependent.
We propose a frequency-based explanation for the commonly observed accuracy vs. robustness trade-off.
arXiv Detail & Related papers (2021-10-26T19:12:34Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.