Related papers: Revisiting Acoustic Features for Robust ASR

URL: http://arxiv.org/abs/2409.16399v1
Date: Tue, 24 Sep 2024 18:58:23 GMT
Title: Revisiting Acoustic Features for Robust ASR
Authors: Muhammad A. Shah, Bhiksha Raj
Abstract summary: We revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception. We propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression.
Score: 25.687120601256787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic Speech Recognition (ASR) systems must be robust to the myriad types of noises present in real-world environments including environmental noise, room impulse response, special effects as well as attacks by malicious actors (adversarial attacks). Recent works seek to improve accuracy and robustness by developing novel Deep Neural Networks (DNNs) and curating diverse training datasets for them, while using relatively simple acoustic features. While this approach improves robustness to the types of noise present in the training data, it confers limited robustness against unseen noises and negligible robustness to adversarial attacks. In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception that could be used to perform accurate and robust ASR. Specifically, we evaluate the ASR accuracy and robustness of several biologically inspired acoustic features. In addition to several features from prior works, such as gammatone filterbank features (GammSpec), we also propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments on diverse models and datasets show that (1) DoGSpec achieves significantly better robustness than the highly popular log mel spectrogram (LogMelSpec) with minimal accuracy degradation, and (2) GammSpec achieves better accuracy and robustness to non-adversarial noises from the Speech Robust Bench benchmark, but it is outperformed by DoGSpec against adversarial attacks.
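
For readers unfamiliar with the baseline feature named in the abstract, the following is a minimal sketch of a log mel spectrogram (LogMelSpec) in plain NumPy. It is not the paper's implementation, and it does not reproduce GammSpec, FreqMask, or DoGSpec. The function names, the HTK-style mel formula, and the parameters (16 kHz audio, 400-sample Hann frames, 160-sample hop, 40 mel bands) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumed convention; others exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    )
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sample_rate=16000, n_fft=400, hop=160,
                        n_mels=40, eps=1e-10):
    # Returns an array of shape (n_frames, n_mels).
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the waveform into overlapping frames and apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum per frame, pooled into mel bands, then log-compressed.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + eps)


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
features = log_mel_spectrogram(np.sin(2.0 * np.pi * 1000.0 * t))
```

The triangular mel filters are the "relatively simple" fixed front end that the abstract contrasts with biologically inspired alternatives; per the abstract, GammSpec uses a gammatone filterbank instead.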

Related papers

Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP [68.44229678548298]
Contrastive Spectral Rectification (CSR) is an efficient test-time defense against adversarial examples. CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack. CSR exhibits broad applicability across diverse visual tasks.
arXiv Detail & Related papers (2026-01-27T05:24:45Z)
Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition [2.0391237204597363]
Speech Emotion Recognition systems often degrade in performance when exposed to unpredictable acoustic interference. We propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks.
arXiv Detail & Related papers (2025-12-20T10:05:58Z)
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers [40.4026420070893]
We introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types.
arXiv Detail & Related papers (2025-08-04T08:15:16Z)
Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution [19.32372029477596]
We apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains.
arXiv Detail & Related papers (2025-06-02T19:11:16Z)
Measuring the Robustness of Audio Deepfake Detectors [59.09338266364506]
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations.
arXiv Detail & Related papers (2025-03-21T23:21:17Z)
Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis [6.858439600092057]
We explore the use of prosody, or the high-level linguistic features of human speech, as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models. We show that we can explain the prosodic features that have highest impact on the model's decision.
arXiv Detail & Related papers (2025-02-20T16:52:55Z)
A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy. We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z)
Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
DEMONet: Underwater Acoustic Target Recognition based on Multi-Expert Network and Cross-Temporal Variational Autoencoder [22.271499386492533]
Building a robust underwater acoustic recognition system in real-world scenarios is challenging due to the complex underwater environment. We propose DEMONet, which utilizes the detection of envelope modulation on noise (DEMON) to provide robust insights into the shaft frequency or blade counts of targets. To mitigate noise and spurious modulation spectra in DEMON features, we introduce a cross-temporal alignment strategy and employ a variational autoencoder (VAE) to reconstruct noise-resistant DEMON spectra to replace the raw DEMON features.
arXiv Detail & Related papers (2024-11-05T03:04:51Z)
Filtered Randomized Smoothing: A New Defense for Robust Modulation Classification [16.974803642923465]
We study the problem of designing robust modulation classifiers that can provide provable defense against arbitrary attacks. We propose Filtered Randomized Smoothing (FRS), a novel defense which combines spectral filtering together with randomized smoothing. We show that FRS significantly outperforms existing defenses including AT and RS in terms of accuracy on both attacked and benign signals.
arXiv Detail & Related papers (2024-10-08T20:17:25Z)
A Spectral Perspective towards Understanding and Improving Adversarial Robustness [8.912245110734334]
Adversarial training (AT) has proven to be an effective defense approach, but the mechanism for robustness improvement is not fully understood. We show that AT induces the deep model to focus more on the low-frequency region, which retains the shape-biased representations, to gain robustness. We propose a spectral alignment regularization (SAR) such that the spectral output inferred by an attacked adversarial input stays as close as possible to its natural input counterpart.
arXiv Detail & Related papers (2023-06-25T14:47:03Z)
Improve Noise Tolerance of Robust Loss via Noise-Awareness [60.34670515595074]
We propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster for brevity). Four SOTA robust loss functions are integrated with our algorithm, and comprehensive experiments substantiate the general availability and effectiveness of the proposed method in both its noise tolerance and performance.
arXiv Detail & Related papers (2023-01-18T04:54:58Z)
Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise [18.19207291891767]
Adversarial attacks against deep ASR systems are highly successful. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Inverse filter bank features generally perform better in both clean and noisy environments.
arXiv Detail & Related papers (2022-11-03T07:25:45Z)
SAR Despeckling using a Denoising Diffusion Probabilistic Model [52.25981472415249]
The presence of speckle degrades the image quality and adversely affects the performance of SAR image understanding applications. We introduce SAR-DDPM, a denoising diffusion probabilistic model for SAR despeckling. The proposed method achieves significant improvements in both quantitative and qualitative results over the state-of-the-art despeckling methods.
arXiv Detail & Related papers (2022-06-09T14:00:26Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
Robustifying automatic speech recognition by extracting slowly varying features [16.74051650034954]
We propose a defense mechanism against targeted adversarial attacks. We use hybrid ASR models trained on data pre-processed in such a way. Our model shows a performance on clean data similar to the baseline model, while being more than four times more robust.
arXiv Detail & Related papers (2021-12-14T13:50:23Z)
Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines [65.0803400763215]
This work critically examines how adversarial robustness guarantees change when state-of-the-art certifiably robust models encounter out-of-distribution data. We propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data. We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks.
arXiv Detail & Related papers (2021-12-01T17:11:22Z)
A Frequency Perspective of Adversarial Robustness [72.48178241090149]
We present a frequency-based understanding of adversarial examples, supported by theoretical and empirical findings. Our analysis shows that adversarial examples are neither in high-frequency nor in low-frequency components, but are simply dataset dependent. We propose a frequency-based explanation for the commonly observed accuracy vs. robustness trade-off.
arXiv Detail & Related papers (2021-10-26T19:12:34Z)
Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize. We propose to utilize the high-frequency noises for face forgery detection. The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales. The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.