Revisiting Acoustic Features for Robust ASR
- URL: http://arxiv.org/abs/2409.16399v1
- Date: Tue, 24 Sep 2024 18:58:23 GMT
- Title: Revisiting Acoustic Features for Robust ASR
- Authors: Muhammad A. Shah, Bhiksha Raj,
- Abstract summary: We revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception.
We propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression.
- Score: 25.687120601256787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems must be robust to the myriad types of noises present in real-world environments including environmental noise, room impulse response, special effects as well as attacks by malicious actors (adversarial attacks). Recent works seek to improve accuracy and robustness by developing novel Deep Neural Networks (DNNs) and curating diverse training datasets for them, while using relatively simple acoustic features. While this approach improves robustness to the types of noise present in the training data, it confers limited robustness against unseen noises and negligible robustness to adversarial attacks. In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception that could be used to perform accurate and robust ASR. In contrast, Specifically, we evaluate the ASR accuracy and robustness of several biologically inspired acoustic features. In addition to several features from prior works, such as gammatone filterbank features (GammSpec), we also propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments on diverse models and datasets show that (1) DoGSpec achieves significantly better robustness than the highly popular log mel spectrogram (LogMelSpec) with minimal accuracy degradation, and (2) GammSpec achieves better accuracy and robustness to non-adversarial noises from the Speech Robust Bench benchmark, but it is outperformed by DoGSpec against adversarial attacks.
Related papers
- DEMONet: Underwater Acoustic Target Recognition based on Multi-Expert Network and Cross-Temporal Variational Autoencoder [22.271499386492533]
Building a robust underwater acoustic recognition system in real-world scenarios is challenging due to the complex underwater environment.
We propose DEMONet, which utilizes the detection of envelope modulation on noise (DEMON) to provide robust insights into the shaft frequency or blade counts of targets.
To mitigate noise and spurious modulation spectra in DEMON features, we introduce a cross-temporal alignment strategy and employ a variational autoencoder (VAE) to reconstruct noise-resistant DEMON spectra to replace the raw DEMON features.
arXiv Detail & Related papers (2024-11-05T03:04:51Z) - Filtered Randomized Smoothing: A New Defense for Robust Modulation Classification [16.974803642923465]
We study the problem of designing robust modulation classifiers that can provide provable defense against arbitrary attacks.
We propose Filtered Randomized Smoothing (FRS), a novel defense which combines spectral filtering together with randomized smoothing.
We show that FRS significantly outperforms existing defenses including AT and RS in terms of accuracy on both attacked and benign signals.
arXiv Detail & Related papers (2024-10-08T20:17:25Z) - A Spectral Perspective towards Understanding and Improving Adversarial
Robustness [8.912245110734334]
adversarial training (AT) has proven to be an effective defense approach, but mechanism for robustness improvement is not fully understood.
We show that AT induces the deep model to focus more on the low-frequency region, which retains the shape-biased representations, to gain robustness.
We propose a spectral alignment regularization (SAR) such that the spectral output inferred by an attacked adversarial input stays as close as possible to its natural input counterpart.
arXiv Detail & Related papers (2023-06-25T14:47:03Z) - Improve Noise Tolerance of Robust Loss via Noise-Awareness [60.34670515595074]
We propose a meta-learning method which is capable of adaptively learning a hyper parameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster for brevity)
Four SOTA robust loss functions are attempted to be integrated with our algorithm, and comprehensive experiments substantiate the general availability and effectiveness of the proposed method in both its noise tolerance and performance.
arXiv Detail & Related papers (2023-01-18T04:54:58Z) - Leveraging Domain Features for Detecting Adversarial Attacks Against
Deep Speech Recognition in Noise [18.19207291891767]
adversarial attacks against deep ASR systems are highly successful.
This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection.
Inverse filter bank features generally perform better in both clean and noisy environments.
arXiv Detail & Related papers (2022-11-03T07:25:45Z) - SAR Despeckling using a Denoising Diffusion Probabilistic Model [52.25981472415249]
The presence of speckle degrades the image quality and adversely affects the performance of SAR image understanding applications.
We introduce SAR-DDPM, a denoising diffusion probabilistic model for SAR despeckling.
The proposed method achieves significant improvements in both quantitative and qualitative results over the state-of-the-art despeckling methods.
arXiv Detail & Related papers (2022-06-09T14:00:26Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - Robustifying automatic speech recognition by extracting slowly varying features [16.74051650034954]
We propose a defense mechanism against targeted adversarial attacks.
We use hybrid ASR models trained on data pre-processed in such a way.
Our model shows a performance on clean data similar to the baseline model, while being more than four times more robust.
arXiv Detail & Related papers (2021-12-14T13:50:23Z) - Certified Adversarial Defenses Meet Out-of-Distribution Corruptions:
Benchmarking Robustness and Simple Baselines [65.0803400763215]
This work critically examines how adversarial robustness guarantees change when state-of-the-art certifiably robust models encounter out-of-distribution data.
We propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data.
We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks.
arXiv Detail & Related papers (2021-12-01T17:11:22Z) - A Frequency Perspective of Adversarial Robustness [72.48178241090149]
We present a frequency-based understanding of adversarial examples, supported by theoretical and empirical findings.
Our analysis shows that adversarial examples are neither in high-frequency nor in low-frequency components, but are simply dataset dependent.
We propose a frequency-based explanation for the commonly observed accuracy vs. robustness trade-off.
arXiv Detail & Related papers (2021-10-26T19:12:34Z) - Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.