Mitigating Closed-model Adversarial Examples with Bayesian Neural
Modeling for Enhanced End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2202.08532v1
- Date: Thu, 17 Feb 2022 09:17:58 GMT
- Title: Mitigating Closed-model Adversarial Examples with Bayesian Neural
Modeling for Enhanced End-to-End Speech Recognition
- Authors: Chao-Han Huck Yang, Zeeshan Ahmed, Yile Gu, Joseph Szurley, Roger Ren,
Linda Liu, Andreas Stolcke, Ivan Bulyko
- Abstract summary: We focus on a rigorous and empirical "closed-model adversarial robustness" setting.
We propose an advanced Bayesian neural network (BNN) based adversarial detector.
We improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets.
- Score: 18.83748866242237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we aim to enhance the system robustness of end-to-end automatic
speech recognition (ASR) against adversarially-noisy speech examples. We focus
on a rigorous and empirical "closed-model adversarial robustness" setting
(e.g., on-device or cloud applications). The adversarial noise is only
generated by closed-model optimization (e.g., evolutionary and zeroth-order
estimation) without accessing gradient information of a targeted ASR model
directly. We propose an advanced Bayesian neural network (BNN) based
adversarial detector, which could model latent distributions against adaptive
adversarial perturbation with divergence measurement. We further simulate
deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR
systems with the proposed adversarial detection system. Leveraging the proposed
BNN based detection system, we improve detection rate by +2.77 to +5.42%
(relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on
LibriSpeech datasets compared to the current model enhancement methods against
the adversarial speech examples.
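The abstract names zeroth-order estimation as one way to craft adversarial noise without gradient access. The sketch below illustrates the general idea with a two-point random-direction estimator. It is not the paper's attack: the function name `zeroth_order_gradient`, the parameters `num_queries` and `mu`, and the quadratic `toy_black_box_loss` (a stand-in for querying a closed ASR model) are all assumptions made for illustration.

```python
import numpy as np


def zeroth_order_gradient(loss_fn, x, num_queries=50, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at x from function queries only.

    Hypothetical helper: averages symmetric finite differences taken
    along random Gaussian directions, so no model gradients are needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_queries):
        u = rng.standard_normal(x.shape)  # random probing direction
        # Directional derivative along u, estimated from two queries.
        delta = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu)
        grad += delta * u
    return grad / num_queries


def toy_black_box_loss(x, target):
    """Toy stand-in for a closed model's score (an assumption, not an ASR)."""
    return float(np.sum((x - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.linspace(-1.0, 1.0, 8)
    x = np.zeros(8)

    def loss(v):
        return toy_black_box_loss(v, target)

    grad = zeroth_order_gradient(loss, x, num_queries=2000, rng=rng)
    x_next = x - 0.1 * grad  # one query-only descent step
    print("loss before:", loss(x))
    print("loss after: ", loss(x_next))
```

Each estimate costs `2 * num_queries` calls to the model, which is why query-based attacks of this kind are typically far more expensive than gradient-based ones.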
Related papers
- ConvNext Based Neural Network for Anti-Spoofing [6.047242590232868]
Automatic speaker verification (ASV) is widely used in real life for identity authentication.
With the rapid development of speech conversion, speech algorithms and the improvement of the quality of recording devices, ASV systems are vulnerable to spoofing attacks.
arXiv Detail & Related papers (2022-09-14T05:53:37Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
- Robustifying automatic speech recognition by extracting slowly varying features [16.74051650034954]
We propose a defense mechanism against targeted adversarial attacks.
We use hybrid ASR models trained on data pre-processed in such a way.
Our model performs similarly to the baseline model on clean data, while being more than four times more robust.
arXiv Detail & Related papers (2021-12-14T13:50:23Z)
- Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z)
- Class-Conditional Defense GAN Against End-to-End Speech Attacks [82.21746840893658]
We propose a novel approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo.
Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal.
Our defense-GAN considerably outperforms conventional defense algorithms in terms of word error rate and sentence level recognition accuracy.
arXiv Detail & Related papers (2020-10-22T00:02:02Z)
- Audio Spoofing Verification using Deep Convolutional Neural Networks by Transfer Learning [0.0]
We propose a speech classifier based on a deep convolutional neural network to detect spoofing attacks.
Our proposed methodology uses an acoustic time-frequency representation of power spectral densities on the Mel frequency scale.
We have achieved an equal error rate (EER) of 0.9056% on the development dataset and 5.32% on the evaluation dataset of the logical access scenario.
arXiv Detail & Related papers (2020-08-08T07:14:40Z)
- Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification [21.582072216282725]
Machine learning systems and, specifically, automatic speech recognition (ASR) systems are vulnerable to adversarial attacks.
In this paper, we focus on hybrid ASR systems and compare four acoustic models regarding their ability to indicate uncertainty under attack.
We are able to detect adversarial examples with an area under the receiver operating characteristic curve (AUROC) score of more than 0.99.
arXiv Detail & Related papers (2020-05-24T19:31:02Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)