ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features
- URL: http://arxiv.org/abs/2408.01808v1
- Date: Sat, 3 Aug 2024 15:30:16 GMT
- Title: ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features
- Authors: Peng Cheng, Yuwei Wang, Peng Huang, Zhongjie Ba, Xiaodong Lin, Feng Lin, Li Lu, Kui Ren,
- Abstract summary: ALIF is the first black-box adversarial linguistic feature-based attack pipeline.
We present ALIF-OTL and ALIF-OTA schemes for launching attacks in both the digital domain and the physical playback environment.
- Score: 25.28307679567351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extensive research has revealed that adversarial examples (AE) pose a significant threat to voice-controllable smart devices. Recent studies have proposed black-box adversarial attacks that require only the final transcription from an automatic speech recognition (ASR) system. However, these attacks typically involve many queries to the ASR, resulting in substantial costs. Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. Building on this observation, we propose ALIF, the first black-box adversarial linguistic feature-based attack pipeline. We leverage the reciprocal process of text-to-speech (TTS) and ASR models to generate perturbations in the linguistic embedding space where the decision boundary resides. Based on the ALIF pipeline, we present the ALIF-OTL and ALIF-OTA schemes for launching attacks in both the digital domain and the physical playback environment on four commercial ASRs and voice assistants. Extensive evaluations demonstrate that ALIF-OTL and -OTA significantly improve query efficiency by 97.7% and 73.3%, respectively, while achieving competitive performance compared to existing methods. Notably, ALIF-OTL can generate an attack sample with only one query. Furthermore, our test-of-time experiment validates the robustness of our approach against ASR updates.
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF)
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - AudioFool: Fast, Universal and synchronization-free Cross-Domain Attack
on Speech Recognition [0.9913418444556487]
We investigate the needed properties of robust attacks compatible with the Over-The-Air (OTA) model.
We design a method of generating attacks with arbitrary such desired properties.
We evaluate our method on standard keyword classification tasks and analyze it in OTA.
arXiv Detail & Related papers (2023-09-20T16:59:22Z) - Robustifying automatic speech recognition by extracting slowly varying features [16.74051650034954]
We propose a defense mechanism against targeted adversarial attacks.
We use hybrid ASR models trained on data pre-processed in such a way.
Our model shows a performance on clean data similar to the baseline model, while being more than four times more robust.
arXiv Detail & Related papers (2021-12-14T13:50:23Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR)
In contrast to conventional unsupervised learning approaches, we adopt the emphmulti-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition
Systems [1.599072005190786]
Speech recognition systems are prevalent in applications for voice navigation and voice control of domestic appliances.
Deep neural networks (DNNs) have been shown to be susceptible to adversarial perturbations.
To help test the correctness of ASRS, we propose techniques that automatically generate blackbox.
arXiv Detail & Related papers (2021-12-03T10:21:47Z) - Speech Pattern based Black-box Model Watermarking for Automatic Speech
Recognition [83.2274907780273]
How to design a black-box watermarking scheme for automatic speech recognition models is still an unsolved problem.
We propose the first black-box model watermarking framework for protecting the IP of ASR models.
Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme.
arXiv Detail & Related papers (2021-10-19T09:01:41Z) - Perceptual-based deep-learning denoiser as a defense against adversarial
attacks on ASR systems [26.519207339530478]
Adversarial attacks attempt to force misclassification by adding small perturbations to the original speech signal.
We propose to counteract this by employing a neural-network based denoiser as a pre-processor in the ASR pipeline.
We found that training the denoisier using a perceptually motivated loss function resulted in increased adversarial robustness.
arXiv Detail & Related papers (2021-07-12T07:00:06Z) - Detecting Audio Attacks on ASR Systems with Dropout Uncertainty [40.9172128924305]
We show that our defense is able to detect attacks created through optimized perturbations and frequency masking.
We test our defense on Mozilla's CommonVoice dataset, the UrbanSound dataset, and an excerpt of the LibriSpeech dataset.
arXiv Detail & Related papers (2020-06-02T19:40:38Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.