Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models
- URL: http://arxiv.org/abs/2405.06134v2
- Date: Wed, 17 Jul 2024 15:59:26 GMT
- Title: Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models
- Authors: Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales
- Abstract summary: We show that `special tokens' can be exploited by adversarial attacks to manipulate the model's behavior.
We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token.
Experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples.
- Score: 5.942307521138583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate `special tokens' in their vocabulary, such as $\texttt{<|endoftext|>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively `muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97\% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to `muting' adversarial attacks, which can pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems or, conversely, to protect private speech data.
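The core of the method can be sketched as a gradient-descent loop that trains a short audio prefix to maximize the probability of $\texttt{<|endoftext|>}$ as the first decoded token. The sketch below assumes the Hugging Face transformers and torchaudio packages; the model size, amplitude bound, learning rate, and the simplified decoder prompt (supervising only the first generated position) are illustrative choices rather than the authors' exact recipe, and `dataset` is a placeholder for a set of 16 kHz training utterances.

```python
import torch
import torch.nn.functional as F
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SR = 16_000
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
for p in model.parameters():          # only the audio prefix is trained
    p.requires_grad_(False)

# Differentiable stand-in for Whisper's log-mel front end (n_fft, hop length
# and mel count match Whisper; the normalisation below approximates it).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=400, hop_length=160, n_mels=80)

def log_mel(audio: torch.Tensor) -> torch.Tensor:
    m = torch.clamp(mel(audio), min=1e-10).log10()
    m = torch.maximum(m, m.max() - 8.0)
    return ((m + 4.0) / 4.0)[None, :, :3000]          # (1, 80, 3000)

adv = torch.zeros(int(0.64 * SR), requires_grad=True)  # universal 0.64 s prefix
opt = torch.optim.Adam([adv], lr=1e-3)
eot = torch.tensor([processor.tokenizer.eos_token_id])       # <|endoftext|>
sot = torch.tensor([[model.config.decoder_start_token_id]])  # <|startoftranscript|>

for speech in dataset:  # `dataset`: iterable of 16 kHz waveforms under 30 s (assumed)
    audio = torch.cat([torch.tanh(adv) * 0.02, speech])  # small-amplitude prefix
    audio = F.pad(audio, (0, 30 * SR - audio.numel()))   # Whisper expects 30 s input
    logits = model(input_features=log_mel(audio), decoder_input_ids=sot).logits
    loss = F.cross_entropy(logits[:, -1, :], eot)        # push first token to <|endoftext|>
    opt.zero_grad(); loss.backward(); opt.step()
```

In the full attack the decoder prompt also carries Whisper's language and task tokens, and the learned prefix is evaluated by checking that generation terminates immediately on held-out speech.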
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into intensity bins produces a simple representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
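As a concrete illustration of the entry above, here is a minimal sketch of mel-channel discretization, assuming uniform binning of log-mel energies with torchaudio; the bin count, value range, and front-end settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torchaudio

def dmel_tokens(wave: torch.Tensor, sr: int = 16_000,
                n_mels: int = 80, n_bins: int = 16) -> torch.Tensor:
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(wave)
    logmel = torch.clamp(mel, min=1e-10).log()
    lo, hi = logmel.min(), logmel.max()
    # Uniformly bin each channel's log energy into integer codes 0..n_bins-1.
    codes = ((logmel - lo) / (hi - lo + 1e-8) * (n_bins - 1)).round().long()
    return codes                       # (n_mels, frames) grid of discrete tokens

wave = torch.randn(16_000)             # 1 s of dummy audio
print(dmel_tokens(wave).shape)         # torch.Size([80, 81])
```

Each frame becomes a column of small integers, which is what makes the representation directly consumable by token-based sequence models.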
- Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models [3.1511847280063696]
Speech-enabled foundation models can perform tasks other than automatic speech recognition when given an appropriate prompt.
With the development of audio-prompted large language models there is the potential for even greater control options.
We demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks.
arXiv Detail & Related papers (2024-07-05T13:04:31Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR [58.136778669618096]
Unified speech-text models have achieved remarkable performance on various speech tasks.
We propose to model speech tokens in an autoregressive way, similar to text.
We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance.
arXiv Detail & Related papers (2023-11-08T08:45:14Z)
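To make the comparison in the entry above concrete, here is a hedged sketch contrasting the conventional masked loss with plain cross-entropy applied to speech-token positions as well; the shapes, random tensors, and speech/text split point are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

vocab, T_speech, T_text = 1000, 6, 4
logits = torch.randn(1, T_speech + T_text, vocab)          # decoder outputs
targets = torch.randint(0, vocab, (1, T_speech + T_text))  # next-token targets

# Conventional recipe: mask out the loss on discrete speech-token positions,
# supervising only the text portion of the sequence.
mask = torch.cat([torch.zeros(T_speech), torch.ones(T_text)]).bool()
masked_loss = F.cross_entropy(logits[0][mask], targets[0][mask])

# The alternative the paper studies: plain cross-entropy everywhere,
# treating speech tokens autoregressively exactly like text.
plain_loss = F.cross_entropy(logits[0], targets[0])
print(masked_loss.item(), plain_loss.item())
```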
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
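A minimal sketch of the special-token manipulation described above, using the Hugging Face transformers interface to Whisper; the model size and the dummy noise audio are assumptions, and `forced_decoder_ids` is the classic prompt-override mechanism (newer library versions also expose `language`/`task` arguments on `generate` directly).

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# 1 s of dummy noise stands in for real speech input.
feats = processor(torch.randn(16_000).numpy(), sampling_rate=16_000,
                  return_tensors="pt").input_features

# Default prompt tokens: <|startoftranscript|><|en|><|transcribe|>
transcribe = processor.get_decoder_prompt_ids(language="en", task="transcribe")
# Manipulated prompt: treat the audio as French and translate into English.
translate = processor.get_decoder_prompt_ids(language="fr", task="translate")

for prompt in (transcribe, translate):
    ids = model.generate(feats, forced_decoder_ids=prompt, max_new_tokens=20)
    print(processor.batch_decode(ids, skip_special_tokens=True))
```

Swapping a handful of special tokens in the decoder prompt is the entire intervention; the model weights are untouched.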
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
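A loose sketch of the target-switching idea described above; the recurrent encoder, sign "quantizer", and InfoNCE loss here are simplified stand-ins for the wav2vec 2.0 components, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive(context: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    # InfoNCE over time: frame t's context should match code t, not others.
    sim = context @ codes.T / 0.1
    return F.cross_entropy(sim, torch.arange(sim.size(0)))

T, D = 50, 256
encoder = torch.nn.GRU(D, D, batch_first=True)     # stand-in context network
quantize = torch.sign                              # toy stand-in "quantizer"

clean = torch.randn(1, T, D)                       # features of original speech
noisy = clean + 0.1 * torch.randn_like(clean)      # features of its noisy copy

c_clean, _ = encoder(clean)
c_noisy, _ = encoder(noisy)
q_clean, q_noisy = quantize(clean), quantize(noisy)

# Standard terms: each view predicts its own quantized codes...
loss = contrastive(c_clean[0], q_clean[0]) + contrastive(c_noisy[0], q_noisy[0])
# ...switched terms: each view must also predict the *other* view's codes,
# forcing agreement between original and noisy representations.
loss = loss + contrastive(c_clean[0], q_noisy[0]) + contrastive(c_noisy[0], q_clean[0])
loss.backward()
```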
- GANtron: Emotional Speech Synthesis with Generative Adversarial Networks [0.0]
We propose a text-to-speech model where the inferred speech can be tuned with the desired emotions.
We use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism.
arXiv Detail & Related papers (2021-10-06T10:44:30Z)
- Audio Adversarial Examples: Attacks Using Vocal Masks [0.0]
We construct audio adversarial examples on automatic Speech-To-Text systems.
We produce another by overlaying an audio vocal mask generated from the original audio.
We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx.
arXiv Detail & Related papers (2021-02-04T05:21:10Z)
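A heavily hedged sketch of the overlay step described above, assuming a band-pass "vocal" filter and a fixed mixing weight; the paper's actual mask-generation procedure differs, and every parameter here is an illustrative assumption.

```python
import torch
import torchaudio

def overlay_vocal_mask(wave: torch.Tensor, sr: int = 16_000,
                       alpha: float = 0.3) -> torch.Tensor:
    # Keep roughly the vocal band (centred near 1.65 kHz) of the original...
    vocal = torchaudio.functional.bandpass_biquad(wave, sr,
                                                  central_freq=1650.0, Q=0.5)
    # ...time-reverse it so the mask is unintelligible, then overlay it.
    mask = torch.flip(vocal, dims=[0])
    return torch.clamp(wave + alpha * mask, -1.0, 1.0)

adv = overlay_vocal_mask(torch.randn(16_000) * 0.1)   # dummy 1 s input
```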
- Cortical Features for Defense Against Adversarial Audio Attacks [55.61885805423492]
We propose using a computational model of the auditory cortex as a defense against adversarial attacks on audio.
We show that the cortical features help defend against universal adversarial examples.
arXiv Detail & Related papers (2021-01-30T21:21:46Z)