Related papers: Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs

URL: http://arxiv.org/abs/2505.14286v1
Date: Tue, 20 May 2025 12:35:59 GMT
Title: Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs
Authors: Rao Ma, Mengjie Qian, Vyas Raina, Mark Gales, Kate Knill
Abstract summary: We investigate universal acoustic adversarial attacks on speech LLMs. We find critical vulnerabilities in Qwen2-Audio and Granite-Speech. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.
Score: 6.8285467057172555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The combination of pre-trained speech encoders with large language models has enabled the development of speech LLMs that can handle a wide range of spoken language processing tasks. While these models are powerful and flexible, this very flexibility may make them more vulnerable to adversarial attacks. To examine the extent of this problem, in this work we investigate universal acoustic adversarial attacks on speech LLMs. Here a fixed, universal, adversarial audio segment is prepended to the original input audio. We initially investigate attacks that cause the model to either produce no output or to perform a modified task overriding the original prompt. We then extend the nature of the attack to be selective so that it activates only when specific input attributes, such as a speaker gender or spoken language, are present. Inputs without the targeted attribute should be unaffected, allowing fine-grained control over the model outputs. Our findings reveal critical vulnerabilities in Qwen2-Audio and Granite-Speech and suggest that similar speech LLMs may be susceptible to universal adversarial attacks. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.
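The prepending step described in the abstract can be illustrated with a minimal sketch. It covers only how one fixed segment is attached to every input, not how the segment is learned. The function name `prepend_universal_segment`, the 16 kHz sampling rate, the amplitude bound, and the random placeholder waveforms are all assumptions made for illustration and do not come from the paper.

```python
import random

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not stated in the abstract)


def prepend_universal_segment(adv_segment, audio, max_amplitude=None):
    """Return the adversarial segment followed by the original audio.

    adv_segment and audio are sequences of float samples. If max_amplitude
    is given, the adversarial samples are clipped to [-max_amplitude,
    max_amplitude]; this bound is a hypothetical constraint, not one taken
    from the paper.
    """
    adv = list(adv_segment)
    if max_amplitude is not None:
        adv = [max(-max_amplitude, min(max_amplitude, s)) for s in adv]
    return adv + list(audio)


rng = random.Random(0)

# Placeholder for a learned segment: 0.5 s of low-amplitude noise.
adv_segment = [rng.uniform(-0.02, 0.02) for _ in range(SAMPLE_RATE // 2)]

# Two stand-in utterances of 1 s and 2 s.
utterances = [
    [rng.gauss(0.0, 0.1) for _ in range(SAMPLE_RATE * seconds)]
    for seconds in (1, 2)
]

# The same segment is reused unchanged for every input ("universal").
attacked = [
    prepend_universal_segment(adv_segment, u, max_amplitude=0.02)
    for u in utterances
]
```

Because the segment is prepended rather than added on top of the signal, the original samples pass through unmodified and the attacked input is simply longer by the length of the segment.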

Related papers

Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack [0.0]
We propose a universal targeted latent space attack on audio-language models. Our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model.
arXiv Detail & Related papers (2025-12-29T21:56:13Z)
Backdoor Attacks Against Speech Language Models [63.07317091368079]
We present the first systematic study of audio backdoor attacks against speech language models. We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks. We propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
arXiv Detail & Related papers (2025-10-01T17:45:04Z)
BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs [84.59993864748195]
We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions. A separate TTS model, the "orchestra", then generates the speech from these features.
arXiv Detail & Related papers (2025-09-30T16:52:14Z)
SPBA: Utilizing Speech Large Language Model for Backdoor Attacks on Speech Classification Models [4.67675814519416]
Speech-based human-computer interaction is vulnerable to backdoor attacks. In this paper, we propose that speech backdoor attacks can strategically focus on speech elements such as timbre and emotion. The proposed attack is called the Speech Prompt Backdoor Attack (SPBA).
arXiv Detail & Related papers (2025-06-10T02:01:00Z)
Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework [6.002582335323663]
We present an adversarial attack targeting the speech input of aligned Multimodal Large Language Models (MLLMs) in a white-box scenario. We introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. Our approach achieves up to 89 percent attack success rate across multiple restricted tasks.
arXiv Detail & Related papers (2025-05-24T20:46:36Z)
"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models [0.9480364746270077]
This paper explores audio jailbreaks targeting Audio-Language Models (ALMs). We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples. We analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech.
arXiv Detail & Related papers (2025-02-02T08:36:23Z)
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models [3.1511847280063696]
Speech enabled foundation models can perform tasks other than automatic speech recognition using an appropriate prompt. With the development of audio-prompted large language models there is the potential for even greater control options. We demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks.
arXiv Detail & Related papers (2024-07-05T13:04:31Z)
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models [5.942307521138583]
We show that "special tokens" can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token. Experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples.
arXiv Detail & Related papers (2024-05-09T22:59:23Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.