Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework
- URL: http://arxiv.org/abs/2505.18864v1
- Date: Sat, 24 May 2025 20:46:36 GMT
- Title: Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework
- Authors: Binhao Ma, Hanqing Guo, Zhengping Jay Luo, Rui Duan
- Abstract summary: We present an adversarial attack targeting the speech input of aligned Multimodal Large Language Models (MLLMs) in a white-box scenario. We introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. Our approach achieves up to an 89 percent attack success rate across multiple restricted tasks.
- Score: 6.002582335323663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to an 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.
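To make the attack idea concrete, here is a minimal, hypothetical sketch of a token-level white-box search in this spirit: greedy coordinate descent over discrete speech tokens to maximize the likelihood of a target continuation. `ToySpeechLM`, all sizes, and the loss are illustrative stand-ins, not SpeechGPT's tokenizer, architecture, or the authors' actual algorithm.

```python
# Minimal, hypothetical sketch of a token-level white-box jailbreak search.
# ToySpeechLM and all sizes are stand-ins, not SpeechGPT's actual design.
import torch
import torch.nn as nn

VOCAB, DIM, ADV_LEN = 500, 64, 12        # toy speech-token vocabulary / sizes

class ToySpeechLM(nn.Module):
    """Stand-in for an aligned LM that consumes discrete speech tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, ids):              # ids: (T,) -> logits: (T, VOCAB)
        return self.head(self.emb(ids))

model = ToySpeechLM().eval()
target = torch.randint(0, VOCAB, (8,))     # token ids of a forbidden reply
adv = torch.randint(0, VOCAB, (ADV_LEN,))  # adversarial speech-token prefix

@torch.no_grad()
def target_loss(prefix):
    ids = torch.cat([prefix, target])
    logits = model(ids)[ADV_LEN - 1:-1]    # positions predicting the target
    return nn.functional.cross_entropy(logits, target).item()

# Greedy coordinate descent: sweep one adversarial position at a time over
# the whole vocabulary, keeping whichever token lowers the target loss most.
for step in range(3 * ADV_LEN):
    pos = step % ADV_LEN
    best_tok, best_loss = adv[pos].item(), target_loss(adv)
    for cand in range(VOCAB):
        adv[pos] = cand
        loss = target_loss(adv)
        if loss < best_loss:
            best_tok, best_loss = cand, loss
    adv[pos] = best_tok

# `adv` would then go through the model's speech detokenizer / vocoder to
# produce the audio prompt that is actually played to the system.
```

In a real attack, the final detokenization step matters: the adversarial sequence is only useful once it is synthesized back into a waveform the model will re-tokenize into (approximately) the same tokens.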
Related papers
- What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
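A minimal sketch of what a multi-token-prediction head could look like, assuming K parallel linear heads over each hidden state; the backbone, sizes, and vocabulary here are illustrative, not the paper's design.

```python
# Sketch of a multi-token-prediction (MTP) head: K parallel linear heads
# decode K upcoming speech tokens from each hidden state (toy sizes).
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    def __init__(self, dim=256, vocab=1024, k=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim, vocab) for _ in range(k)])

    def forward(self, h):                       # h: (T, dim) hidden states
        # Returns (T, k, vocab): k speech-token distributions per position.
        return torch.stack([head(h) for head in self.heads], dim=1)

logits = MTPHead()(torch.randn(10, 256))        # -> torch.Size([10, 4, 1024])
```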
arXiv Detail & Related papers (2025-06-14T15:26:31Z) - Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities [76.9327488986162]
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. We exploit the capabilities of MLLMs to interpret non-textual instructions, specifically adversarial images or audio generated by our novel method, Con Instruction. Our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B).
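One way such non-textual instructions could be constructed, sketched under strong assumptions: optimize a waveform so a (toy) audio encoder maps it near the embedding of a text instruction. `audio_enc` and `target_emb` are placeholders, not Con Instruction's actual method.

```python
# Hedged sketch of embedding-matching: optimize audio so a toy encoder maps
# it close to a text instruction's embedding. All components are stand-ins.
import torch

torch.manual_seed(0)
audio_enc = torch.nn.Linear(16000, 256)   # toy stand-in for an audio encoder
with torch.no_grad():
    target_emb = torch.randn(256)         # would be text_enc("instruction")

wave = torch.zeros(16000, requires_grad=True)      # 1 s of audio at 16 kHz
opt = torch.optim.Adam([wave], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = 1 - torch.nn.functional.cosine_similarity(
        audio_enc(wave), target_emb, dim=0)
    loss.backward()
    opt.step()
# `wave` now "spells out" the instruction in embedding space, even though it
# need not sound like intelligible speech to a human listener.
```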
arXiv Detail & Related papers (2025-05-31T13:11:14Z) - SPIRIT: Patching Speech Language Models against Jailbreak Attacks [21.299244714520828]
Speech Language Models (SLMs) enable natural interactions via spoken instructions. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks. To improve security, we propose post-hoc patching defenses that intervene during inference.
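As one hedged illustration of an inference-time intervention (not necessarily SPIRIT's mechanism), a forward hook could attenuate activations along a presumed "jailbreak direction"; the direction, layer, and scale below are all assumptions.

```python
# Hypothetical inference-time patch: a forward hook that dampens activations
# along an assumed "jailbreak direction". Not SPIRIT's actual mechanism.
import torch

def make_patch_hook(direction, scale=0.8):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        proj = (output @ d).unsqueeze(-1) * d   # component along the direction
        return output - scale * proj            # dampen that component
    return hook

block = torch.nn.Linear(64, 64)                 # stands in for one SLM block
block.register_forward_hook(make_patch_hook(torch.randn(64)))
patched = block(torch.randn(5, 64))             # activations after the patch
```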
arXiv Detail & Related papers (2025-05-18T21:51:24Z) - Multilingual and Multi-Accent Jailbreaking of Audio LLMs [19.5428160851918]
Multi-AudioJail is the first systematic framework to exploit multilingual and multi-accent audio jailbreaks. We show how acoustic perturbations interact with cross-lingual phonetics to cause jailbreak success rates to surge. We plan to release our dataset to spur research into cross-modal defenses.
arXiv Detail & Related papers (2025-04-01T18:12:23Z) - Exploiting Vulnerabilities in Speech Translation Systems through Targeted Adversarial Attacks [59.87470192277124]
This paper explores methods of compromising speech translation systems through imperceptible audio manipulations. We present two innovative approaches: (1) the injection of perturbations into source audio, and (2) the generation of adversarial music designed to guide targeted translation. Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly. The implications of this research extend beyond immediate security concerns, shedding light on the interpretability and robustness of neural speech processing systems.
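A generic PGD-style sketch of the perturbation-injection idea, assuming white-box gradients; the toy "translation head", epsilon budget, and step size are illustrative, not the paper's setup.

```python
# Generic PGD-style sketch of perturbation injection (white-box assumption).
import torch

def pgd_audio(model, wave, target_ids, eps=0.002, alpha=5e-4, steps=100):
    delta = torch.zeros_like(wave, requires_grad=True)
    for _ in range(steps):
        logits = model(wave + delta)                       # (T, vocab)
        loss = torch.nn.functional.cross_entropy(logits, target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()             # push toward target
            delta.clamp_(-eps, eps)                        # keep it inaudible
        delta.grad.zero_()
    return (wave + delta).detach()

toy = torch.nn.Sequential(torch.nn.Linear(16000, 5 * 32),
                          torch.nn.Unflatten(0, (5, 32)))  # toy decoder head
adv = pgd_audio(toy, torch.randn(16000), torch.randint(0, 32, (5,)))
```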
arXiv Detail & Related papers (2025-03-02T16:38:16Z) - `Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs [33.49407213040455]
We introduce the first voice-based jailbreak attack against multimodal large language models (LLMs). We propose a novel strategy in which the disallowed prompt is flanked by benign, narrative-driven prompts. We demonstrate that the Flanking Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs.
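The flanking construction itself is straightforward to sketch; the strings below are illustrative placeholders, and the restricted request is intentionally left elided.

```python
# Illustrative-only sketch of the flanking construction: the disallowed
# request is wrapped between benign, narrative-driven prompts before TTS.
benign_intro = "Let me tell you a story about a curious scientist."
benign_outro = "Afterwards, she returned to her quiet garden."
disallowed = "[restricted request left elided]"

flanked_prompt = " ".join([benign_intro, disallowed, benign_outro])
# `flanked_prompt` would then be passed to a TTS system to produce the
# audio actually sent to the model.
```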
arXiv Detail & Related papers (2025-02-02T10:05:08Z) - "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models [0.9480364746270077]
This paper explores audio jailbreaks targeting Audio-Language Models (ALMs). We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples. We analyze how ALMs interpret these audio adversarial examples and reveal that they encode imperceptible first-person toxic speech.
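A hedged sketch of how a universal perturbation might be trained: one shared delta optimized over many (audio, target) pairs so it transfers across inputs. Model, data, and objective are toy stand-ins, not the paper's setup.

```python
# Hedged sketch of a *universal* perturbation: one shared delta optimized
# over many samples. Model, data, and objective are toy assumptions.
import torch

model = torch.nn.Sequential(torch.nn.Linear(16000, 4 * 32),
                            torch.nn.Unflatten(0, (4, 32)))   # toy ALM head
dataset = [(torch.randn(16000), torch.randint(0, 32, (4,))) for _ in range(8)]

delta = torch.zeros(16000, requires_grad=True)   # the universal perturbation
opt = torch.optim.Adam([delta], lr=1e-3)
for epoch in range(5):
    for wave, target_ids in dataset:             # same delta for every sample
        opt.zero_grad()
        logits = model(wave + delta)
        torch.nn.functional.cross_entropy(logits, target_ids).backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-0.005, 0.005)          # keep it near-imperceptible
```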
arXiv Detail & Related papers (2025-02-02T08:36:23Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
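The failure-analysis loop could look roughly like the sketch below, where `llm` and `victim` are hypothetical callables and the refusal check is a crude placeholder, not BAP's actual procedure.

```python
# Hypothetical sketch of a failure-driven refinement loop: an auxiliary LLM
# critiques a failed attempt and proposes a rewritten prompt. `llm` and
# `victim` are assumed callables; the refusal check is a crude placeholder.
def refine_prompt(llm, victim, prompt, image, max_rounds=5):
    for _ in range(max_rounds):
        reply = victim(prompt, image)
        if "sorry" not in reply.lower():   # naive success heuristic (assumption)
            return prompt
        critique = llm(
            "Why did this jailbreak fail?\n"
            f"Prompt: {prompt}\nReply: {reply}\n"
            "Reason step by step, then output only the rewritten prompt "
            "on the final line."
        )
        prompt = critique.splitlines()[-1]  # adopt the rewritten prompt
    return prompt
```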
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
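A toy sketch of this alternating bi-modal pattern: a signed-gradient step on the continuous image prefix, then a greedy swap at one discrete text-suffix position. `affirmative_loss` is a placeholder for the affirmative-response objective, not the paper's loss.

```python
# Toy sketch of alternating bi-modal optimization; the loss is a placeholder.
import torch

VOCAB, SUF_LEN = 100, 8
img = torch.rand(3 * 32 * 32, requires_grad=True)   # adversarial image prefix
suffix = torch.randint(0, VOCAB, (SUF_LEN,))        # adversarial text suffix

def affirmative_loss(image, suf):
    # Placeholder for -log p("Sure, here is ..." | image prefix, text+suffix).
    return ((image.sum() + suf.float().sum()) * 1e-3 - 1.0) ** 2

for step in range(20):
    loss = affirmative_loss(img, suffix)            # continuous image step
    loss.backward()
    with torch.no_grad():
        img -= 0.01 * img.grad.sign()
        img.clamp_(0.0, 1.0)
    img.grad.zero_()
    pos = step % SUF_LEN                            # discrete text step
    with torch.no_grad():
        best, best_l = suffix[pos].item(), affirmative_loss(img, suffix).item()
        for cand in range(VOCAB):
            trial = suffix.clone()
            trial[pos] = cand
            l = affirmative_loss(img, trial).item()
            if l < best_l:
                best, best_l = cand, l
        suffix[pos] = best
```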
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models [34.557309967708406]
In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking.
We design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement.
Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics.
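For the black-box setting, one generic query-only approach (not necessarily the authors' method) is random search over a bounded perturbation, keeping changes that raise a scalar jailbreak score; the score function below is a toy stand-in.

```python
# Generic query-only (black-box) sketch: random search over a bounded
# perturbation. The scalar "jailbreak score" here is a toy stand-in.
import torch

def random_search_attack(score_fn, wave, eps=0.003, steps=200):
    delta = torch.zeros_like(wave)
    best = score_fn(wave + delta)
    for _ in range(steps):
        trial = (delta + 0.1 * eps * torch.randn_like(wave)).clamp(-eps, eps)
        score = score_fn(wave + trial)
        if score > best:                 # keep only improving perturbations
            best, delta = score, trial
    return wave + delta

probe = torch.randn(16000)               # toy scorer: correlation with a probe
adv = random_search_attack(lambda w: float(w @ probe), torch.randn(16000))
```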
arXiv Detail & Related papers (2024-05-14T04:51:23Z) - SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
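Prompt tuning itself is easy to sketch: freeze the backbone and train only a small set of prompt vectors prepended to the input embeddings. The toy backbone and sizes below are assumptions, not SpeechGen's configuration.

```python
# Minimal sketch of prompt tuning over a frozen speech LM: only `prompts`
# is trainable. Backbone and sizes are toy assumptions, not SpeechGen's.
import torch
import torch.nn as nn

class PromptTuner(nn.Module):
    def __init__(self, frozen_lm, n_prompts=10, dim=256):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad_(False)                 # backbone stays frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, emb):                         # emb: (T, dim)
        return self.lm(torch.cat([self.prompts, emb], dim=0))

backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1024))
out = PromptTuner(backbone)(torch.randn(20, 256))   # -> torch.Size([30, 1024])
```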
arXiv Detail & Related papers (2023-06-03T22:35:27Z)