AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
- URL: http://arxiv.org/abs/2505.14103v2
- Date: Wed, 21 May 2025 03:36:20 GMT
- Title: AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
- Authors: Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang
- Abstract summary: Jailbreak attacks against large audio-language models (LALMs) have recently been studied, but they achieve suboptimal effectiveness, applicability, and practicability. We propose AudioJailbreak, a novel audio jailbreak attack featuring asynchrony, universality, stealthiness, and over-the-air robustness.
- Score: 19.59499038333469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks against large audio-language models (LALMs) have recently been studied, but they achieve suboptimal effectiveness, applicability, and practicability, in particular because they assume the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack featuring (1) asynchrony: by crafting suffixal jailbreak audios, the jailbreak audio does not need to be aligned with the user prompt on the time axis; (2) universality: by incorporating multiple prompts into perturbation generation, a single jailbreak perturbation is effective across different prompts; (3) stealthiness: by applying various intent-concealment strategies, the malicious intent of the jailbreak audio does not raise the victim's awareness; and (4) over-the-air robustness: by incorporating the reverberation distortion effect, modeled with room impulse responses, into perturbation generation, the jailbreak audios remain effective when played over the air. In contrast, no prior audio jailbreak attack offers asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak also applies to adversaries who cannot fully manipulate user prompts, and thus covers a much broader attack scenario. Extensive experiments on the largest set of LALMs evaluated to date demonstrate the high effectiveness of AudioJailbreak. We highlight that our work provides insight into the security implications of audio jailbreak attacks against LALMs and realistically fosters improvements to their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.
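The universality and over-the-air properties described in the abstract correspond to a standard universal adversarial-perturbation recipe: optimize one suffix waveform against a batch of spoken prompts, and convolve the candidate audio with randomly sampled room impulse responses (RIRs) during optimization so the perturbation survives playback. The PyTorch sketch below is purely illustrative and is not the authors' released implementation; `surrogate_loss`, `apply_rir`, the RIR bank, and all hyperparameters are placeholder assumptions standing in for the paper's actual components.

```python
import torch
import torch.nn.functional as F

# --- Placeholders (assumptions, not the paper's actual components) ---
def surrogate_loss(audio: torch.Tensor) -> torch.Tensor:
    """Stand-in for a differentiable loss scoring how strongly an
    end-to-end LALM produces the attacker's target response."""
    return audio.pow(2).mean()  # dummy objective

def apply_rir(wave: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Simulate over-the-air playback by convolving the waveform with a
    room impulse response (reverberation distortion)."""
    w = wave.view(1, 1, -1)
    k = torch.flip(rir, dims=[-1]).view(1, 1, -1)  # conv1d computes correlation, so flip
    out = F.conv1d(w, k, padding=rir.numel() - 1)
    return out.view(-1)[: wave.numel()]

# Stand-in spoken user prompts and a small bank of synthetic RIRs.
prompts = [torch.randn(16000) for _ in range(4)]
rirs = [torch.rand(512) * torch.exp(-torch.arange(512) / 64.0) for _ in range(3)]

# Suffixal universal perturbation: appended after any prompt, so it never
# needs to be time-aligned with the prompt itself (asynchrony).
suffix = torch.zeros(8000, requires_grad=True)
opt = torch.optim.Adam([suffix], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = torch.zeros(())
    for prompt in prompts:                                  # universality: average over prompts
        audio = torch.cat([prompt, torch.tanh(suffix)])     # bounded suffix appended to prompt
        rir = rirs[torch.randint(len(rirs), (1,)).item()]   # over-the-air: random reverb
        loss = loss + surrogate_loss(apply_rir(audio, rir))
    (loss / len(prompts)).backward()
    opt.step()

jailbreak_suffix = torch.tanh(suffix).detach()              # final universal suffix waveform
```

With a real surrogate in place of the dummy loss, the resulting suffix would, under these assumptions, transfer across the prompts it was optimized on and tolerate reverberation introduced by playback.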
Related papers
- Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is [8.214994509812724]
Large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models.
arXiv Detail & Related papers (2025-07-29T13:55:23Z)
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
- Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models [19.373533532464915]
We introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We use this dataset to evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs.
arXiv Detail & Related papers (2025-05-21T11:47:47Z)
- Multilingual and Multi-Accent Jailbreaking of Audio LLMs [19.5428160851918]
Multi-AudioJail is the first systematic framework to exploit multilingual and multi-accent audio jailbreaks. We show how acoustic perturbations interact with cross-lingual phonetics to cause jailbreak success rates to surge. We plan to release our dataset to spur research into cross-modal defenses.
arXiv Detail & Related papers (2025-04-01T18:12:23Z)
- Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models [35.884976768636726]
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory inputs. These advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks.
arXiv Detail & Related papers (2025-01-23T15:51:38Z)
- Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models [0.0]
We propose a novel black-box jailbreak attack framework that incorporates various LLM-as-Attacker methods. Our method is designed based on three key observations from existing jailbreaking studies and practices.
arXiv Detail & Related papers (2024-10-31T01:55:33Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models [86.6931690001357]
Knowledge-to-jailbreak aims to generate jailbreak attacks from domain knowledge. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs. Experiments show that the jailbreak-generator can generate jailbreaks comparable in harmfulness to those crafted by human experts.
arXiv Detail & Related papers (2024-06-17T15:59:59Z)
- Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs). Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
- Voice Jailbreak Attacks Against GPT-4o [27.505874745648498]
We present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o.
We propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling.
arXiv Detail & Related papers (2024-05-29T14:07:44Z)
- EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs).
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)