Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
- URL: http://arxiv.org/abs/2505.15406v1
- Date: Wed, 21 May 2025 11:47:47 GMT
- Title: Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
- Authors: Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen,
- Abstract summary: We introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs.<n>We use this dataset to evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks.<n>Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs.
- Score: 19.373533532464915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text to speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
Related papers
- AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models [19.59499038333469]
Jailbreak attacks to large audio-language models (LALMs) are studied recently, but they achieve suboptimal effectiveness, applicability, and practicability.<n>We propose AudioJailbreak, a novel audio jailbreak attack featuring asynchrony, universality, stealthiness, and over-the-air robustness.
arXiv Detail & Related papers (2025-05-20T09:10:45Z) - SPIRIT: Patching Speech Language Models against Jailbreak Attacks [21.299244714520828]
Speech Language Models (SLMs) enable natural interactions via spoken instructions.<n>We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks.<n>To improve security, we propose post-hoc patching defenses used to intervene during inference.
arXiv Detail & Related papers (2025-05-18T21:51:24Z) - Multilingual and Multi-Accent Jailbreaking of Audio LLMs [19.5428160851918]
Multi-AudioJail is the first systematic framework to exploit multilingual and multi-accent audio jailbreaks.<n>We show how acoustic perturbations interact with cross-lingual phonetics to cause jailbreak success rates to surge.<n>We plan to release our dataset to spur research into cross-modal defenses.
arXiv Detail & Related papers (2025-04-01T18:12:23Z) - Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models [35.884976768636726]
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks.<n>Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs.<n>These advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attack.
arXiv Detail & Related papers (2025-01-23T15:51:38Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.<n>Our threat model checks if a given jailbreak is likely to occur in the distribution of text.<n>We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language.<n>We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.