Related papers: The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

URL: http://arxiv.org/abs/2504.10694v1
Date: Mon, 14 Apr 2025 20:30:41 GMT
Title: The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Authors: Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr,
Abstract summary: We ask whether model outputs produced by existing jailbreaks are actually useful.<n>Our evaluation of eight representative jailbreaks reveals a consistent drop in model utility in jailbroken responses.<n>Overall, our work proposes the jailbreak tax as a new important metric in AI safety.
Score: 21.453837660747844
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax

Related papers

LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks.<n>Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences.<n>We propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z)
Many-Turn Jailbreaking [65.04921693379944]
We propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than a single target query.<n>We construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models.
arXiv Detail & Related papers (2025-08-09T00:02:39Z)
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.<n>Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift [10.737151905158926]
We show how to use continuous learning to detect jailbreaks and adapt rapidly to new emerging jailbreaks. We introduce an unsupervised active monitoring approach to identify novel jailbreaks.
arXiv Detail & Related papers (2025-04-28T03:01:51Z)
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [3.0700566896646047]
We show that different jailbreaking methods work via different nonlinear features in prompts. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on.
arXiv Detail & Related papers (2024-11-02T17:29:47Z)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [70.43466586161345]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks.<n>Our benchmark results on 11 recently releasedVLMs reveal significant gaps in safety alignment.<n>For instance, our challenge set ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models [86.6931690001357]
knowledge-to-jailbreak aims to generate jailbreaking attacks from domain knowledge.<n>We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs.<n>Experiments show that jailbreak-generator can generate jailbreaks comparable in harmfulness to those crafted by human experts.
arXiv Detail & Related papers (2024-06-17T15:59:59Z)
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks induce Large Language Models (LLMs) to generate harmful responses.<n>There is no consensus on evaluating jailbreaks.<n>JailbreakEval is a toolkit for evaluating jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z)
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [4.547063832007314]
It is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model's perception of prompt harmfulness.
arXiv Detail & Related papers (2024-06-13T16:26:47Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance. It scores the harmfulness of a victim model's responses to forbidden prompts. It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs [26.981225219312627]
We present a large-scale evaluation of various jailbreak attacks.<n>We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V. By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.