Related papers: JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

URL: http://arxiv.org/abs/2406.09321v1
Date: Thu, 13 Jun 2024 16:59:43 GMT
Title: JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Authors: Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang,
Abstract summary: Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions. There is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. JailbreakEval is a user-friendly toolkit focusing on the evaluation of jailbreak attempts.
Score: 21.854909839996612
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions, presenting severe misuse threats to LLMs. Up to now, research into jailbreak attacks and defenses is emerging, however, there is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. In other words, the methods to assess the harmfulness of an LLM's response are varied, such as manual annotation or prompting GPT-4 in specific ways. Each approach has its own set of strengths and weaknesses, impacting their alignment with human values, as well as the time and financial cost. This diversity in evaluation presents challenges for researchers in choosing suitable evaluation methods and conducting fair comparisons across different jailbreak attacks and defenses. In this paper, we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak research released between May 2023 and April 2024. Our study introduces a systematic taxonomy of jailbreak evaluators, offering in-depth insights into their strengths and weaknesses, along with the current status of their adaptation. Moreover, to facilitate subsequent research, we propose JailbreakEval, a user-friendly toolkit focusing on the evaluation of jailbreak attempts. It includes various well-known evaluators out-of-the-box, so that users can obtain evaluation results with only a single command. JailbreakEval also allows users to customize their own evaluation workflow in a unified framework with the ease of development and comparison. In summary, we regard JailbreakEval to be a catalyst that simplifies the evaluation process in jailbreak research and fosters an inclusive standard for jailbreak evaluation within the community.

Related papers

LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks.<n>Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences.<n>We propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z)
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring [45.76641811031552]
We introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework.<n>Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision.<n>We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans.
arXiv Detail & Related papers (2025-08-28T14:40:27Z)
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.<n>Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation [22.75124155879712]
Large language models (LLMs) remain vulnerable to jailbreak attacks. We propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection JBShield-D and mitigation JBShield-M.
arXiv Detail & Related papers (2025-02-11T13:50:50Z)
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. We evaluate five rapid response methods, all of which use jailbreak proliferation. Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Rethinking How to Evaluate Language Model Jailbreak [16.301224741410312]
We propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems.
arXiv Detail & Related papers (2024-04-09T15:54:16Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance. It scores the harmfulness of a victim model's responses to forbidden prompts. It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs. Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates. We discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.