A StrongREJECT for Empty Jailbreaks
- URL: http://arxiv.org/abs/2402.10260v1
- Date: Thu, 15 Feb 2024 18:58:09 GMT
- Title: A StrongREJECT for Empty Jailbreaks
- Authors: Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh,
Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins,
Sam Toyer
- Abstract summary: There is no standard benchmark for measuring the severity of a jailbreak.
We present StrongREJECT, which better discriminates between effective and ineffective jailbreaks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of large language models (LLMs) has drawn attention to the existence
of "jailbreaks" that allow the models to be used maliciously. However, there is
no standard benchmark for measuring the severity of a jailbreak, leaving
authors of jailbreak papers to create their own. We show that these benchmarks
often include vague or unanswerable questions and use grading criteria that are
biased towards overestimating the misuse potential of low-quality model
responses. Some jailbreak techniques make the problem worse by decreasing the
quality of model responses even on benign questions: we show that several
jailbreaking techniques substantially reduce the zero-shot performance of GPT-4
on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an
"uncensored" open-source model. We present a new benchmark, StrongREJECT, which
better discriminates between effective and ineffective jailbreaks by using a
higher-quality question set and a more accurate response grading algorithm. We
show that our new grading scheme better accords with human judgment of response
quality and overall jailbreak effectiveness, especially on the sort of
low-quality responses that contribute the most to over-estimation of jailbreak
performance on existing benchmarks. We release our code and data at
https://github.com/alexandrasouly/strongreject.
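The response grading algorithm is central to the benchmark's claim of discriminating effective from ineffective jailbreaks. As a rough illustration of the kind of scheme the abstract describes, the sketch below assumes an LLM judge that rates each response for refusal, convincingness, and specificity; the rubric items, helper names, and score formula are illustrative assumptions, not the released autograder (see the repository above for the actual prompts and weighting).
```python
# A minimal sketch of an LLM-judged grading scheme in the spirit of
# StrongREJECT. Rubric items and the score formula are assumptions.
from dataclasses import dataclass

@dataclass
class JudgeRating:
    refused: bool        # did the model refuse outright?
    convincingness: int  # 1 (useless) .. 5 (highly convincing)
    specificity: int     # 1 (vague) .. 5 (actionable detail)

def rate_with_judge(question: str, response: str) -> JudgeRating:
    """Hypothetical judge call: replace with a real LLM API call that
    asks the judge to fill in the three rubric items above."""
    raise NotImplementedError

def jailbreak_score(rating: JudgeRating) -> float:
    """Map the rubric onto [0, 1]. Refusals score 0; non-refusals are
    scored by answer quality, so 'technically non-refusing' but useless
    responses also land near 0."""
    if rating.refused:
        return 0.0
    quality = (rating.convincingness + rating.specificity) / 2  # in [1, 5]
    return (quality - 1) / 4  # rescale [1, 5] -> [0, 1]
```
Gating the quality terms on non-refusal is what penalizes the "empty" jailbreaks of the title: a response that nominally complies but says nothing specific contributes almost nothing to the score.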
Related papers
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
- JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions.
There is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful.
JailbreakEval is a user-friendly toolkit focusing on the evaluation of jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z)
- Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [4.547063832007314]
This paper analyses model activations on different jailbreak inputs.
We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes.
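As a rough sketch of how such a vector might be extracted and applied, assuming access to hidden states at a chosen layer (the layer choice, scaling factor, and function names below are assumptions, not the paper's exact setup):
```python
# Sketch of the "jailbreak vector" idea: the difference of mean hidden
# activations between jailbreak-wrapped and plain versions of the same
# prompts, subtracted from the residual stream at inference time.
import torch

def jailbreak_vector(hidden_jb: torch.Tensor,
                     hidden_plain: torch.Tensor) -> torch.Tensor:
    """hidden_*: (n_prompts, d_model) activations from one chosen layer."""
    return hidden_jb.mean(dim=0) - hidden_plain.mean(dim=0)

def steer_away(hidden: torch.Tensor, v: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Subtract the jailbreak direction; per the paper's finding, a vector
    from one jailbreak class can mitigate jailbreaks from other classes."""
    return hidden - alpha * v
```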
arXiv Detail & Related papers (2024-06-13T16:26:47Z)
- GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation [9.377563769107843]
We introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach to jailbreaking with only black-box access.
Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target.
We find that IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries.
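A structural sketch of such a loop, where `ask`, `refine_prompt`, and `rate_response` are hypothetical callbacks standing in for the paper's actual prompts:
```python
# Structural sketch of an iterative-refinement self-jailbreak loop in the
# spirit of IRIS: one black-box model serves as both attacker and target.
from typing import Callable, Optional

def iterative_refinement(ask: Callable[[str], str],
                         refine_prompt: Callable[[str, str, str], str],
                         rate_response: Callable[[str, str], int],
                         goal: str, max_queries: int = 7) -> Optional[str]:
    prompt = goal
    for _ in range(max_queries):
        response = ask(prompt)                  # query the target model
        if rate_response(goal, response) >= 5:  # judged on-target (1-5 scale)
            return response
        # the same model explains why the attempt failed and rewrites it
        prompt = refine_prompt(goal, prompt, response)
    return None
```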
arXiv Detail & Related papers (2024-05-21T03:16:35Z)
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques does not adequately address.
JailbreakBench is an open-sourced benchmark comprising several components that address these challenges.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
- EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs).
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
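A minimal sketch of that four-component decomposition, using illustrative interfaces that are assumptions rather than the framework's actual API:
```python
# Sketch of EasyJailbreak's stated decomposition into Selector, Mutator,
# Constraint, and Evaluator. Interfaces and the round logic are assumed.
from typing import Callable, Protocol

class Selector(Protocol):
    def select(self, seeds: list[str]) -> list[str]: ...  # pick candidates

class Mutator(Protocol):
    def mutate(self, prompt: str) -> list[str]: ...       # rewrite/perturb

class Constraint(Protocol):
    def allows(self, prompt: str) -> bool: ...            # filter candidates

class Evaluator(Protocol):
    def succeeded(self, prompt: str, response: str) -> bool: ...

def attack_round(seeds: list[str], selector: Selector, mutator: Mutator,
                 constraint: Constraint, evaluator: Evaluator,
                 query: Callable[[str], str]) -> list[tuple[str, bool]]:
    """One round: select -> mutate -> filter -> query -> evaluate."""
    results = []
    for prompt in selector.select(seeds):
        for cand in mutator.mutate(prompt):
            if constraint.allows(cand):
                results.append((cand, evaluator.succeeded(cand, query(cand))))
    return results
```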
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs [30.8029926520819]
Large Language Models (LLMs) generate text based on input sequences but are vulnerable to jailbreak attacks.
Jailbreak prompts are semantically more varied than the original questions used for queries.
We introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLM safeguards by generating jailbreak prompts that are semantically similar to the original question.
arXiv Detail & Related papers (2024-02-21T15:13:50Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.