GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- URL: http://arxiv.org/abs/2405.13077v1
- Date: Tue, 21 May 2024 03:16:35 GMT
- Title: GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- Authors: Govind Ramesh, Yao Dou, Wei Xu,
- Abstract summary: We introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach to jailbreaking with only black-box access.
Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target.
We find IRIS jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries.
- Score: 9.377563769107843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that even well-aligned LLMs obey adversarial instructions. IRIS then rates and enhances the output given the refined prompt to increase its harmfulness. We find IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries. It significantly outperforms prior approaches in automatic, black-box and interpretable jailbreaking, while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
Related papers
- Can Large Language Models Automatically Jailbreak GPT-4V? [64.04997365446468]
We introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization.
Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%.
This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
arXiv Detail & Related papers (2024-07-23T17:50:45Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs)
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts
Against Open-source LLMs [30.8029926520819]
Large Language Models (LLMs) generate text based on input sequences but are vulnerable to jailbreak attacks.
Jailbreak prompts are semantically more varied than the original questions used for queries.
We introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLMs by generating jailbreak prompts that are semantically similar to the original question.
arXiv Detail & Related papers (2024-02-21T15:13:50Z) - A StrongREJECT for Empty Jailbreaks [74.66228107886751]
There is no standard benchmark for measuring the severity of a jailbreak.
We present StrongREJECT, which better discriminates between effective and ineffective jailbreaks.
arXiv Detail & Related papers (2024-02-15T18:58:09Z) - All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [0.0]
This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts.
Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM.
Our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions.
arXiv Detail & Related papers (2024-01-18T08:36:54Z) - Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [36.08357229578738]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks.
TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the prompts using only a small number of queries.
TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z) - Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.