All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- URL: http://arxiv.org/abs/2401.09798v3
- Date: Mon, 12 Feb 2024 02:29:28 GMT
- Title: All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- Authors: Kazuhiro Takemoto
- Abstract summary: This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts.
Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM.
Our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), such as ChatGPT, encounter 'jailbreak'
challenges, wherein safeguards are circumvented to elicit ethically harmful
outputs. This study introduces a straightforward black-box method for
efficiently crafting jailbreak prompts, addressing the significant complexity
and computational costs associated with conventional methods. Our technique
iteratively transforms harmful prompts into benign expressions directly
utilizing the target LLM, predicated on the hypothesis that LLMs can
autonomously generate expressions that evade safeguards. Through experiments
conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method
consistently achieved an attack success rate exceeding 80% within an average of
five iterations for forbidden questions and proved robust against model
updates. The jailbreak prompts generated were not only naturally worded and
succinct but also challenging to defend against. These findings suggest that
the creation of effective jailbreak prompts is less complex than previously
believed, underscoring the heightened risk posed by black-box jailbreak
attacks.
Related papers
- Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans.
LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses.
We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against the advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks.
We introduce a novel jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs.
Our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%.
arXiv Detail & Related papers (2024-09-23T10:03:09Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack that hides harmful instructions using a prompt-level jailbreak, boosts the attack success rate using a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
- Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between attack performance and efficiency, and show that the transferability of the jailbreak prompts remains viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [50.22128133926407]
We conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023.
We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies.
We identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4.
arXiv Detail & Related papers (2023-08-07T16:55:20Z)
- Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [12.540530764250812]
We propose a formalism and a taxonomy of known (and possible) jailbreaks.
We release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.
arXiv Detail & Related papers (2023-05-24T09:57:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.