All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- URL: http://arxiv.org/abs/2401.09798v3
- Date: Mon, 12 Feb 2024 02:29:28 GMT
- Title: All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- Authors: Kazuhiro Takemoto
- Abstract summary: This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts.
Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM.
Our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), such as ChatGPT, encounter `jailbreak'
challenges, wherein safeguards are circumvented to generate ethically harmful
prompts. This study introduces a straightforward black-box method for
efficiently crafting jailbreak prompts, addressing the significant complexity
and computational costs associated with conventional methods. Our technique
iteratively transforms harmful prompts into benign expressions directly
utilizing the target LLM, predicated on the hypothesis that LLMs can
autonomously generate expressions that evade safeguards. Through experiments
conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method
consistently achieved an attack success rate exceeding 80% within an average of
five iterations for forbidden questions and proved robust against model
updates. The jailbreak prompts generated were not only naturally-worded and
succinct but also challenging to defend against. These findings suggest that
the creation of effective jailbreak prompts is less complex than previously
believed, underscoring the heightened risk posed by black-box jailbreak
attacks.
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
We introduce a novel jailbreak method, which targets the external properties of LLMs.
By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
We propose a simple defense method called Self-Reminder-Key to counter SIJ.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning [19.45092401994873]
This study investigates indirect jailbreak attacks on Large Language Models(LLMs)
We introduce a novel attack vector named Retrieval Augmented Generation Poisoning.
Pandora exploits the synergy between LLMs and RAG through prompt manipulation to generate unexpected responses.
arXiv Detail & Related papers (2024-02-13T12:40:39Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [12.540530764250812]
We propose a formalism and a taxonomy of known (and possible) jailbreaks.
We release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.
arXiv Detail & Related papers (2023-05-24T09:57:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.