Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- URL: http://arxiv.org/abs/2403.17336v2
- Date: Mon, 30 Sep 2024 21:25:23 GMT
- Title: Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- Authors: Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, Ning Zhang
- Abstract summary: Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs).
Jailbreak prompts have emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content that these models were designed to prohibit.
We show that users often succeeded in generating jailbreak prompts regardless of their expertise in LLMs.
We also develop an AI-assisted system to automate the process of jailbreak prompt generation.
- Score: 29.312244478583665
- Abstract: Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs). Empowered by their exceptional capabilities to understand and generate human-like text, these models are being increasingly integrated into our society. At the same time, there are also concerns about the potential misuse of this powerful technology, prompting defensive measures from service providers. To overcome such protections, jailbreak prompts have recently emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content that these models were designed to prohibit. Due to the rapid development of LLMs and their ease of access via natural language, the frontline of jailbreak prompts is largely seen in online forums and among hobbyists. To gain a better understanding of the threat landscape of semantically meaningful jailbreak prompts, we systemized existing prompts and measured their jailbreak effectiveness empirically. Further, we conducted a user study involving 92 participants with diverse backgrounds to unveil the process of manually creating jailbreak prompts. We observed that users often succeeded in generating jailbreak prompts regardless of their expertise in LLMs. Building on the insights from the user study, we also developed a system using AI as the assistant to automate the process of jailbreak prompt generation.
Related papers
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue [36.44365630876591]
Large Language Models (LLMs) demonstrate an outstanding reservoir of knowledge and strong understanding capabilities.
LLMs have been shown to be prone to illegal or unethical responses when subjected to jailbreak attacks.
We propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values.
arXiv Detail & Related papers (2024-11-06T10:32:09Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack that hides harmful instructions using a prompt-level jailbreak, boosts the attack success rate using a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Poisoned LangChain: Jailbreak LLMs by LangChain [9.658883589561915]
We propose the concept of indirect jailbreak, achieved through Retrieval-Augmented Generation (RAG) with LangChain.
We tested this method on six different large language models across three major categories of jailbreak issues.
arXiv Detail & Related papers (2024-06-26T07:21:02Z) - Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack [86.6931690001357]
Knowledge-to-jailbreak aims to generate jailbreaks from domain knowledge to evaluate the safety of large language models on specialized domains.
We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as a jailbreak generator.
arXiv Detail & Related papers (2024-06-17T15:59:59Z) - Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, with respect to copyright infringement under naive prompts.
We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards.
Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z) - Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs).
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z) - GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models [14.571852591904092]
One major safety measure is to proactively test Large Language Models with jailbreaks prior to release.
We propose a novel yet intuitive strategy to generate jailbreaks in the style of human-written prompts.
Our system of different roles leverages a knowledge graph of existing jailbreaks to generate new ones.
arXiv Detail & Related papers (2024-02-05T18:54:43Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z) - "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [50.22128133926407]
We conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023.
We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies.
We identify five highly effective jailbreak prompts that achieve attack success rates of 0.95 on ChatGPT (GPT-3.5) and GPT-4.
arXiv Detail & Related papers (2023-08-07T16:55:20Z)