Related papers: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

URL: http://arxiv.org/abs/2404.01833v1
Date: Tue, 2 Apr 2024 10:45:49 GMT
Title: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
Authors: Mark Russinovich, Ahmed Salem, Ronen Eldan,
Abstract summary: We introduce a novel jailbreak attack called Crescendo. Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b Chat, and Anthropic Chat.
Score: 5.912639903214644
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as "jailbreaks", seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we introduce Crescendomation, a tool that automates the Crescendo attack, and our evaluation showcases its effectiveness against state-of-the-art models.

Related papers

A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks [3.8246557700763715]
We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations.<n>Our results help explain why single-turn jailbreak defenses are generally ineffective against multi-turn attacks.
arXiv Detail & Related papers (2025-06-29T23:28:55Z)
Multi-turn Jailbreaking via Global Refinement and Active Fabrication [29.84573206944952]
We propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction.<n> Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques.
arXiv Detail & Related papers (2025-06-22T03:15:05Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. We evaluate five rapid response methods, all of which use jailbreak proliferation. Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z)
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks. We introduce a novel jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Our method achieves attack success rates of over 90%,80% and 74%, respectively, exceeding existing baselines by more than 60%.
arXiv Detail & Related papers (2024-09-23T10:03:09Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Poisoned LangChain: Jailbreak LLMs by LangChain [9.658883589561915]
We propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. We tested this method on six different large language models across three major categories of jailbreak issues.
arXiv Detail & Related papers (2024-06-26T07:21:02Z)
Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our framework successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [29.312244478583665]
generative AI has enabled ubiquitous access to large language models (LLMs) Jailbreak prompts have emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content originally designed to be prohibited. We show that users often succeeded in jailbreak prompts generation regardless of their expertise in LLMs. We also develop a system using AI as the assistant to automate the process of jailbreak prompt generation.
arXiv Detail & Related papers (2024-03-26T02:47:42Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.