DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM
Jailbreakers
- URL: http://arxiv.org/abs/2402.16914v2
- Date: Fri, 1 Mar 2024 07:26:50 GMT
- Title: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM
Jailbreakers
- Authors: Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
- Abstract summary: We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack).
DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' over sub-prompts, aiming to find synonyms that maintain the original intent while jailbreaking LLMs.
- Score: 80.18953043605696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The safety alignment of Large Language Models (LLMs) is vulnerable to both
manual and automated jailbreak attacks, which adversarially trigger LLMs to
output harmful content. However, current methods for jailbreaking LLMs, which
nest entire harmful prompts, are not effective at concealing malicious intent
and can be easily identified and rejected by well-aligned LLMs. This paper
discovers that decomposing a malicious prompt into separated sub-prompts can
effectively obscure its underlying malicious intent by presenting it in a
fragmented, less detectable form, thereby addressing these limitations. We
introduce an automatic prompt \textbf{D}ecomposition and
\textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack).
DrAttack includes three key components: (a) `Decomposition' of the original
prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts implicitly
by in-context learning with semantically similar but harmless reassembling
demo, and (c) a `Synonym Search' of sub-prompts, aiming to find sub-prompts'
synonyms that maintain the original intent while jailbreaking LLMs. An
extensive empirical study across multiple open-source and closed-source LLMs
demonstrates that, with a significantly reduced number of queries, DrAttack
obtains a substantial gain in success rate over prior SOTA prompt-only
attackers. Notably, its success rate of 78.0\% on GPT-4 with merely 15 queries
surpasses the previous art by 33.1\%. The project is available at
https://github.com/xirui-li/DrAttack.
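To make the three components above concrete, here is a minimal, non-operational sketch of the decompose-then-reconstruct loop on a benign request. It is not the implementation from the linked repository: `query_llm` is a stub for an actual chat API, `decompose` uses a naive word-group split where the paper uses a parsing tree, and the synonym table is a hand-written placeholder for the paper's automated synonym search.

```python
from itertools import product
from typing import Optional

def query_llm(prompt: str) -> str:
    """Stub standing in for a real chat-completion call."""
    return "[model response to: " + prompt[:60].replace("\n", " ") + " ...]"

def decompose(prompt: str) -> list[str]:
    """Illustrative decomposition: split a prompt into phrase-level
    sub-prompts. A naive word-group split stands in for the paper's
    parsing-tree-based decomposition."""
    words = prompt.split()
    return [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]

def build_demo(sub_prompts: list[str]) -> str:
    """In-context reassembling demo: show the model how to stitch
    numbered fragments back together, using a harmless example."""
    benign = ["write a poem", "about the sea", "in four lines"]
    demo = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(benign))
    query = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(sub_prompts))
    return (
        "Example: given the fragments\n" + demo +
        "\nyou combine them and answer the combined request.\n\n"
        "Now do the same for these fragments:\n" + query
    )

def synonym_candidates(sub_prompt: str) -> list[str]:
    """Hypothetical synonym search: a fixed table stands in for the
    paper's automated search over sub-prompt paraphrases."""
    table = {"write a tutorial": ["draft a guide", "compose a walkthrough"]}
    return [sub_prompt] + table.get(sub_prompt, [])

def drattack_style_search(prompt: str, judge) -> Optional[str]:
    """Try combinations of sub-prompt paraphrases until a user-supplied
    success check accepts a response."""
    subs = decompose(prompt)
    for combo in product(*(synonym_candidates(s) for s in subs)):
        response = query_llm(build_demo(list(combo)))
        if judge(response):
            return response
    return None

if __name__ == "__main__":
    out = drattack_style_search(
        "write a tutorial about planting a garden in spring",
        judge=lambda r: "response" in r,  # placeholder success check
    )
    print(out)
```

The point of the structure is that the model only ever sees numbered fragments plus a harmless reassembly example, so the combined intent is reconstructed implicitly at generation time.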
Related papers
- Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks.
This method rewrites dangerous prompts by decomposing them into a series of less harmful sub-questions.
Our experimental results show a 94% success rate on Llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
- RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process [23.66988994636578]
We introduce RePD, an innovative defense framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs).
RePD operates on a one-shot learning model, wherein it accesses a database of jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts.
We have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
arXiv Detail & Related papers (2024-10-11T09:39:11Z)
- Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models [15.764672596793352]
We analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies.
We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks.
arXiv Detail & Related papers (2024-08-05T12:20:39Z)
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion with situations derived from movies to test whether this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z)
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
We present a novel method that uses another Large Language Model, called AdvPrompter, to generate human-readable adversarial prompts in seconds.
We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM.
The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response.
arXiv Detail & Related papers (2024-04-21T22:18:13Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting (AdaShield), which prepends defense prompts to inputs to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate compared to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper [16.078682415975337]
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs).
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
arXiv Detail & Related papers (2024-02-24T05:34:43Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack [26.460067102821476]
This paper proposes an efficient tool to audit an LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack).
PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself.
Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++.
arXiv Detail & Related papers (2023-10-20T08:16:46Z)
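Since the last entry describes an audit loop in which the model perturbs its own input, here is a minimal sketch of that pattern under stated assumptions: `query_llm` is a stub for a real chat API, and both the prompt wording and the `audit_sample` helper are illustrative, not the paper's actual template or code.

```python
def query_llm(prompt: str) -> str:
    """Stub for a chat-completion call; swap in a real client to run an audit."""
    return "placeholder output"

def build_attack_prompt(sample: str, task: str, guidance: str) -> str:
    """Assemble a self-attack prompt: the original sample, the task it was
    labeled for, and a perturbation instruction. The wording is an
    illustrative stand-in, not the paper's exact template."""
    return (
        f'The original text is: "{sample}"\n'
        f"It is used for the task: {task}\n"
        "Rewrite the text so that its meaning is preserved but the model's "
        f"prediction may change. Constraint: {guidance}\n"
        "Only output the rewritten text."
    )

def classify(text: str, task: str) -> str:
    """Ask the same LLM to perform the task on a given text."""
    return query_llm(f"Task: {task}\nText: {text}\nAnswer with a single label.")

def audit_sample(sample: str, task: str, guidance: str) -> dict:
    """One audit step: generate a perturbed sample, then compare the
    victim model's predictions before and after."""
    original_pred = classify(sample, task)
    adversarial = query_llm(build_attack_prompt(sample, task, guidance))
    adversarial_pred = classify(adversarial, task)
    return {"adversarial": adversarial, "flipped": original_pred != adversarial_pred}

if __name__ == "__main__":
    result = audit_sample(
        sample="The movie was surprisingly enjoyable.",
        task="sentiment classification (positive / negative)",
        guidance="change at most two words",
    )
    print(result)
```

A fuller audit would also verify that the rewrite actually respects the guidance constraint before counting a flipped prediction as a success.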