LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- URL: http://arxiv.org/abs/2402.15727v2
- Date: Mon, 4 Mar 2024 05:37:40 GMT
- Title: LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- Authors: Daoyuan Wu and Shuai Wang and Yang Liu and Ning Liu
- Abstract summary: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs).
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
- Score: 16.078682415975337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreaking is an emerging adversarial attack that bypasses the safety
alignment deployed in off-the-shelf large language models (LLMs). A
considerable amount of research exists proposing more effective jailbreak
attacks, including the recent Greedy Coordinate Gradient (GCG) attack,
jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and
multilingual jailbreaks. In contrast, the defensive side has been relatively
less explored. This paper proposes a lightweight yet practical defense called
SELFDEFEND, which can defend against all existing jailbreak attacks with
minimal delay for jailbreak prompts and negligible delay for normal user
prompts. Our key insight is that regardless of the kind of jailbreak strategy
employed, the attack eventually needs to include a harmful prompt (e.g., "how to make
a bomb") in the prompt sent to LLMs, and we found that existing LLMs can
effectively recognize such harmful prompts that violate their safety policies.
Based on this insight, we design a shadow stack that concurrently checks
whether a harmful prompt exists in the user prompt and triggers a checkpoint in
the normal stack once a token of "No" or a harmful prompt is output. In the
latter case, the detected harmful prompt can also be used to generate an
explainable LLM response to the adversarial prompt. We demonstrate that the
SELFDEFEND idea works in various jailbreak scenarios through manual analysis
with GPT-3.5/4. We also list three future directions to further
enhance SELFDEFEND.
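To make the shadow-stack idea concrete, the following Python sketch shows one way the concurrent check could be wired up. It is a minimal illustration derived only from the abstract, not the authors' implementation: `llm_complete` is a hypothetical helper standing in for a single LLM call, and the wording of the checking prompt is an assumption.

```python
# Minimal sketch of the SELFDEFEND shadow-stack idea (illustrative only, not the
# authors' code). Assumes a hypothetical llm_complete(prompt) -> str helper that
# performs one LLM call, e.g. via an API client.
from concurrent.futures import ThreadPoolExecutor

CHECK_TEMPLATE = (
    "Does the following request contain content that violates your safety "
    "policies? Answer exactly 'No' if it does not; otherwise quote the harmful "
    "part of the request.\n\nRequest: {user_prompt}"
)

def llm_complete(prompt: str) -> str:
    """Placeholder for a single LLM call; assumed for illustration."""
    raise NotImplementedError

def self_defend(user_prompt: str) -> str:
    pool = ThreadPoolExecutor(max_workers=2)
    # Normal stack: generate the answer to the user prompt as usual.
    normal = pool.submit(llm_complete, user_prompt)
    # Shadow stack: concurrently ask the same LLM to screen the prompt.
    shadow = pool.submit(llm_complete, CHECK_TEMPLATE.format(user_prompt=user_prompt))

    verdict = shadow.result().strip()
    if verdict.lower().startswith("no"):
        # Checkpoint passed: release the normal response. For benign prompts the
        # one-token "No" adds only negligible delay on top of normal generation.
        answer = normal.result()
        pool.shutdown(wait=False)
        return answer
    # Checkpoint triggered: discard the normal response (a streaming deployment
    # would halt generation at this point) and return an explainable refusal
    # that quotes the harmful fragment detected by the shadow stack.
    pool.shutdown(wait=False, cancel_futures=True)
    return f"Request refused: it appears to contain a harmful instruction: {verdict}"
```

In this reading, a benign prompt pays roughly the cost of the shadow stack emitting "No", while a jailbreak prompt is stopped as soon as the harmful fragment is identified, which matches the delay characteristics claimed in the abstract.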
Related papers
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a defense strategy against jailbreak attacks based on a Hidden State Filter (HSF); a sketch of this idea appears after this list.
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
However, LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only eos (end-of-sequence) tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
- Defending LLMs against Jailbreaking Attacks via Backtranslation [61.878363293735624]
We propose a new method for defending LLMs against jailbreaking attacks by 'backtranslation'; a minimal sketch of this idea appears after this list.
The defense infers a prompt from the target LLM's initial response; this inferred prompt is called the backtranslated prompt and tends to reveal the actual intent of the original prompt.
We empirically demonstrate that our defense significantly outperforms the baselines.
arXiv Detail & Related papers (2024-02-26T10:03:33Z)
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [80.18953043605696]
We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack).
DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly via in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' over sub-prompts, aiming to find synonyms of the sub-prompts that maintain the original intent.
arXiv Detail & Related papers (2024-02-25T17:43:29Z)
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks.
TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the tested prompts.
TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z)
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
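For comparison with SELFDEFEND's prompt-level check, the hidden-state filtering idea summarized above (HSF) can be sketched as a lightweight classifier over the model's internal representations. The code below is an assumption-laden illustration, not the cited implementation: the model name is arbitrary and the classifier is an untrained placeholder standing in for one trained on benign versus adversarial prompts.

```python
# Illustrative sketch of a hidden-state filtering defense in the spirit of HSF
# (model name and classifier are assumptions, not the cited implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any causal LM with exposed hidden states

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Placeholder filter: a tiny classifier over the final hidden state. In a real
# system this would be trained on hidden states of benign vs. adversarial prompts.
filter_head = torch.nn.Sequential(
    torch.nn.Linear(model.config.hidden_size, 1),
    torch.nn.Sigmoid(),
)

def reject_before_inference(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if the prompt should be rejected before any tokens are generated."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        # Last layer's hidden state of the final prompt token as the feature vector.
        feature = outputs.hidden_states[-1][0, -1]
        score = filter_head(feature).item()
    return score > threshold
```

Because the filter only needs a forward pass over the prompt, rejection can happen before any response tokens are generated, which is the property the HSF summary emphasizes.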
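Similarly, the backtranslation defense cited above can be approximated in a few lines: infer the prompt that the model's initial response answers, then check whether the model refuses that inferred prompt. The helpers below are hypothetical and the refusal check is deliberately crude; this is a sketch of the idea, not the cited paper's code.

```python
# Illustrative sketch of a backtranslation-style defense (hypothetical helpers;
# not the cited paper's code).
def llm_complete(prompt: str) -> str:
    """Placeholder for a single LLM call; assumed for illustration."""
    raise NotImplementedError

def is_refusal(text: str) -> bool:
    """Crude refusal detector; a real system would use a more robust check."""
    lowered = text.lower()
    return any(marker in lowered for marker in ("i cannot", "i can't", "i'm sorry"))

def backtranslation_defend(user_prompt: str) -> str:
    # Step 1: get the model's initial response to the (possibly adversarial) prompt.
    initial_response = llm_complete(user_prompt)
    if is_refusal(initial_response):
        return initial_response
    # Step 2: infer ("backtranslate") a prompt that this response plausibly answers.
    backtranslated = llm_complete(
        "State, as a single instruction, the request that the following text "
        f"is answering:\n\n{initial_response}"
    )
    # Step 3: if the model refuses the backtranslated prompt, the original prompt
    # likely concealed a harmful intent, so refuse it as well.
    if is_refusal(llm_complete(backtranslated)):
        return "Request refused: the inferred intent of the prompt violates the safety policy."
    return initial_response
```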