LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A
Vision Paper
- URL: http://arxiv.org/abs/2402.15727v2
- Date: Mon, 4 Mar 2024 05:37:40 GMT
- Title: LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A
Vision Paper
- Authors: Daoyuan Wu and Shuai Wang and Yang Liu and Ning Liu
- Abstract summary: Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs)
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
- Score: 16.078682415975337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreaking is an emerging adversarial attack that bypasses the safety
alignment deployed in off-the-shelf large language models (LLMs). A
considerable amount of research exists proposing more effective jailbreak
attacks, including the recent Greedy Coordinate Gradient (GCG) attack,
jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and
multilingual jailbreak. In contrast, the defensive side has been relatively
less explored. This paper proposes a lightweight yet practical defense called
SELFDEFEND, which can defend against all existing jailbreak attacks with
minimal delay for jailbreak prompts and negligible delay for normal user
prompts. Our key insight is that regardless of the kind of jailbreak strategies
employed, they eventually need to include a harmful prompt (e.g., "how to make
a bomb") in the prompt sent to LLMs, and we found that existing LLMs can
effectively recognize such harmful prompts that violate their safety policies.
Based on this insight, we design a shadow stack that concurrently checks
whether a harmful prompt exists in the user prompt and triggers a checkpoint in
the normal stack once a token of "No" or a harmful prompt is output. The latter
could also generate an explainable LLM response to adversarial prompts. We
demonstrate our idea of SELFDEFEND works in various jailbreak scenarios through
manual analysis in GPT-3.5/4. We also list three future directions to further
enhance SELFDEFEND.
Related papers
- Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains.
We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF)
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z) - BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose $textbfBaThe, a simple yet effective jailbreak defense mechanism.
Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks.
TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the prompts.
TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z) - Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.