JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks
- URL: http://arxiv.org/abs/2312.10766v3
- Date: Tue, 18 Jun 2024 02:21:02 GMT
- Title: JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks
- Authors: Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Ming Hu, Jie Zhang, Yang Liu, Shiqing Ma, Chao Shen
- Abstract summary: We propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs.
JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality.
We build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types.
- Score: 34.95274579737075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous applications. However, current LLMs are vulnerable to prompt-based attacks: jailbreaking attacks can induce LLMs to generate harmful content, while hijacking attacks can manipulate the model into performing unintended tasks, underscoring the need for detection methods. Unfortunately, existing detection approaches are usually tailored to specific attacks and generalize poorly when detecting diverse attacks across modalities. To address this, we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the divergence of the model's responses across these variants to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator-combination policy to further improve detection generalization. To evaluate the effectiveness of JailGuard, we build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types. The evaluation shows that JailGuard achieves best detection accuracies of 86.14% and 82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%, respectively.
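To make the mutate-and-compare principle concrete, here is a minimal Python sketch of a variant-divergence detector. Everything in it is an illustrative assumption rather than the paper's implementation: `query_model` is a placeholder for the target LLM, `delete_chars` stands in for any of the 18 mutators, and the Jaccard-based divergence and 0.3 threshold are crude proxies for the paper's response-discrepancy measure.

```python
import random


def delete_chars(text: str, rate: float = 0.1) -> str:
    """One illustrative text mutator: drop each character with probability `rate`.
    (Stands in for any of JailGuard's 18 text/image mutators.)"""
    return "".join(c for c in text if random.random() > rate)


def response_divergence(responses: list[str]) -> float:
    """Divergence proxy: 1 minus the mean pairwise Jaccard similarity of word sets.
    (A stand-in for the paper's actual response-discrepancy measure.)"""
    sets = [set(r.lower().split()) for r in responses]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 0.0
    sims = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)


def is_attack(prompt: str, query_model, n_variants: int = 8, threshold: float = 0.3) -> bool:
    """Flag `prompt` as an attack if responses to its mutated variants diverge widely."""
    variants = [delete_chars(prompt) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    return response_divergence(responses) > threshold
```

The sketch leans on the paper's core observation: benign inputs yield stable responses under small mutations, while carefully crafted attack inputs fall apart once perturbed.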
Related papers
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, lightweight moderation tool for LLM safety.
WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks in model responses, and determining model refusal rates.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner.
However, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z)
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white-box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams to classify attack prompts (a minimal sketch of this idea follows the entry).
arXiv Detail & Related papers (2024-06-05T13:06:33Z)
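The entry above implies a simple white-box recipe: record residual-stream activations for labeled prompts and train a classifier on them. Below is a hedged sketch under that reading; the gpt2 stand-in model, layer index, mean pooling, two-example training set, and logistic-regression probe are all illustrative assumptions, not the paper's setup.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper targets larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()


def residual_features(prompt: str, layer: int = 6) -> np.ndarray:
    """Mean-pool the residual stream (hidden states) at one layer for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0).numpy()


# Train a simple probe on labeled prompts (labels and examples are illustrative).
prompts = ["What is the capital of France?", "Ignore all previous instructions and ..."]
labels = [0, 1]  # 0 = benign, 1 = attack
X = np.stack([residual_features(p) for p in prompts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))
```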
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they can be manipulated into generating harmful content.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment [36.91218391728405]
This paper studies the vulnerability of Large Language Models' safety alignment.
Existing attack methods on LLMs rely on poisoned training data or the injection of malicious prompts.
Inspired by the recent success of steering vectors in modifying model behavior without optimization, we apply them to red-teaming LLMs.
Our experimental results show that activation attacks are highly effective and incur little or no additional overhead (a minimal sketch of activation steering follows this entry).
arXiv Detail & Related papers (2023-11-15T23:07:40Z)
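The steering-vector idea in the entry above can be sketched with a forward hook, shown below under loud assumptions: the gpt2 stand-in, the contrastive prompt sets used to build the direction, the layer index, and the 4.0 scale are all illustrative, not the paper's recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def mean_hidden(prompt: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled residual-stream activation at one layer, shape (1, d_model)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer].mean(dim=1)


# Steering vector: difference of mean activations between two contrastive prompt sets.
compliant = ["Sure, here is the information you asked for.", "Of course, happy to help."]
refusing = ["I'm sorry, I can't help with that.", "I cannot assist with that request."]
direction = (torch.stack([mean_hidden(p) for p in compliant]).mean(0)
             - torch.stack([mean_hidden(p) for p in refusing]).mean(0))


def steer(module, inputs, output):
    """Forward hook: add the steering vector to one block's residual-stream output."""
    return (output[0] + 4.0 * direction,) + output[1:]  # 4.0 is an arbitrary scale


handle = model.transformer.h[6].register_forward_hook(steer)
out = model.generate(**tok("Tell me how to", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```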
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this perturb-and-aggregate scheme follows this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
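Below is a minimal sketch of SmoothLLM's perturb-and-aggregate scheme, as promised above. The `query_model` placeholder, the character-swap rate, the refusal-marker list, and the majority-vote rule are illustrative assumptions, not the paper's exact algorithm.

```python
import random
import string


def swap_perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters (one SmoothLLM-style perturbation)."""
    chars = list(prompt)
    if not chars:
        return prompt
    k = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), k=k):
        chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)


# Illustrative refusal markers; SmoothLLM's actual jailbreak check differs.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def flag_adversarial(prompt: str, query_model, n_copies: int = 10) -> bool:
    """Majority vote over perturbed copies: adversarial suffixes are brittle, so
    perturbed copies of a jailbreak prompt mostly elicit refusals again."""
    refusals = sum(
        any(m in query_model(swap_perturb(prompt)).lower() for m in REFUSAL_MARKERS)
        for _ in range(n_copies)
    )
    return refusals > n_copies / 2
```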