Intention Analysis Makes LLMs A Good Jailbreak Defender
- URL: http://arxiv.org/abs/2401.06561v3
- Date: Mon, 29 Apr 2024 16:40:57 GMT
- Title: Intention Analysis Makes LLMs A Good Jailbreak Defender
- Authors: Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao,
- Abstract summary: In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($mathbbIA$)
The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process.
$mathbbIA$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness.
- Score: 79.4014719271075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. Notably, $\mathbb{IA}$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across ChatGLM, LLaMA2, Vicuna, MPT, DeepSeek, and GPT-3.5 show that $\mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -53.1% attack success rate) and maintain the general helpfulness. Encouragingly, with the help of our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. Further analyses present some insights into how our method works. To facilitate reproducibility, we release our code and scripts at: https://github.com/alphadl/SafeLLM_with_IntentionAnalysis.
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
LIAR (Leveraging Inference time Alignment to jailbReak) is a fast and efficient best-of-N approach tailored for jailbreak attacks.
Our results demonstrate that a best-of-N approach is a simple yet highly effective strategy for evaluating the robustness of aligned LLMs.
arXiv Detail & Related papers (2024-12-06T18:02:59Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials.
On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [39.829517061574364]
Even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks"
We propose the generation exploitation attack, which disrupts model alignment by only manipulating variations of decoding methods.
Our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs.
arXiv Detail & Related papers (2023-10-10T20:15:54Z) - Safe Linear Bandits over Unknown Polytopes [39.177982674455784]
The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints.
We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes.
arXiv Detail & Related papers (2022-09-27T21:13:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.