Related papers: Intention Analysis Makes LLMs A Good Jailbreak Defender

Intention Analysis Makes LLMs A Good Jailbreak Defender

URL: http://arxiv.org/abs/2401.06561v3
Date: Mon, 29 Apr 2024 16:40:57 GMT
Title: Intention Analysis Makes LLMs A Good Jailbreak Defender
Authors: Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao,
Abstract summary: In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($mathbbIA$) The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process. $mathbbIA$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness.
Score: 79.4014719271075
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. Notably, $\mathbb{IA}$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across ChatGLM, LLaMA2, Vicuna, MPT, DeepSeek, and GPT-3.5 show that $\mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -53.1% attack success rate) and maintain the general helpfulness. Encouragingly, with the help of our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. Further analyses present some insights into how our method works. To facilitate reproducibility, we release our code and scripts at: https://github.com/alphadl/SafeLLM_with_IntentionAnalysis.

Related papers

xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text. We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples. We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
LIAR (Leveraging Inference time Alignment to jailbReak) is a fast and efficient best-of-N approach tailored for jailbreak attacks. Our results demonstrate that a best-of-N approach is a simple yet highly effective strategy for evaluating the robustness of aligned LLMs.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks. We then make a key observation: existing defense mechanisms suffer from an textbfover-prudence problem. We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs) ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators. We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs. PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [39.829517061574364]
Even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks" We propose the generation exploitation attack, which disrupts model alignment by only manipulating variations of decoding methods. Our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs.
arXiv Detail & Related papers (2023-10-10T20:15:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.