Defending Jailbreak Prompts via In-Context Adversarial Game
- URL: http://arxiv.org/abs/2402.13148v2
- Date: Fri, 5 Jul 2024 00:03:24 GMT
- Title: Defending Jailbreak Prompts via In-Context Adversarial Game
- Authors: Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang,
- Abstract summary: We introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning.
Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents.
Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates.
- Score: 34.83853184278604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.
Related papers
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - No Free Lunch for Defending Against Prefilling Attack by In-Context Learning [14.156913670221867]
We show that In-Context Learning (ICL) can effectively defend against prefilling jailbreak attacks by employing adversative sentence structures within demonstrations.
Given the experimental results and our analysis, we conclude that there is no free lunch for defending against prefilling jailbreak attacks with ICL.
arXiv Detail & Related papers (2024-12-13T23:58:12Z) - Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, MLLMs remain vulnerable to jailbreak attacks.
We propose Immune, an inference-time defense framework that leverages a safe reward model to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z) - Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models [0.0]
We propose a novel black-box jailbreak attacking framework that incorporates various LLM-as-Attacker methods.
Our method is designed based on three key observations from existing jailbreaking studies and practices.
arXiv Detail & Related papers (2024-10-31T01:55:33Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF)
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks.
"jailbreaking" induces the model to generate malicious responses against the usage policy and society.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.