Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
- URL: http://arxiv.org/abs/2501.07959v2
- Date: Sat, 01 Feb 2025 09:30:34 GMT
- Title: Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
- Authors: Jiaqi Hua, Wanxu Wei
- Abstract summary: Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. We propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3) \cite{llama3modelcard}. In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.
Related papers
- Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains.
We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [57.86886012610389]
Jailbreak attacks exploit vulnerabilities to elicit unintended or harmful outputs.
We introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks.
We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak benchmarks to demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way input prompts are constructed.
Our SIJ method achieves near 100% attack success rates on five well-known open LLMs on AdvBench, while incurring lower time costs compared to previous methods.
To address this vulnerability, we propose a simple defense method called SelfReminderKey to counter SIJ and demonstrate its effectiveness through experimental results.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
A jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model can then defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z) - Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses [37.56003689042975]
Many-shot (up to hundreds) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability.
We propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool.
arXiv Detail & Related papers (2024-06-03T12:59:17Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs).
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs).
A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP).
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks [34.95274579737075]
We propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs.
JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality (a minimal, hypothetical sketch of this robustness check appears after the list below).
We build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types.
arXiv Detail & Related papers (2023-12-17T17:02:14Z) - FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models [11.517609196300217]
We introduce FuzzLLM, an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in Large Language Models (LLMs).
We utilize templates to capture the structural integrity of a prompt and isolate key features of a jailbreak class as constraints.
By integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, FuzzLLM enables efficient testing with reduced manual effort.
arXiv Detail & Related papers (2023-09-11T07:15:02Z)
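As a rough illustration of the robustness-check principle summarised in the JailGuard entry above, the sketch below perturbs an incoming prompt a few times, queries the target model once per variant, and flags the input when the responses diverge sharply. The mutation operators, the divergence metric, the threshold, and the `query_model` callable are hypothetical placeholders chosen for this sketch; none of them come from the JailGuard paper or its released code.

```python
"""Minimal, hypothetical sketch of robustness-based attack detection:
perturb a prompt, query the model per variant, and flag high divergence."""
import random
from difflib import SequenceMatcher
from typing import Callable, List


def mutate(prompt: str, rng: random.Random) -> str:
    """Apply one cheap textual perturbation (drop, swap, or duplicate a character)."""
    chars = list(prompt)
    if len(chars) < 2:
        return prompt
    op = rng.choice(["drop", "swap", "dup"])
    i = rng.randrange(len(chars) - 1)
    if op == "drop":
        del chars[i]
    elif op == "swap":
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:  # duplicate a character
        chars.insert(i, chars[i])
    return "".join(chars)


def divergence(responses: List[str]) -> float:
    """Mean pairwise dissimilarity of responses, via difflib's SequenceMatcher ratio."""
    if len(responses) < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            total += 1.0 - SequenceMatcher(None, responses[i], responses[j]).ratio()
            pairs += 1
    return total / pairs


def is_suspicious(prompt: str,
                  query_model: Callable[[str], str],
                  n_variants: int = 4,
                  threshold: float = 0.35,
                  seed: int = 0) -> bool:
    """Return True when responses to perturbed variants diverge beyond the threshold."""
    rng = random.Random(seed)
    variants = [prompt] + [mutate(prompt, rng) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    return divergence(responses) > threshold
```

In use, `query_model` would be any callable mapping a prompt string to a model response (e.g. a wrapper around a local chat model), and the threshold would need calibration on benign traffic before the flag is meaningful.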