Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts
Against Open-source LLMs
- URL: http://arxiv.org/abs/2402.14872v2
- Date: Tue, 27 Feb 2024 13:49:22 GMT
- Title: Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts
Against Open-source LLMs
- Authors: Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien
Chang
- Abstract summary: Large Language Models (LLMs) generate text based on input sequences but are vulnerable to jailbreak attacks.
Jailbreak prompts are semantically more varied than the original questions used for queries.
We introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLMs by generating jailbreak prompts that are semantically similar to the original question.
- Score: 30.8029926520819
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), used in creative writing, code generation, and
translation, generate text based on input sequences but are vulnerable to
jailbreak attacks, where crafted prompts induce harmful outputs. Most jailbreak
prompt methods use a combination of jailbreak templates followed by questions
to ask to create jailbreak prompts. However, existing jailbreak prompt designs
generally suffer from excessive semantic differences, resulting in an inability
to resist defenses that use simple semantic metrics as thresholds. Jailbreak
prompts are semantically more varied than the original questions used for
queries. In this paper, we introduce a Semantic Mirror Jailbreak (SMJ) approach
that bypasses LLMs by generating jailbreak prompts that are semantically
similar to the original question. We model the search for jailbreak prompts
that satisfy both semantic similarity and jailbreak validity as a
multi-objective optimization problem and employ a standardized set of genetic
algorithms for generating eligible prompts. Compared to the baseline
AutoDAN-GA, SMJ achieves attack success rates (ASR) that are at most 35.4%
higher without ONION defense and 85.2% higher with ONION defense. SMJ's better
performance in all three semantic meaningfulness metrics of Jailbreak Prompt,
Similarity, and Outlier, also means that SMJ is resistant to defenses that use
those metrics as thresholds.
Related papers
- Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks.
We introduce a novel jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs.
Our method achieves attack success rates of over 90%,80% and 74%, respectively, exceeding existing baselines by more than 60%.
arXiv Detail & Related papers (2024-09-23T10:03:09Z) - HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF)
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs)
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A
Vision Paper [16.078682415975337]
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs)
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
arXiv Detail & Related papers (2024-02-24T05:34:43Z) - Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks.
TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the prompts.
TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.