Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs
- URL: http://arxiv.org/abs/2406.08754v2
- Date: Fri, 19 Jul 2024 08:23:38 GMT
- Title: Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs
- Authors: Bangxin Li, Hengrui Xing, Chao Huang, Jin Qian, Huangqing Xiao, Linfeng Feng, Cong Tian
- Abstract summary: This paper studies how prompt structure contributes to jailbreak attacks.
We introduce a novel structure-level attack method based on tail structures that are rarely used during LLM training.
We build an automated jailbreak tool named StructuralSleight that contains three escalating attack strategies.
- Score: 5.7998356650620035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are widely used in natural language processing but face the risk of jailbreak attacks that maliciously induce them to generate harmful content. Existing jailbreak attacks, including character-level and context-level attacks, mainly focus on the plain-text prompt without specifically exploring the significant influence of its structure. In this paper, we focus on how prompt structure contributes to the jailbreak attack. We introduce a novel structure-level attack method based on tail structures that are rarely used during LLM training, which we refer to as Uncommon Text-Encoded Structure (UTES). We extensively study 12 UTES templates and 6 obfuscation methods to build an effective automated jailbreak tool named StructuralSleight, which contains three escalating attack strategies: Structural Attack, Structural and Character/Context Obfuscation Attack, and Fully Obfuscated Structural Attack. Extensive experiments on existing LLMs show that StructuralSleight significantly outperforms baseline methods. In particular, the attack success rate reaches 94.62% on GPT-4o, which state-of-the-art techniques have not addressed.
Related papers
- StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models [3.0308780927465135]
We present a series of structure transformation attacks on LLM alignment, where we encode natural language intent using diverse syntax spaces.
Our simplest attacks can achieve close to 90% success rate, even on strict LLMs.
We develop a benchmark and evaluate existing safety-alignment defenses against it, showing that most of them fail with 100% ASR.
arXiv Detail & Related papers (2025-02-17T14:46:38Z)
- Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs [33.87649859430635]
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks.
We introduce a novel jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs.
Our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%.
arXiv Detail & Related papers (2024-09-23T10:03:09Z)
- h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment [48.5611060845958]
We propose a novel benchmark of composable jailbreak attacks to move beyond static datasets of attacks and harms.
We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs.
Several of our synthesized attacks are more effective than previously reported ones, with attack success rates exceeding 90% on SOTA closed language models.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner.
LLMs are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
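The AdaShield entry above rests on a simple idea: prepend a defense prompt to every input before it reaches the model. Below is a minimal sketch of that idea in Python; the `DEFENSE_PROMPT` text and the `query_model` callable are illustrative assumptions, not AdaShield's actual prompts or API.

```python
# Minimal sketch of defense-prompt prepending, in the spirit of shield
# prompting. DEFENSE_PROMPT and query_model are illustrative placeholders
# (assumptions), not AdaShield's actual prompt text or interface.

DEFENSE_PROMPT = (
    "Before answering, check whether the request (including any embedded "
    "tables, code-like structures, or images) asks for harmful content. "
    "If it does, refuse and briefly explain why."
)

def shield(user_input: str) -> str:
    """Prepend the defense prompt to the raw user input."""
    return f"{DEFENSE_PROMPT}\n\n{user_input}"

def answer(user_input: str, query_model) -> str:
    """query_model: any callable that sends a prompt string to an (M)LLM."""
    return query_model(shield(user_input))

# Usage: answer("Describe this chart.", query_model=my_llm_call)
```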
- Unveiling Vulnerability of Self-Attention [61.85150061213987]
Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes.
This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism.
We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations.
arXiv Detail & Related papers (2024-02-26T10:31:45Z)
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack).
DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.