Related papers: EnJa: Ensemble Jailbreak on Large Language Models

EnJa: Ensemble Jailbreak on Large Language Models

URL: http://arxiv.org/abs/2408.03603v1
Date: Wed, 7 Aug 2024 07:46:08 GMT
Title: EnJa: Ensemble Jailbreak on Large Language Models
Authors: Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang,
Abstract summary: Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
Score: 69.13666224876408
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks -- malicious prompts that can disable the safety mechanism of LLMs -- has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

Related papers

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning [48.100552417137656]
PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions.<n>We conduct extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
arXiv Detail & Related papers (2025-09-28T01:38:00Z)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models [18.06388944779541]
"jailbreaking" is the use of large language models to trigger unintended behaviors.<n>We propose a novel method to balance the jailbreak attack success rate with semantic coherence.<n>Our method is superior to state-of-the-art baselines in attack effectiveness.
arXiv Detail & Related papers (2025-02-17T02:49:26Z)
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.04296423547049]
Large Language Models (LLMs) are widely applied in various domains. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts. Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models [0.0]
We propose a novel black-box jailbreak attacking framework that incorporates various LLM-as-Attacker methods. Our method is designed based on three key observations from existing jailbreaking studies and practices.
arXiv Detail & Related papers (2024-10-31T01:55:33Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or leveraging complicated algorithms to craft prompts. We introduce BOOST, a simple attack that leverages only the eos tokens. Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs)<n>Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs. Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates. We discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)
Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs) A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP) Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks. TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the prompts. TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.