Related papers: Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

URL: http://arxiv.org/abs/2508.19292v1
Date: Mon, 25 Aug 2025 14:16:30 GMT
Title: Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
Authors: Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu,
Abstract summary: Large language models (LLMs) generate human-aligned content under certain safety constraints.<n>The textbfJailExpert framework is the first to achieve a formal representation of experience structure.<n>JailExpert achieves an average increase of 17% in attack success rate and 2.7 times improvement in attack efficiency.
Score: 36.525169416008886
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) generate human-aligned content under certain safety constraints. However, the current known technique ``jailbreak prompt'' can circumvent safety-aligned measures and induce LLMs to output malicious content. Research on Jailbreaking can help identify vulnerabilities in LLMs and guide the development of robust security frameworks. To circumvent the issue of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to facilitate more automated jailbreak attacks. However, these methods face two challenges: inefficiency and repetitive optimization, as they overlook the value of past attack experiences. To better integrate past attack experiences to assist current jailbreak attempts, we propose the \textbf{JailExpert}, an automated jailbreak framework, which is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support the dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to the current state-of-the-art black-box jailbreak methods, JailExpert achieves an average increase of 17\% in attack success rate and 2.7 times improvement in attack efficiency. Our implementation is available at \href{https://github.com/xiZAIzai/JailExpert}{XiZaiZai/JailExpert}

Related papers

Activation-Guided Local Editing for Jailbreaking Attacks [33.13949817155855]
Token-level jailbreak attacks often produce incoherent or unreadable inputs.<n> prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity.<n>We propose a concise and effective two-stage framework that combines the advantages of these approaches.
arXiv Detail & Related papers (2025-08-01T11:52:24Z)
MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs [14.530593083777502]
We propose MetaCipher, a low-cost, multi-agent jailbreak framework.<n>Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks.
arXiv Detail & Related papers (2025-06-27T18:15:56Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs [11.924542310342282]
We present JailPO, a novel black-box jailbreak framework to examine Large Language Models (LLMs) alignment.<n>For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts.<n>We also introduce a preference optimization-based attack method to enhance the jailbreak effectiveness.
arXiv Detail & Related papers (2024-12-20T07:29:10Z)
SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content.<n>In this paper, we introduce a novel jailbreak method, which targets the external properties of LLMs.<n>By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models [15.582860145268553]
JailFuzzer is a novel fuzzing framework driven by large language model (LLM) agents.<n>It generates natural and semantically coherent prompts, reducing the likelihood of detection by traditional defenses.<n>It achieves a high success rate in jailbreak attacks with minimal query overhead.
arXiv Detail & Related papers (2024-08-01T12:54:46Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.