Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection
- URL: http://arxiv.org/abs/2510.13893v1
- Date: Tue, 14 Oct 2025 12:34:41 GMT
- Title: Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection
- Authors: Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi,
- Abstract summary: Jailbreaking techniques pose a significant threat to the safety of Large Language Models.<n>To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge.<n>We developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families.
- Score: 1.8374319565577155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
Related papers
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models [2.6140509675507384]
We study jailbreaking from both security and interpretability perspectives.<n>We propose a tensor-based latent representation framework that captures structure in hidden activations.<n>Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures.
arXiv Detail & Related papers (2026-02-12T02:43:17Z) - Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks [57.08407099520887]
Defending against novel jailbreaks represents a critical challenge in AI safety.<n>This paper proposes a new paradigm for improving robustness against unseen jailbreaks.
arXiv Detail & Related papers (2025-10-24T17:37:25Z) - Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is [8.214994509812724]
Large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks.<n>This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms.<n>We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models.
arXiv Detail & Related papers (2025-07-29T13:55:23Z) - A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models [6.946931840176725]
This work specifically focuses on the challenge of jailbreak vulnerabilities.<n>It introduces a novel taxonomy of jailbreak attacks grounded in the training domains of large language models.
arXiv Detail & Related papers (2025-04-07T12:05:16Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [8.485286811635557]
We introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods.<n>We train probes to classify successful from unsuccessful jailbreaks using the latent representations corresponding to prompt tokens.<n>This reveals that different jailbreaking strategies exploit different non-linear, non-universal features.
arXiv Detail & Related papers (2024-11-02T17:29:47Z) - Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models [0.0]
We propose a novel black-box jailbreak attacking framework that incorporates various LLM-as-Attacker methods.<n>Our method is designed based on three key observations from existing jailbreaking studies and practices.
arXiv Detail & Related papers (2024-10-31T01:55:33Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks.
"jailbreaking" induces the model to generate malicious responses against the usage policy and society.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z) - JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks [34.95274579737075]
JailGuard is a universal detection framework for prompt-based attacks across text and image modalities.<n>It operates on the principle that attacks are inherently less robust than benign ones.<n>It achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
arXiv Detail & Related papers (2023-12-17T17:02:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.