A Causal Perspective for Enhancing Jailbreak Attack and Defense
- URL: http://arxiv.org/abs/2602.04893v1
- Date: Sat, 31 Jan 2026 15:20:13 GMT
- Title: A Causal Perspective for Enhancing Jailbreak Attack and Defense
- Authors: Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren,
- Abstract summary: We propose a framework that integrates large language models into data-driven causal discovery.<n>We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven language models.<n>Our analysis reveals that specific features, such as "Positive Character" and "Number of Task Steps", act as direct causal drivers of jailbreaks.
- Score: 29.669194815878768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uncovering the mechanisms behind "jailbreaks" in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data-driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human-readable prompt features. By jointly training LLM-based prompt encoding and GNN-based causal graph learning, we reconstruct causal pathways linking prompt features to jailbreak responses. Our analysis reveals that specific features, such as "Positive Character" and "Number of Task Steps", act as direct causal drivers of jailbreaks. We demonstrate the practical utility of these insights through two applications: (1) a Jailbreaking Enhancer that targets identified causal features to significantly boost attack success rates on public benchmarks, and (2) a Guardrail Advisor that utilizes the learned causal graph to extract true malicious intent from obfuscated queries. Extensive experiments, including baseline comparisons and causal structure validation, confirm the robustness of our causal analysis and its superiority over non-causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com/Master-PLC/Causal-Analyst.
Related papers
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models [2.6140509675507384]
We study jailbreaking from both security and interpretability perspectives.<n>We propose a tensor-based latent representation framework that captures structure in hidden activations.<n>Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures.
arXiv Detail & Related papers (2026-02-12T02:43:17Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - BreakFun: Jailbreaking LLMs via Schema Exploitation [0.28647133890966986]
We investigate how Large Language Models (LLMs) can be turned into critical weaknesses.<n>This vulnerability is highly transferable, achieving an average success rate of 89% across 13 models.<n>A secondary LLM performs a "Literal Transcription" to isolate and reveal the user's true harmful intent.
arXiv Detail & Related papers (2025-10-19T11:27:44Z) - Machine Learning for Detection and Analysis of Novel LLM Jailbreaks [3.2654923574107357]
Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text.<n>These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies.<n>In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses.
arXiv Detail & Related papers (2025-10-02T03:55:29Z) - LLM Jailbreak Detection for (Almost) Free! [62.466970731998714]
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks.<n>Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences.<n>We propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts.
arXiv Detail & Related papers (2025-09-18T02:42:52Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent.<n> ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models [0.995531157345459]
Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education.<n>This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform.<n>We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior.
arXiv Detail & Related papers (2025-04-21T16:54:35Z) - xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [6.392966062933521]
We introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods.<n>We train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success.<n>To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction.
arXiv Detail & Related papers (2024-11-02T17:29:47Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.