Related papers: Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

URL: http://arxiv.org/abs/2508.03054v1
Date: Tue, 05 Aug 2025 03:58:15 GMT
Title: Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning
Authors: Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang,
Abstract summary: Defending large language models against jailbreak attacks is essential for their safe and reliable deployment.<n>We propose the Cognitive-Driven Defense framework, which targets the underlying structure of jailbreak prompts by applying meta-operations.<n> Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks.
Score: 12.2605782566148
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent.CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks.

Related papers

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring [13.497048408038935]
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks.<n>Current anomaly-detection methods tend to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection.<n>We propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations.
arXiv Detail & Related papers (2025-12-12T22:31:38Z)
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations [0.9732319879728966]
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior.<n>This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions.
arXiv Detail & Related papers (2025-11-24T09:38:11Z)
KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs [22.335638814557004]
We propose a Knowledge Graph Defense Framework (KG-DF) for large language models (LLMs)<n>Because of its structured knowledge representation and semantic association capabilities, Knowledge Graph(KG) can be searched by associating input content with safe knowledge in the knowledge base.<n>We introduce an semantic parsing module, whose core task is to transform the input query into a set of structured and secure concept representations.
arXiv Detail & Related papers (2025-11-09T14:39:40Z)
Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection [1.8374319565577155]
Jailbreaking techniques pose a significant threat to the safety of Large Language Models.<n>To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge.<n>We developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families.
arXiv Detail & Related papers (2025-10-14T12:34:41Z)
Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses [6.736255552371404]
Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks.<n>Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG)
arXiv Detail & Related papers (2025-05-21T16:43:17Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs [4.534938642552179]
ShieldLearner is a novel paradigm that mimics human learning in defense.<n>Through trial and error, it autonomously distills attack signatures into a Pattern Atlas.<n> Adaptive Adversarial Augmentation generates adversarial variations of successfully defended prompts.
arXiv Detail & Related papers (2025-02-16T18:47:41Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Jailbreaking? One Step Is Enough! [6.142918017301964]
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs.<n>We propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention.<n>To enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples.
arXiv Detail & Related papers (2024-12-17T07:33:41Z)
Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks. "jailbreaking" induces the model to generate malicious responses against the usage policy and society. We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses. We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Jailbroken: How Does LLM Safety Training Fail? [92.8748773632051]
"jailbreak" attacks on early releases of ChatGPT elicit undesired behavior. We investigate why such attacks succeed and how they can be created. New attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests.
arXiv Detail & Related papers (2023-07-05T17:58:10Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.