SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
- URL: http://arxiv.org/abs/2509.26345v1
- Date: Tue, 30 Sep 2025 14:50:59 GMT
- Title: SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
- Authors: Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
- Abstract summary: Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. We propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans.
- Score: 27.607151919652267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual-manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.
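To make the three-stage flow concrete, below is a minimal Python sketch of a SafeBehavior-style pipeline. It is an illustration under assumptions, not the paper's implementation: the `llm` callable, the judge prompts, the `parse_judgment` helper, and the confidence threshold `tau` are hypothetical stand-ins for whatever the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model wrapper

@dataclass
class Judgment:
    label: str         # "safe" or "unsafe"
    confidence: float  # judge's self-reported confidence in [0, 1]

def parse_judgment(reply: str) -> Judgment:
    # Hypothetical minimal parser: expects replies like "unsafe 0.9".
    label, conf = reply.split()[:2]
    return Judgment(label=label.lower(), confidence=float(conf))

def intention_inference(llm: LLM, prompt: str) -> Judgment:
    """Stage 1: screen the input itself for obvious malicious intent."""
    return parse_judgment(llm(
        "Label the intent of this request as 'safe' or 'unsafe' and give "
        "a confidence in [0, 1], e.g. 'unsafe 0.9'.\n"
        f"Request: {prompt}"))

def self_introspection(llm: LLM, prompt: str, draft: str) -> Judgment:
    """Stage 2: judge the drafted response, not just the prompt."""
    return parse_judgment(llm(
        "Does the response below violate safety policy? Answer 'safe' or "
        "'unsafe' plus a confidence in [0, 1].\n"
        f"Request: {prompt}\nResponse: {draft}"))

def self_revision(llm: LLM, prompt: str, draft: str) -> str:
    """Stage 3: rewrite an uncertain draft, preserving benign intent."""
    return llm(
        "Rewrite the response to keep its helpful content while removing "
        "anything unsafe.\n"
        f"Request: {prompt}\nResponse: {draft}")

def safe_behavior(llm: LLM, prompt: str, tau: float = 0.8) -> str:
    refusal = "I can't help with that."
    intent = intention_inference(llm, prompt)
    if intent.label == "unsafe" and intent.confidence >= tau:
        return refusal                              # obvious input risk
    draft = llm(prompt)                             # generate normally
    check = self_introspection(llm, prompt, draft)
    if check.confidence >= tau:                     # confident judgment
        return draft if check.label == "safe" else refusal
    return self_revision(llm, prompt, draft)        # uncertain: revise
```

Note how the confidence threshold drives the adaptivity the abstract describes: confident judgments short-circuit to an answer or a refusal, and only uncertain drafts pay for the extra revision call.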
Related papers
- RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic [56.38397499463889]
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks. However, they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. We propose RoboSafe, a runtime safeguard for embodied agents through executable predicate-based safety logic.
arXiv Detail & Related papers (2025-12-24T15:01:26Z) - Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction [51.50282796099369]
This paper develops a multi-dimensional instruction uncertainty reduction framework to generate semantically constrained adversarial examples. By predicting the language-guided sampling process, the designed ResAdv-DDIM sampler stabilizes the optimization. This work realizes reference-free generation of semantically constrained 3D adversarial examples for the first time.
arXiv Detail & Related papers (2025-10-27T04:02:52Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses these vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models [0.0]
Camouflaged jailbreaking embeds malicious intent within seemingly benign language to evade existing safety mechanisms. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods.
arXiv Detail & Related papers (2025-09-05T19:57:38Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities, yet they remain vulnerable to malicious instructions that can bypass safety constraints. We propose ARMOR, a reasoning-based safety alignment framework that replaces the ad hoc chain-of-thought reasoning process with a human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize them. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention [14.509085965856643]
We propose SafeIntervention (SafeInt), a novel defense method that shields large language models from jailbreak attacks. Building on our analysis of the representations of jailbreak samples, the core idea of SafeInt is to relocate jailbreak-related representations into the rejection region. We conduct comprehensive experiments covering six jailbreak attacks, two jailbreak datasets, and two utility benchmarks. (A generic sketch of this kind of representation intervention appears after this list.)
arXiv Detail & Related papers (2025-02-21T17:12:35Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety while maintaining the original performance.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
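As flagged in the SafeInt entry above, one family of defenses intervenes directly on hidden representations. The PyTorch sketch below illustrates that general idea, steering activations that a probe flags as jailbreak-like toward a refusal direction. It is not SafeInt's actual method; the probe weights, refusal direction, layer index, and steering strength `alpha` are all assumptions for illustration.

```python
import torch

def make_intervention_hook(probe_w: torch.Tensor,
                           refusal_dir: torch.Tensor,
                           alpha: float = 4.0,
                           threshold: float = 0.0):
    """Build a forward hook for one transformer layer.

    probe_w:     (d,) linear probe separating jailbreak-like from benign
                 hidden states (assumed to be fit offline)
    refusal_dir: (d,) unit vector pointing toward the rejection region
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        scores = hidden @ probe_w                      # (batch, seq)
        flagged = (scores > threshold).unsqueeze(-1)   # jailbreak-like?
        # Shift only the flagged positions along the refusal direction.
        steered = hidden + alpha * flagged * refusal_dir
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage with a Hugging Face-style decoder (the layer index
# is an assumption; real methods select layers empirically):
# handle = model.model.layers[14].register_forward_hook(
#     make_intervention_hook(probe_w, refusal_dir))
# outputs = model.generate(**inputs)
# handle.remove()
```

A hook like this is training-free at inference time; the real work in such methods lies in fitting the probe and choosing the direction and layers, which the sketch takes as given.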