Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
- URL: http://arxiv.org/abs/2602.05386v2
- Date: Fri, 06 Feb 2026 13:03:26 GMT
- Title: Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
- Authors: Zhenxiong Yu, Zhi Yang, Zhiheng Jin, Shuhe Wang, Heng Zhang, Yanlin Fei, Lingfeng Zeng, Fangqi Lou, Shuo Zhang, Tu Hu, Jingping Liu, Rongze Chen, Xingyu Zhu, Kunyi Wang, Chaofa Yuan, Xin Guo, Zhaowei Liu, Feipeng Zhang, Jie Huang, Huacan Wang, Ronghao Chen, Liwen Zhang,
- Abstract summary: We argue that effective agent security should be intrinsic and selective rather than architecturally decoupled and mandatory.<n>We propose Spider-Sense framework, which allows agents to maintain latent vigilance and trigger defenses only upon risk perception.<n>Spider-Sense achieves competitive or superior defense performance, attaining the lowest Attack Success Rate (ASR) and False Positive Rate (FPR)
- Score: 23.066685616914807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefined stages of the agent lifecycle. In this work, we argue that effective agent security should be intrinsic and selective rather than architecturally decoupled and mandatory. We propose Spider-Sense framework, an event-driven defense framework based on Intrinsic Risk Sensing (IRS), which allows agents to maintain latent vigilance and trigger defenses only upon risk perception. Once triggered, the Spider-Sense invokes a hierarchical defence mechanism that trades off efficiency and precision: it resolves known patterns via lightweight similarity matching while escalating ambiguous cases to deep internal reasoning, thereby eliminating reliance on external models. To facilitate rigorous evaluation, we introduce S$^2$Bench, a lifecycle-aware benchmark featuring realistic tool execution and multi-stage attacks. Extensive experiments demonstrate that Spider-Sense achieves competitive or superior defense performance, attaining the lowest Attack Success Rate (ASR) and False Positive Rate (FPR), with only a marginal latency overhead of 8.3\%.
Related papers
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction [24.416258744287166]
ICON is a probing-to-mitigation framework that neutralizes attacks while preserving task continuity.<n>ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain.
arXiv Detail & Related papers (2026-02-24T09:13:05Z) - SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment [43.86865924673546]
We propose SafeThinker, an adaptive framework that allocates defensive resources via a lightweight gateway classifier.<n> Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising robustness.
arXiv Detail & Related papers (2026-01-23T07:12:53Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents [1.014002853673217]
LLM agents are vulnerable to Indirect Prompt Injection (IPI) attacks.<n>IPI attacks hijack agent behavior by polluting external information sources.<n>We propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision.
arXiv Detail & Related papers (2025-12-07T08:11:19Z) - Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? [58.48689960350828]
We show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility.<n>We employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer)
arXiv Detail & Related papers (2025-10-06T18:09:02Z) - AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning [78.5751183537704]
AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents.<n>Rather than relying on external guards, AdvEvo-MARL jointly optimize attackers and defenders.
arXiv Detail & Related papers (2025-10-02T02:06:30Z) - MISLEADER: Defending against Model Extraction with Ensembles of Distilled Models [56.09354775405601]
Model extraction attacks aim to replicate the functionality of a black-box model through query access.<n>Most existing defenses presume that attacker queries have out-of-distribution (OOD) samples, enabling them to detect and disrupt suspicious inputs.<n>We propose MISLEADER, a novel defense strategy that does not rely on OOD assumptions.
arXiv Detail & Related papers (2025-06-03T01:37:09Z) - RADEP: A Resilient Adaptive Defense Framework Against Model Extraction Attacks [6.6680585862156105]
We introduce a Resilient Adaptive Defense Framework for Model Extraction Attack Protection (RADEP)<n>RADEP employs progressive adversarial training to enhance model resilience against extraction attempts.<n> Ownership verification is enforced through embedded watermarking and backdoor triggers.
arXiv Detail & Related papers (2025-05-25T23:28:05Z) - ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast \& Slow Reasoning for Robust Agent Defense [12.836334933428738]
Existing defenses rely on "Safety Checks", which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors.<n>We propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning)<n>ALRPHFS consists of two core components: (1) an offline adversarial self-learning loop to iteratively refine a generalizable and balanced library of risk patterns, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency.
arXiv Detail & Related papers (2025-05-25T18:31:48Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.<n>We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.<n> Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.