AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
- URL: http://arxiv.org/abs/2602.22724v1
- Date: Thu, 26 Feb 2026 07:59:10 GMT
- Title: AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
- Authors: Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, Hongxin Hu,
- Abstract summary: We propose a novel inference-time detection and mitigation framework for large language model (LLM) agents.<n>AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover.<n>We evaluate AgentSentry on the textscAgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs.
- Score: 25.817251923574286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker-controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike prompt-based attacks, IPI unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. Existing inference-time defenses primarily rely on heuristic detection and conservative blocking of high-risk actions, which can prematurely terminate workflows or broadly suppress tool usage under ambiguous multi-turn scenarios. We propose AgentSentry, a novel inference-time detection and mitigation framework for tool-augmented LLM agents. To the best of our knowledge, AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via controlled counterfactual re-executions at tool-return boundaries and enables safe continuation through causally guided context purification that removes attack-induced deviations while preserving task-relevant evidence. We evaluate AgentSentry on the \textsc{AgentDojo} benchmark across four task suites, three IPI attack families, and multiple black-box LLMs. AgentSentry eliminates successful attacks and maintains strong utility under attack, achieving an average Utility Under Attack (UA) of 74.55 %, improving UA by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign performance.
Related papers
- ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction [24.416258744287166]
ICON is a probing-to-mitigation framework that neutralizes attacks while preserving task continuity.<n>ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain.
arXiv Detail & Related papers (2026-02-24T09:13:05Z) - SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills.<n>The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills and an Evaluate Agent that logs action traces.<n>Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z) - CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution [49.689452243966315]
AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks.<n>We propose CausalArmor, a selective defense framework that computes lightweight, leave-one-out attributions at privileged decision points.<n> Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses.
arXiv Detail & Related papers (2026-02-08T11:34:08Z) - ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback [53.2744585868162]
Monitoring step-level tool invocation behaviors in real time is critical for agent deployment.<n>We first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents.<n>We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning.
arXiv Detail & Related papers (2026-01-15T07:54:32Z) - BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents [58.83028403414688]
Large language model (LLM) agents execute tasks through multi-step workflow that combine planning, memory, and tool use.<n>Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs.<n>We propose textbfBackdoorAgent, a modular and stage-aware framework that provides a unified agent-centric view of backdoor threats in LLM agents.
arXiv Detail & Related papers (2026-01-08T03:49:39Z) - Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks [8.419049623790618]
This work analyzes three classes of semantic attacks on MCP-integrated systems.<n>We introduce a layered security framework with three components: RSA-based manifest signing to enforce descriptor integrity, LLM-on-LLM semantic vetting to detect suspicious tool definitions, and lightweight guardrails that block anomalous tool behavior at runtime.<n>Our results show that the proposed framework reduces unsafe tool invocation rates without model fine-tuning or internal modification.
arXiv Detail & Related papers (2025-12-06T20:07:58Z) - Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning [89.1856483797116]
We introduce BEAT, the first framework to inject visual backdoors into MLLM-based embodied agents.<n>Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably.<n>BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance.
arXiv Detail & Related papers (2025-10-31T16:50:49Z) - IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents [33.775221377823925]
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks.<n>When interacting with untrusted data sources, tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes.<n>We propose a novel defensive task execution paradigm, called IPIGuard, to prevent malicious tool invocations at the source.
arXiv Detail & Related papers (2025-08-21T07:08:16Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [3.5248694676821484]
We introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks.
InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools.
We show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time.
arXiv Detail & Related papers (2024-03-05T06:21:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.