Related papers: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

URL: http://arxiv.org/abs/2510.08829v1
Date: Thu, 09 Oct 2025 21:32:02 GMT
Title: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
Authors: Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader,
Abstract summary: We present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions.<n>Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs.<n>This approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs.
Score: 17.941502260254673
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.

Related papers

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills.<n>The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills and an Evaluate Agent that logs action traces.<n>Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
Defense Against Indirect Prompt Injection via Tool Result Parsing [5.69701430275527]
LLM agents face an escalating threat from indirect prompt injection.<n>This vulnerability poses a significant risk as agents gain more direct control over physical environments.<n>We propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code.
arXiv Detail & Related papers (2026-01-08T10:21:56Z)
Soft Instruction De-escalation Defense [36.36851291734834]
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment.<n>This makes them susceptible to prompt injections when dealing with untrusted data.<n>We propose SIC-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents.
arXiv Detail & Related papers (2025-10-24T00:04:07Z)
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [92.26240528996443]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z)
Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations [10.746349111023964]
We introduce a novel approach that injects the IH signal into the intermediate token representations within the network.<n>Our method augments these representations with layer-specific trainable embeddings that encode the privilege information.<n>Our evaluations across multiple models and training methods reveal that our proposal yields between $1.6times$ and $9.2times$ reduction in attack success rate.
arXiv Detail & Related papers (2025-05-25T00:01:39Z)
Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks.<n>InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions.<n>We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.<n>As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise.<n>Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment [35.344406718760574]
A prompt injection attack aims to make an Large Language Model follow an injected prompt to perform an attacker-chosen task.<n>Existing attacks primarily focus on crafting these injections at inference time, treating the LLM itself as a static target.<n>In this work, we introduce a more foundational attack vector: poisoning the LLM's alignment process to amplify the success of future prompt injection attacks.
arXiv Detail & Related papers (2024-10-18T18:52:16Z)
Automatic and Universal Prompt Injection Attacks against Large Language Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.