Fugu-MT 論文翻訳(概要): CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

論文の概要: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

arxiv url: http://arxiv.org/abs/2510.08829v1
Date: Thu, 09 Oct 2025 21:32:02 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-14 00:38:47.801765
Title: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization
Title（参考訳）: CommandSans: 外科的精密なプロンプト衛生機能を備えたAIエージェントのセキュア化
Authors: Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader,
Abstract要約: 本稿では,データに実行可能命令を含まないという,コンピュータセキュリティの基本原理に着想を得た新しいアプローチを提案する。サンプルレベルの分類の代わりに,ツール出力からAIシステムに指示された指示を外科的に除去するトークンレベルの衛生プロセスを提案する。このアプローチは非ブロッキングであり、キャリブレーションを必要とせず、ツール出力のコンテキストに依存しない。
参考スコア（独自算出の注目度）: 17.941502260254673
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
Abstract（参考訳）: 多数のツールや機密データにアクセス可能なLLMエージェントの採用が増加し、間接的なプロンプトインジェクションの攻撃面が大幅に拡大した。しかし、攻撃の文脈に依存した性質のため、現在の防御は、悪質で良質な指示を確実に区別できないため、しばしば不合理化されている。そこで本研究では,データに実行可能命令を含まないという,コンピュータセキュリティの基本原理に着想を得た新しいアプローチを提案する。サンプルレベルの分類の代わりに、ツール出力からAIシステムに指示された命令を外科的に除去し、悪意のある命令を副産物としてキャプチャするトークンレベルの衛生プロセスを提案する。既存の安全分類器とは対照的に、このアプローチは非ブロッキングであり、校正を必要とせず、ツール出力の文脈に依存しない。さらに,これらのトークンレベルの予測器を手軽に使用可能な命令チューニングデータのみでトレーニングすることが可能であり,課題や他の合成起源からの非現実的なプロンプトインジェクションの例に頼る必要もない。実験の結果、AgentDojo、BIPIA、InjecAgent、ASB、SEPといった幅広い攻撃やベンチマークにおいて、AgentDojoでは7～10倍の攻撃成功率(ASR)を達成し(34%～3%)、悪質な設定でもエージェントユーティリティを損なうことなく、このアプローチが一般化していることが判明した。

論文の概要: CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

関連論文リスト