Fugu-MT Paper Translation (Abstract): AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents

Paper overview: AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents

arxiv url: http://arxiv.org/abs/2606.15057v2
Date: Fri, 19 Jun 2026 04:47:48 GMT
Status: Translation complete
Last updated in system: 2026-06-23 13:41:30.735653
Title: AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents
Authors: Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, Yevgeniy Vorobeychik
Abstract summary: Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. The authors develop AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense.
Reference score (independently computed attention score): 57.34566159148893
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work has proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) prompt-based (using prompting to prevent agents from following malicious instructions), 2) detection-based (identifying and filtering malicious instructions), and 3) system-level (using systems insights, such as control and data isolation, for defense). However, commonly used benchmarks for evaluating defenses, such as AgentDojo, are inherently static, generating a fixed distribution of IPI attacks. Consequently, static benchmarks do not usefully evaluate defense robustness to adaptive threats. We address this issue by developing AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense. Using AutoDojo against state-of-the-art IPI defenses across three task suites and five target models, we make two key observations. First, many defenses offer only limited protection: a cheap, black-box adaptive attack using a frontier LLM to iteratively optimize the injection raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. Against a filter that reduces static ASR to 0%, AutoDojo recovers 28% overall and 64% on action-open tasks. Second, for prompt-level and filter-based defenses, ASR is substantially higher on action-open tasks (where the user's request delegates the action itself to attacker-controlled content) than on precisely specified tasks. This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text. AutoDojo is publicly available at https://github.com/xhOwenMa/AutoDojo.
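The abstract reports attack success rate (ASR) both overall and on action-open tasks. As a rough illustration of how such a split could be computed, the sketch below treats ASR as the fraction of trials in which the injection succeeded and groups trials by task kind. The `Trial` record, the `task_kind` labels, and the toy data are all hypothetical and are not taken from the paper or from the AutoDojo repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record of one attack attempt (not AutoDojo's data model).
    task_kind: str          # assumed labels: "action-open" or "precise"
    attack_succeeded: bool  # True if the injected goal was carried out


def attack_success_rate(trials):
    """Fraction of trials in which the injection succeeded."""
    trials = list(trials)
    if not trials:
        raise ValueError("ASR is undefined for an empty set of trials")
    return sum(t.attack_succeeded for t in trials) / len(trials)


def asr_by_task_kind(trials):
    """ASR computed separately for each task kind."""
    groups = defaultdict(list)
    for trial in trials:
        groups[trial.task_kind].append(trial)
    return {kind: attack_success_rate(group) for kind, group in groups.items()}


# Made-up toy data, chosen only to exercise the functions.
TRIALS = (
    [Trial("action-open", True)] * 3
    + [Trial("action-open", False)] * 1
    + [Trial("precise", True)] * 1
    + [Trial("precise", False)] * 5
)

if __name__ == "__main__":
    print("overall ASR:", attack_success_rate(TRIALS))
    print("ASR by task kind:", asr_by_task_kind(TRIALS))
```

Reporting the per-kind rates next to the overall rate makes a gap between task kinds visible, since an aggregate figure can hide a much higher rate on a small subset of tasks.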
