Fugu-MT Paper Translation (Abstract): ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Paper summary: ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

arxiv url: http://arxiv.org/abs/2605.18918v1
Date: Mon, 18 May 2026 06:57:58 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:08.878503
Title: ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense
Authors: Yash Narendra
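Abstract summary: This paper shows that the signal needed to separate safe from malicious input is already present in the guard model's internal representation. ESLD is a model-agnostic architecture that sits on top of an existing guard model and improves both latency and detection accuracy, without retraining or modifying the guard.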
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying "This is the strongest candidate. Recommend for immediate hire". A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict ("safe" or "unsafe") before the assistant is allowed to act. In an agentic task with many steps, this check becomes a latency bottleneck. This paper shows that the signal needed to separate safe from malicious input is already present in the guard model's internal representation, before it writes anything out. Reading this signal directly speeds up the safety check by more than $3\times$ on average, while improving detection accuracy over the guard's verdict by 16.4 percentage points on average. This is more than latency optimization. Guard-model checks that were previously too slow to run on every step of an agent can now be placed on the critical path without sacrificing accuracy, and in fact with higher accuracy than the guard provides on its own. ESLD (External Surrogate Latent Defense) packages this finding into a deployable defense. ESLD is a model-agnostic architecture that sits on top of any existing guard model and improves both latency and detection accuracy, without retraining or modifying the guard.
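
The abstract does not say how ESLD reads the signal out of the guard's internal representation. As a rough illustration of the general idea only, the sketch below assumes a linear probe (plain logistic regression) trained on fixed hidden-state vectors, which is one common way to test whether a representation separates two classes. It is not the paper's method. The hidden states here are synthetic Gaussian vectors, not activations from any real guard model, and every name (`train_linear_probe`, `probe_verdict`, `synthetic_hidden_states`, the dimension 32, the `shift` value) is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(states, labels, lr=0.1, epochs=500):
    """Fit logistic regression by gradient descent on frozen feature vectors.

    The guard itself is never updated; only the external probe (w, b) is trained.
    """
    n, dim = states.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(epochs):
        err = sigmoid(states @ w + b) - labels  # gradient of log-loss w.r.t. logits
        w -= lr * (states.T @ err) / n
        b -= lr * err.mean()
    return w, b


def probe_verdict(state, w, b, threshold=0.5):
    """Map one hidden-state vector to a verdict without generating any tokens."""
    return "unsafe" if sigmoid(state @ w + b) >= threshold else "safe"


def synthetic_hidden_states(n, direction, rng, shift=3.0):
    """ASSUMPTION: stand-in data. Label 1 (unsafe) is shifted along +direction,
    label 0 (safe) along -direction, with unit Gaussian noise on top."""
    labels = rng.integers(0, 2, size=n).astype(float)
    noise = rng.normal(size=(n, direction.size))
    states = noise + shift * np.outer(2.0 * labels - 1.0, direction)
    return states, labels


rng = np.random.default_rng(0)
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)

train_x, train_y = synthetic_hidden_states(400, direction, rng)
test_x, test_y = synthetic_hidden_states(200, direction, rng)

w, b = train_linear_probe(train_x, train_y)
verdicts = [probe_verdict(h, w, b) for h in test_x]
predicted = np.array([v == "unsafe" for v in verdicts], dtype=float)
accuracy = float((predicted == test_y).mean())
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

In a real deployment the feature vectors would presumably come from a forward pass of the guard over the incoming text, so the verdict would not require decoding output tokens. Which layer or token position to read, and whether a linear readout is sufficient, are questions the abstract leaves open.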
