Fugu-MT 論文翻訳(概要): Defending Against Prompt Injection with DataFilter

論文の概要: Defending Against Prompt Injection with DataFilter

arxiv url: http://arxiv.org/abs/2510.19207v1
Date: Wed, 22 Oct 2025 03:30:49 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-25 03:08:15.027817
Title: Defending Against Prompt Injection with DataFilter
Title（参考訳）: DataFilterによるプロンプト注入に対する防御
Authors: Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner,
Abstract要約: 大規模言語モデル(LLM)エージェントは、タスクの自動化や信頼できない外部データとのインタラクションのために、ますます多くデプロイされている。 LLMがアクセスするデータに悪意のある命令を注入することで、攻撃者は元のユーザタスクを任意にオーバーライドし、意図しない潜在的有害なアクションにエージェントをリダイレクトすることができる。テスト時間モデルに依存しないディフェンスであるDataFilterを提案し、バックエンドのLCMに到達する前にデータから悪意ある命令を除去する。
参考スコア（独自算出の注目度）: 7.1507566575747346
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine-tuning), incur substantial utility loss (detection-based), or demand non-trivial system redesign (system-level). Motivated by this, we propose DataFilter, a test-time model-agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine-tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while preserving benign information. Across multiple benchmarks, DataFilter consistently reduces the prompt injection attack success rates to near zero while maintaining the LLMs' utility. DataFilter delivers strong security, high utility, and plug-and-play deployment, making it a strong practical defense to secure black-box commercial LLMs against prompt injection. Our DataFilter model is released at https://huggingface.co/JoyYizhu/DataFilter for immediate use, with the code to reproduce our results at https://github.com/yizhu-joy/DataFilter.
Abstract（参考訳）: 大きな言語モデル(LLM)エージェントがタスクの自動化や信頼できない外部データとのインタラクションのためにデプロイされるようになると、迅速なインジェクションが重大なセキュリティ上の脅威として現れます。 LLMがアクセスするデータに悪意のある命令を注入することで、攻撃者は元のユーザタスクを任意にオーバーライドし、意図しない潜在的有害なアクションにエージェントをリダイレクトすることができる。既存の防御は、モデルウェイト(微調整)へのアクセス、実質的なユーティリティ損失(検出ベース)、あるいは非自明なシステム再設計(システムレベル)を必要とする。そこで我々はDataFilterを提案する。DataFilterはテスト時間モデルに依存しないディフェンスで、バックエンドのLCMに到達する前にデータから悪意ある命令を除去する。 DataFilterは、シミュレーションインジェクションの教師付き微調整で訓練され、ユーザの命令とデータの両方を利用して、良質な情報を保持しながら、敵対的コンテンツを選択的に削除する。複数のベンチマークで、DataFilterはLSMのユーティリティを維持しながら、インジェクション攻撃の成功率をほぼゼロに抑える。 DataFilterは強力なセキュリティ、高ユーティリティ、プラグイン・アンド・プレイのデプロイメントを提供する。私たちのDataFilterモデルは、すぐに使えるようにhttps://huggingface.co/JoyYizhu/DataFilterでリリースされています。

論文の概要: Defending Against Prompt Injection with DataFilter

関連論文リスト