Fugu-MT 論文翻訳(概要): CourtGuard: A Local, Multiagent Prompt Injection Classifier

論文の概要: CourtGuard: A Local, Multiagent Prompt Injection Classifier

arxiv url: http://arxiv.org/abs/2510.19844v1
Date: Mon, 20 Oct 2025 20:10:06 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-25 03:08:16.325067
Title: CourtGuard: A Local, Multiagent Prompt Injection Classifier
Title（参考訳）: CourtGuard: ローカルでマルチエージェントのpromptインジェクション分類器
Authors: Isaac Wu, Michael Maslowski,
Abstract要約: プロンプトインジェクション攻撃は、大きな言語モデル(LLM)が機密データを漏洩させ、誤情報を広げ、有害な振る舞いを示す可能性がある。このような攻撃に対して防御するために,ローカルに実行可能なマルチエージェントインジェクションインジェクション分類器であるCourtGuardを提案する。
参考スコア（独自算出の注目度）: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) become integrated into various sensitive applications, prompt injection, the use of prompting to induce harmful behaviors from LLMs, poses an ever increasing risk. Prompt injection attacks can cause LLMs to leak sensitive data, spread misinformation, and exhibit harmful behaviors. To defend against these attacks, we propose CourtGuard, a locally-runnable, multiagent prompt injection classifier. In it, prompts are evaluated in a court-like multiagent LLM system, where a "defense attorney" model argues the prompt is benign, a "prosecution attorney" model argues the prompt is a prompt injection, and a "judge" model gives the final classification. CourtGuard has a lower false positive rate than the Direct Detector, an LLM as-a-judge. However, CourtGuard is generally a worse prompt injection detector. Nevertheless, this lower false positive rate highlights the importance of considering both adversarial and benign scenarios for the classification of a prompt. Additionally, the relative performance of CourtGuard in comparison to other prompt injection classifiers advances the use of multiagent systems as a defense against prompt injection attacks. The implementations of CourtGuard and the Direct Detector with full prompts for Gemma-3-12b-it, Llama-3.3-8B, and Phi-4-mini-instruct are available at https://github.com/isaacwu2000/CourtGuard.
Abstract（参考訳）: 大型言語モデル(LLMs)が様々な敏感なアプリケーションに統合されるにつれて、LSMから有害な振る舞いを誘発するインジェクションのプロンプトの使用は、ますます増加するリスクを引き起こす。プロンプト・インジェクション・アタックは、LSMが機密データを漏洩させ、誤報を拡散させ、有害な行動を示す可能性がある。このような攻撃に対して防御するために,ローカルに実行可能なマルチエージェントインジェクションインジェクション分類器であるCourtGuardを提案する。裁判所のようなマルチエージェントLPMシステムでは、プロンプトが評価され、「防衛弁護士」モデルはプロンプトが良性であると主張し、「検察弁護士」モデルはプロンプトがプロンプト注入であると主張し、「ジャッジ」モデルは最終分類を与える。 CourtGuard の偽陽性率は、LSM as-a-judge である Direct Detector よりも低い。しかし、CourtGuardは一般的に、より悪いインジェクション検出器である。それでも、この低い偽陽性率は、プロンプトの分類において、敵対的シナリオと良性シナリオの両方を考慮することの重要性を強調している。さらに、他のプロンプトインジェクション分類器と比較して、CourtGuardの相対的な性能は、プロンプトインジェクション攻撃に対する防御としてマルチエージェントシステムを使うことを前進させる。 CourtGuardとDirect DetectorのGemma-3-12b-it、Llama-3.3-8B、Phi-4-mini-instructの実装はhttps://github.com/isaacwu2000/CourtGuardで利用可能である。

論文の概要: CourtGuard: A Local, Multiagent Prompt Injection Classifier

関連論文リスト