Fugu-MT Paper Translation (Summary): Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Paper summary: Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

arxiv url: http://arxiv.org/abs/2602.08062v1
Date: Sun, 08 Feb 2026 17:11:33 GMT
Status: Translation complete
System update date: 2026-02-10 20:26:24.962067
Title: Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation
Authors: Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini
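Abstract summary: Black-box moderation APIs offer limited transparency and adapt poorly to evolving threats. White-box approaches that use large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. This paper presents BAGEL, a modular, lightweight, and incrementally updatable framework.
Reference score (independently computed attention score): 4.467773944156384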
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap aggregation and mixture of expert inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming OpenAI Moderation API and ShieldGemma which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show ensembles of small finetuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
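The inference procedure described in the abstract (a router picks the most suitable ensemble member, further members are sampled stochastically, and their predictions are aggregated) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_members` and `detect`, the router-probability-weighted sampling, the mean aggregation, and the 0.5 threshold are all assumptions made for illustration, and the router is a plain callable standing in for the paper's random forest.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical types: a member maps a prompt to P(malicious);
# a router maps a prompt to one suitability score per member.
Member = Callable[[str], float]
Router = Callable[[str], Sequence[float]]

PARAMS_PER_MEMBER = 86_000_000  # classifier size stated in the abstract


def select_members(router_probs: Sequence[float], k: int,
                   rng: random.Random) -> List[int]:
    """Take the router's top member, then sample k-1 more without
    replacement, weighted by router score (an assumed sampling scheme)."""
    k = min(k, len(router_probs))
    top = max(range(len(router_probs)), key=lambda i: router_probs[i])
    chosen = [top]
    remaining = [i for i in range(len(router_probs)) if i != top]
    while len(chosen) < k:
        # Small epsilon keeps every weight positive for random.choices.
        weights = [router_probs[i] + 1e-9 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def detect(prompt: str, router: Router, members: Sequence[Member],
           k: int = 5, threshold: float = 0.5, seed: int = 0) -> bool:
    """Flag a prompt as malicious if the mean score of the selected
    members reaches the threshold (mean aggregation is an assumption)."""
    rng = random.Random(seed)
    chosen = select_members(router(prompt), k, rng)
    score = sum(members[i](prompt) for i in chosen) / len(chosen)
    return score >= threshold


# Parameter budget for 5 selected members, matching the abstract's 430M.
active_params = 5 * PARAMS_PER_MEMBER
```

Under this sketch, an incremental update amounts to appending one more fine-tuned member to `members` and extending the router's output by one score; how the paper retrains its router after an update is not stated in the abstract.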
