Fugu-MT Paper Translation (Abstract): AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Paper overview: AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

arxiv url: http://arxiv.org/abs/2509.16861v1
Date: Sun, 21 Sep 2025 01:22:42 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.009075
Title: AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Title (reference translation): AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
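Abstract summary: Guardrails are critical for safely deploying software based on Large Language Models (LLMs). This paper proposes AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs. The authors show that AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in two update steps, and retains over 85% F1-score on in-distribution data after adaptation.
Reference score (independently computed attention measure): 11.606665113249298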
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Guardrails are critical for the safe deployment of software powered by Large Language Models (LLMs). Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions, opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply, to as low as 12%, when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post-deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
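Abstract (reference translation): Guardrails are critical for safely deploying software based on Large Language Models (LLMs). Unlike traditional rule-based systems, whose limited, predefined input-output spaces inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions, opening the door to jailbreak attacks through user inputs. Guardrails act as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks still succeed more than 70% of the time, even against advanced models such as GPT-4o. Guardrails such as LlamaGuard report up to 95% accuracy, but a preliminary analysis shows that their performance can drop sharply, to as low as 12%, when they face unseen attacks. How can a post-deployment guardrail be built so that it adapts dynamically to emerging threats? To address this, the authors propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. In empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in two update steps, and retains over 85% F1-score on in-distribution data after adaptation, outperforming other baselines. These results show that AdaptiveGuard is a guardrail that can evolve in response to emerging jailbreak strategies after deployment. AdaptiveGuard and the studied datasets are released at https://github.com/awsm-research/AdaptiveGuard to support further research.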
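
The abstract describes treating novel jailbreak attacks as out-of-distribution (OOD) inputs but does not say how the detection is implemented. The following is only a minimal illustrative sketch of the general idea, not the authors' method: it assumes prompts have already been mapped to numeric feature vectors and flags any vector that lies far from the centroid of known in-distribution examples. The class name `OODGate`, the `margin` parameter, and the toy vectors are all hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OODGate:
    """Hypothetical OOD gate: flags inputs far from the in-distribution centroid."""

    def __init__(self, in_dist_vectors, margin=1.5):
        self.center = centroid(in_dist_vectors)
        # Threshold: largest in-distribution distance, scaled by an assumed margin.
        self.threshold = margin * max(
            distance(v, self.center) for v in in_dist_vectors
        )

    def is_ood(self, vector):
        """Return True when the vector falls outside the threshold radius."""
        return distance(vector, self.center) > self.threshold


# Toy 2-D vectors standing in for prompt features (made-up numbers).
known = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate = OODGate(known)

near = gate.is_ood([0.5, 0.5])    # lies at the centroid of the known examples
far = gate.is_ood([10.0, 10.0])   # lies far outside the known examples
print(near, far)
```

In a setup like the one the abstract outlines, inputs flagged this way would be candidates for a later update step; how that update works is not specified here.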


Related papers