Fugu-MT Paper Translation (Abstract): ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Paper overview: ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

arxiv url: http://arxiv.org/abs/2601.03600v1
Date: Wed, 07 Jan 2026 05:30:53 GMT
Status: Translation complete
Last updated in system: 2026-01-08 18:12:46.134526
Title: ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
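Title (reference translation): ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
Authors: Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong
Abstract summary: Existing detection methods mainly rely on jailbreak templates that are present in the training data. This paper proposes a layer-wise, module-wise, and token-wise amplification framework. Building on these insights, it introduces ALERT, an efficient zero-shot jailbreak detector.
Reference score (independently computed attention score): 47.135407245022115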
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.
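Abstract (reference translation): Despite a rich set of safety alignment strategies, large language models (LLMs) remain susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status by relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, in which no jailbreak templates are available during training. This setting reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies the internal feature discrepancies between benign prompts and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building on these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that applies two independent yet complementary classifiers to the amplified representations. Extensive experiments on three safety benchmarks show that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average accuracy and F1-score, and sometimes by up to 40%.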
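
Illustrative sketch (not from the paper): the abstract says the framework uncovers safety-relevant layers by magnifying feature discrepancies between benign and jailbreak prompts, but it does not say how the discrepancy is measured. The example below assumes one simple possibility, the Euclidean distance between the two groups' mean feature vectors at each layer. The function names (`mean_vector`, `layer_discrepancy`, `top_layers`), the scoring rule, and the toy feature values are all hypothetical and are not the authors' implementation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def layer_discrepancy(group_a, group_b):
    """Distance between the two groups' mean features, one score per layer.

    Each group is a list of prompts; each prompt is a list of per-layer
    feature vectors (hypothetical stand-ins for model hidden states).
    """
    num_layers = len(group_a[0])
    scores = []
    for layer in range(num_layers):
        mu_a = mean_vector([prompt[layer] for prompt in group_a])
        mu_b = mean_vector([prompt[layer] for prompt in group_b])
        scores.append(math.dist(mu_a, mu_b))  # assumed scoring rule
    return scores


def top_layers(scores, k):
    """Indices of the k layers with the largest discrepancy, largest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data (made up): 2 prompts per group, 3 layers, 2-dimensional features.
benign = [
    [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
]
jailbreak = [
    [[0.0, 0.0], [4.0, 5.0], [1.0, 1.0]],
    [[0.0, 0.0], [4.0, 5.0], [1.0, 1.0]],
]

scores = layer_discrepancy(benign, jailbreak)
selected = top_layers(scores, k=1)
print(scores, selected)
```

In this toy setup the middle layer separates the two groups most strongly, so it is the one selected. A real detector would need actual hidden states from a model and, per the abstract, further module-wise and token-wise selection before any classifier is applied.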
