Fugu-MT Paper Translation (Abstract): Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

Paper abstract: Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

arxiv url: http://arxiv.org/abs/2606.07335v1
Date: Fri, 05 Jun 2026 14:49:26 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.792776
Title: Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
Title (reference translation): Defending against jailbreak attacks on large language models via Manifold Trajectory Kinetics
Authors: Hangtao Zhang, Yucheng Zhao, Sishun Liu, Ziqi Zhou, Zeyu Ye, Wei Wan, Minghui Li, Shengshan Hu, Yanjun Zhang, Yi Liu, Leo Yu Zhang
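Abstract summary: Jailbreak prompts can bypass alignment guardrails in large language models. Prior detection approaches rely largely on a fixed metric space. We show that this assumption breaks under pseudo-malicious prompts that are benign by intent but contain safety-related keywords. We present Manifold Trajectory Kinetics (MTK), which treats an LLM as a kinetic system that transforms inputs into outputs.
Reference score (independently computed attention score): 50.36375380196006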
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jailbreak prompts can bypass alignment guardrails in large language models (LLMs) and elicit unsafe outputs, making reliable deployment-time detection critical. Prior detection approaches largely rely on a fixed metric space, e.g., raw inputs, gradients, or hidden features, in which benign and jailbreak prompts are linearly separable. We show this assumption breaks under (i) pseudo-malicious prompts that are benign by intent but contain safety-related keywords, and (ii) adaptive attacks that explicitly optimize against the deployed detector. To overcome this limitation, we shift our focus from identifying a universal metric space to analyzing the more robust neighborhood structure of the underlying data manifold. We present Manifold Trajectory Kinetics (MTK), which treats an LLM as a kinetic system transforming inputs into outputs and detects jailbreaks by tracking how a prompt's neighborhood structure evolves across layers. Benign prompts remain close to benign neighborhoods throughout inference, whereas jailbreak prompts exhibit a characteristic trajectory that begins near malicious seeds and later strategically shifts toward benign neighborhoods to evade refusal. Across four LLMs and ten jailbreak attacks, MTK achieves strong robustness to both failure modes: on pseudo-malicious prompts, it attains a jailbreak true positive rate of 95% at a false positive rate of 5% on benign prompts and 2% on pseudo-malicious prompts, and under adaptive attacks, it maintains a true positive rate of 85%. We further demonstrate the superior performance of MTK for jailbreak detection in vision-language models. Our code is available at https://github.com/Rookie143/mtk.
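The idea of tracking a prompt's neighborhood across layers can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-layer feature vectors are already available, uses synthetic 2-D points in place of real LLM hidden states, and scores each layer by the fraction of the k nearest reference points that are malicious seeds. The function names, the Euclidean distance, and the value of `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def malicious_neighbor_fraction(query, benign_refs, malicious_refs, k=5):
    """Fraction of the k nearest reference points (Euclidean) that are malicious."""
    refs = np.vstack([benign_refs, malicious_refs])
    # Label 0 for benign references, 1 for malicious seeds.
    labels = np.concatenate([np.zeros(len(benign_refs)), np.ones(len(malicious_refs))])
    dists = np.linalg.norm(refs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(labels[nearest].mean())


def neighborhood_trajectory(query_layers, benign_layers, malicious_layers, k=5):
    """One malicious-neighbor fraction per layer, in layer order."""
    return [
        malicious_neighbor_fraction(q, b, m, k)
        for q, b, m in zip(query_layers, benign_layers, malicious_layers)
    ]


# Synthetic stand-in for hidden states (assumption): three "layers" of 2-D
# features, benign references near (0, 0) and malicious seeds near (5, 5).
rng = np.random.default_rng(0)
n_layers = 3
benign_layers = [rng.normal(0.0, 0.3, size=(20, 2)) for _ in range(n_layers)]
malicious_layers = [rng.normal(5.0, 0.3, size=(20, 2)) for _ in range(n_layers)]

# A jailbreak-like prompt starts near the malicious seeds and ends near the
# benign cluster; a benign prompt stays near the benign cluster throughout.
jailbreak_layers = [np.array([5.0, 5.0]), np.array([3.5, 3.5]), np.array([0.0, 0.0])]
benign_query_layers = [np.array([0.0, 0.0])] * n_layers

jailbreak_traj = neighborhood_trajectory(jailbreak_layers, benign_layers, malicious_layers)
benign_traj = neighborhood_trajectory(benign_query_layers, benign_layers, malicious_layers)
print("jailbreak-like trajectory:", jailbreak_traj)
print("benign trajectory:", benign_traj)
```

In this toy setup the jailbreak-like trajectory falls from mostly malicious neighbors to mostly benign ones, while the benign trajectory stays flat. A detector in this spirit would classify on the shape of the per-layer sequence rather than on a single layer's features; how MTK actually builds and scores trajectories is described in the paper and its code, not here.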
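Abstract (reference translation): Jailbreak prompts can bypass the alignment guardrails of large language models (LLMs) and elicit unsafe outputs, which makes reliable deployment-time detection critical. Prior detection approaches rely largely on a fixed metric space, such as raw inputs, gradients, or hidden features, in which benign and jailbreak prompts are assumed to be linearly separable. We show that this assumption breaks under (i) pseudo-malicious prompts that are benign by intent but contain safety-related keywords, and (ii) adaptive attacks that explicitly optimize against the deployed detector. To overcome this limitation, we shift our focus from identifying a universal metric space to analyzing the more robust neighborhood structure of the underlying data manifold. We present Manifold Trajectory Kinetics (MTK), which treats an LLM as a kinetic system that transforms inputs into outputs. Across four LLMs and ten jailbreak attacks, MTK is robust to both failure modes: on pseudo-malicious prompts, it attains a jailbreak true positive rate of 95% at a false positive rate of 5% on benign prompts and 2% on pseudo-malicious prompts, and under adaptive attacks it maintains a true positive rate of 85%. We further demonstrate the superior performance of MTK for jailbreak detection in vision-language models. Our code is available at https://github.com/Rookie143/mtk.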


Related papers