Fugu-MT Paper Translation (Abstract): The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Paper Abstract: The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

arxiv url: http://arxiv.org/abs/2603.27412v1
Date: Sat, 28 Mar 2026 21:19:58 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.9445
Title: The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
Authors: Isaac Llorente-Saguer
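Abstract summary: This work presents LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. It evaluates two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families. LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts.
Reference score (attention score computed independently by the site): 0.0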
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
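The scoring procedure described in the abstract (leading principal component of the normative activations, angle to that direction, Gaussian negative log-likelihood of the angle) can be sketched roughly as below. This is an illustrative sketch, not the authors' implementation. The function names are hypothetical, and random synthetic vectors stand in for real residual-stream activations. It also assumes that the principal component is computed on mean-centred data, that its arbitrary sign is oriented towards the normative mean, and that the angle is measured on the raw activation vectors. The abstract does not specify any of these details.

```python
import numpy as np


def angles(acts, direction):
    """Angle (radians) between each activation vector and the reference direction."""
    cos = acts @ direction / np.linalg.norm(acts, axis=1)
    return np.arccos(np.clip(cos, -1.0, 1.0))


def fit_reference(normative_acts):
    """Fit the reference direction and a Gaussian over the normative angles."""
    mean = normative_acts.mean(axis=0)
    # Leading principal component via SVD of the centred activations (assumption).
    _, _, vt = np.linalg.svd(normative_acts - mean, full_matrices=False)
    direction = vt[0]
    # The sign of a principal component is arbitrary; orient it towards the mean.
    if direction @ mean < 0:
        direction = -direction
    theta = angles(normative_acts, direction)
    return direction, theta.mean(), theta.std()


def anomaly_score(acts, direction, mu, sigma):
    """Negative log-likelihood of each angle under N(mu, sigma^2)."""
    theta = angles(acts, direction)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (theta - mu) ** 2 / (2 * sigma**2)


# Synthetic stand-in for activations of 200 normative prompts (64 dimensions).
rng = np.random.default_rng(0)
axis = np.zeros(64)
axis[0] = 1.0
normative = rng.normal(4.0, 3.0, size=(200, 1)) * axis + rng.normal(size=(200, 64))
direction, mu, sigma = fit_reference(normative)

# Two probes at opposite angular extremes: aligned with and opposed to the reference.
probes = np.stack([10.0 * direction, -10.0 * direction])
print("mean normative score:", anomaly_score(normative, direction, mu, sigma).mean())
print("probe scores:", anomaly_score(probes, direction, mu, sigma))
```

Because the score depends only on the squared distance of $\theta$ from the normative mean, a probe is flagged whether its angle is smaller or larger than typical. This matches the direction-agnostic behaviour the abstract motivates with the opposite ring orientations of the two model families.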
