Fugu-MT paper translation (abstract): Before the Last Token: Diagnosing Final-Token Safety Probe Failures

Paper abstract: Before the Last Token: Diagnosing Final-Token Safety Probe Failures

arxiv url: http://arxiv.org/abs/2605.12726v1
Date: Tue, 12 May 2026 20:30:24 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.675319
Title: Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Authors: Shravan Doda
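Abstract summary: Final-token safety probes monitor a single hidden state after prompt prefill. Using SafeSwitch-style probes trained only on clean harmful and benign prompts, the paper studies the prefill-time failure mode in which unsafe evidence in earlier user-token representations is missed by this readout.
Reference score (independently computed attention score): 0.0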
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes.
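The contrast the abstract draws between readout strategies can be illustrated with a small toy sketch. Everything below is an assumption made for illustration: the per-token probe scores are invented numbers, the 0.5 threshold is arbitrary, and the function names are hypothetical. In particular, `longest_run_readout` is only a crude stand-in for "trajectory awareness" (it requires several consecutive high-scoring tokens) and is not the PCA-HMM model described in the paper.

```python
import numpy as np


def final_token_readout(token_scores: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a prompt using only the probe score at the last token position."""
    return bool(token_scores[-1] > threshold)


def max_pool_readout(token_scores: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a prompt if the probe score at ANY token position exceeds the threshold."""
    return bool(token_scores.max() > threshold)


def longest_run_readout(
    token_scores: np.ndarray, threshold: float = 0.5, min_run: int = 2
) -> bool:
    """Flag a prompt only if at least `min_run` consecutive tokens exceed the threshold.

    Crude illustrative stand-in for a trajectory-aware readout; NOT the paper's PCA-HMM.
    """
    run = best = 0
    for above in token_scores > threshold:
        run = run + 1 if above else 0
        best = max(best, run)
    return best >= min_run


# Hypothetical per-token probe scores (invented for illustration only).
# Jailbreak: unsafe evidence shows up mid-sequence and fades by the final token.
jailbreak = np.array([0.1, 0.8, 0.9, 0.3, 0.2])
# Safety-adjacent benign prompt: one transient spike on a single token.
benign_adjacent = np.array([0.1, 0.6, 0.2, 0.1, 0.1])

results = {
    name: {
        "final_token": final_token_readout(scores),
        "max_pool": max_pool_readout(scores),
        "longest_run": longest_run_readout(scores),
    }
    for name, scores in [("jailbreak", jailbreak), ("benign_adjacent", benign_adjacent)]
}

for name, flags in results.items():
    print(name, flags)
```

On these made-up scores, the final-token readout misses the jailbreak, max-pooling flags both prompts (including the benign one), and the run-length readout flags only the jailbreak. This mirrors the qualitative pattern the abstract reports, but it says nothing about the actual models, probes, or numbers in the paper.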

Related papers