Fugu-MT Paper Translation (Abstract): Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Paper abstract: Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

arxiv url: http://arxiv.org/abs/2605.29629v1
Date: Thu, 28 May 2026 09:02:00 GMT
Status: Translation complete
Last updated in system: 2026-05-30 05:02:24.565478
Title: Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
Title (reference translation): Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
Authors: Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee
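Abstract summary: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation. We make these hidden paths observable from logits alone.
Reference score (independently computed attention score): 9.42946566157669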
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.
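Abstract (reference translation): Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, which shows whether a failure occurred but not how it unfolded. Two attacks that produce equally harmful outputs may have taken entirely different paths, and ASR cannot distinguish them. We make these hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that monitors a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane, meaning the same model can fail through different temporal patterns. The geometry agrees with refusal-direction probes from hidden states in most conditions, with one model showing the limit of the fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, with no false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.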
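The abstract does not give the exact definition of the compliance-refusal margin or of the early-stop rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the margin is the log-probability mass on a fixed lexicon of compliance tokens minus the mass on a fixed lexicon of refusal tokens at each decoding step, and that the early-stop rule fires once the margin stays above a threshold for several consecutive steps. The function names, the token-id lexicons, the threshold, and the patience value are all hypothetical. The calibrated 2D plane is not sketched because the abstract does not say how it is built.

```python
import math
from typing import List, Optional, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def step_margin(logits: Sequence[float],
                comply_ids: Sequence[int],
                refuse_ids: Sequence[int]) -> float:
    """Assumed margin: log P(compliance lexicon) - log P(refusal lexicon).

    The softmax normaliser cancels in the difference, so the margin can be
    computed directly from raw logits.
    """
    comply = logsumexp([logits[i] for i in comply_ids])
    refuse = logsumexp([logits[i] for i in refuse_ids])
    return comply - refuse


def margin_trace(all_logits: Sequence[Sequence[float]],
                 comply_ids: Sequence[int],
                 refuse_ids: Sequence[int]) -> List[float]:
    """Margin at every decoding step, given one logit vector per step."""
    return [step_margin(step, comply_ids, refuse_ids) for step in all_logits]


def early_stop_step(trace: Sequence[float],
                    threshold: float,
                    patience: int) -> Optional[int]:
    """Hypothetical rule: return the first step index at which the margin has
    exceeded `threshold` for `patience` consecutive steps, else None."""
    run = 0
    for t, m in enumerate(trace):
        run = run + 1 if m > threshold else 0
        if run >= patience:
            return t
    return None


# Toy 4-token vocabulary: id 0 stands in for a compliance token,
# id 1 for a refusal token. These ids are illustrative only.
toy_logits = [
    [0.0, 2.0, 0.0, 0.0],  # leans toward refusal
    [2.0, 0.0, 0.0, 0.0],  # leans toward compliance
    [3.0, 0.0, 0.0, 0.0],  # leans further toward compliance
]
trace = margin_trace(toy_logits, comply_ids=[0], refuse_ids=[1])
stop_at = early_stop_step(trace, threshold=1.0, patience=2)
```

In this toy run the margin starts negative and then turns positive, which is the kind of temporal pattern a single end-of-generation label would not capture. In practice the per-step logit vectors would come from the decoding loop of whatever model is being evaluated.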


Related papers