Fugu-MT Paper Translation (Summary): Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Paper overview: Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

arxiv url: http://arxiv.org/abs/2602.09629v1
Date: Tue, 10 Feb 2026 10:17:25 GMT
Status: Translation complete
Last updated in system: 2026-02-11 20:17:43.494098
Title: Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
Title (reference translation): Stop Attacking, Diagnose Defenses: A Four-Checkpoint Framework for Where LLM Safety Breaks
Authors: Hayfa Dhabhi, Kashyap Thimmaraju
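Abstract summary: Large language models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. The paper introduces the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro are evaluated across 3,312 single-turn, black-box test cases.
Reference score (independently computed attention score): 0.2291770711277359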
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6% attack success. However, WASR reveals 52.7%, a 2.3× higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72-79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
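The two-dimensional organization described in the abstract can be sketched as a small lookup table. This is only an illustration: the abstract confirms that CP1 is the input-literal checkpoint and that CP3 and CP4 are output-stage, but the assignment of CP2 to input-intent, CP3 to output-literal, and CP4 to output-intent is an assumption made here, as are all Python names.

```python
from enum import Enum
from itertools import product


class Stage(Enum):
    """Processing stage at which a safety check runs."""
    INPUT = "input"
    OUTPUT = "output"


class Level(Enum):
    """Detection level: surface text vs. underlying intent."""
    LITERAL = "literal"
    INTENT = "intent"


# Assumed numbering: iterate stages first, then levels, so that
# (input, literal) -> CP1, (input, intent) -> CP2,
# (output, literal) -> CP3, (output, intent) -> CP4.
# Only CP1 = input-literal and CP3/CP4 = output-stage come from the abstract.
CHECKPOINTS = {
    (stage, level): f"CP{index}"
    for index, (stage, level) in enumerate(product(Stage, Level), start=1)
}


def checkpoint(stage: Stage, level: Level) -> str:
    """Return the checkpoint label for a (stage, level) pair."""
    return CHECKPOINTS[(stage, level)]


for (stage, level), name in CHECKPOINTS.items():
    print(f"{name}: {stage.value}-{level.value}")
```

Keying the table on the (stage, level) pair keeps the two dimensions explicit, so results could later be grouped by either dimension.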
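Abstract (reference translation): Large language models (LLMs) deploy safety mechanisms to prevent harmful outputs, but these defenses remain vulnerable to adversarial prompts. Existing research shows that jailbreak attacks succeed, but it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. The framework organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This yields four checkpoints, CP1 through CP4. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. We evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We adopt an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6% attack success. WASR, however, reveals 52.7%, a 2.3× higher vulnerability. Output-stage defenses (CP3, CP4) are weakest at 72-79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These results suggest that current defenses are strongest at input-literal checkpoints but remain weak against intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.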
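The gap between Binary ASR and WASR can be illustrated with a minimal sketch. The abstract does not give the WASR formula or its severity weights, so the version below assumes WASR is the mean of a per-response severity weight; the three judge labels and the weights 0.0, 0.5, and 1.0 are hypothetical, and the four-response sample is made up for illustration. Only the figures 22.6% and 52.7% come from the abstract.

```python
# Hypothetical severity weights per judge label (not from the paper).
SEVERITY_WEIGHTS = {
    "refusal": 0.0,          # model declined
    "partial_leak": 0.5,     # some harmful information leaked
    "full_compliance": 1.0,  # model fully complied
}


def binary_asr(labels):
    """Fraction of responses counted as a success under a binary judgment."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(label == "full_compliance" for label in labels) / len(labels)


def wasr(labels, weights=SEVERITY_WEIGHTS):
    """Assumed WASR: mean severity weight over all responses."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(weights[label] for label in labels) / len(labels)


# Illustrative sample: one refusal, two partial leaks, one full compliance.
labels = ["refusal", "partial_leak", "partial_leak", "full_compliance"]
print(binary_asr(labels))  # 0.25
print(wasr(labels))        # 0.5

# Sanity check on the figures reported in the abstract.
reported_ratio = 52.7 / 22.6
print(round(reported_ratio, 1))  # 2.3
```

In this toy sample the partial leaks are invisible to the binary metric but contribute half weight to the weighted one, which is the kind of effect the abstract attributes to WASR.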

Related papers