Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
- URL: http://arxiv.org/abs/2602.09629v1
- Date: Tue, 10 Feb 2026 10:17:25 GMT
- Title: Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
- Authors: Hayfa Dhabhi, Kashyap Thimmaraju,
- Abstract summary: Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts.<n>We introduce the textbfFour-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent)<n>Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases.
- Score: 0.2291770711277359
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain \textit{where} defenses fail or \textit{why}. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the \textbf{Four-Checkpoint Framework}, which organizes safety mechanisms along two dimensions: processing stage (input vs.\ output) and detection level (literal vs.\ intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6\% attack success. However, WASR reveals 52.7\%, a 2.3$\times$ higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72--79\% WASR, while input-literal defenses (CP1) are strongest at 13\% WASR. Claude achieves the strongest safety (42.8\% WASR), followed by GPT-5 (55.9\%) and Gemini (59.5\%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
Related papers
- What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems.<n>We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques.<n>We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z) - AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation.<n>LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task.<n>This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z) - Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations [0.9732319879728966]
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior.<n>This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions.
arXiv Detail & Related papers (2025-11-24T09:38:11Z) - Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert the reasoning-based guardrails.<n>Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z) - DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety [55.30944259390733]
Deep research frameworks typically overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety.<n>We introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports.<n>Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini.
arXiv Detail & Related papers (2025-10-13T04:11:21Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation [29.8288014123234]
We investigate the vulnerability of intent-aware guardrails and demonstrate that large language models exhibit implicit intent detection capabilities.<n>We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives.<n>Our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses.
arXiv Detail & Related papers (2025-05-24T06:47:32Z) - Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses [10.08464073347558]
We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses.<n>We show Checkpoint-GCG to achieve up to $96%$ attack success rate (ASR) against the strongest defense.
arXiv Detail & Related papers (2025-05-21T16:43:17Z) - Decoding FL Defenses: Systemization, Pitfalls, and Remedies [16.907513505608666]
There are no guidelines for evaluating Federated Learning (FL) defenses.<n>We design a comprehensive systemization of FL defenses along three dimensions.<n>We survey 50 top-tier defense papers and identify the commonly used components in their evaluation setups.
arXiv Detail & Related papers (2025-02-03T23:14:02Z) - The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise.<n>Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety.<n>WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.