How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
- URL: http://arxiv.org/abs/2601.03288v1
- Date: Sun, 04 Jan 2026 07:54:24 GMT
- Title: How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
- Authors: Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei,
- Abstract summary: FJAR is a fine-grained jailbreak evaluation framework with anchored references.<n>We first categorize jailbreak responses into five fine-grained categories.<n>Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references.
- Score: 20.565609053126384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks present a significant challenge to the safety of Large Language Models (LLMs), yet current automated evaluation methods largely rely on coarse classifications that focus mainly on harmfulness, leading to substantial overestimation of attack success. To address this problem, we propose FJAR, a fine-grained jailbreak evaluation framework with anchored references. We first categorized jailbreak responses into five fine-grained categories: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, based on the degree to which the response addresses the malicious intent of the query. This categorization serves as the basis for FJAR. Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references by breaking down the original queries. These references guide the evaluator in determining whether the response genuinely fulfills the original query. Extensive experiments demonstrate that FJAR achieves the highest alignment with human judgment and effectively identifies the root causes of jailbreak failures, providing actionable guidance for improving attack strategies.
Related papers
- Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert the reasoning-based guardrails.<n>Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z) - JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring [45.76641811031552]
We introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework.<n>Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision.<n>We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans.
arXiv Detail & Related papers (2025-08-28T14:40:27Z) - JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering [73.962469626788]
Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus.<n>We propose JPS, underlineJailbreak MLLMs with collaborative visual underlinePerturbation and textual underlineSteering.
arXiv Detail & Related papers (2025-08-07T07:14:01Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks induce Large Language Models (LLMs) to generate harmful responses.<n>There is no consensus on evaluating jailbreaks.<n>JailbreakEval is a toolkit for evaluating jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z) - AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [29.92550386563915]
Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs)<n>We introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs.<n>We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation.
arXiv Detail & Related papers (2024-01-17T06:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.