Related papers: JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

URL: http://arxiv.org/abs/2508.20848v1
Date: Thu, 28 Aug 2025 14:40:27 GMT
Title: JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Authors: Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang,
Abstract summary: We introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework.<n>Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision.<n>We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans.
Score: 45.76641811031552
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.

Related papers

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference [20.565609053126384]
FJAR is a fine-grained jailbreak evaluation framework with anchored references.<n>We first categorize jailbreak responses into five fine-grained categories.<n>Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references.
arXiv Detail & Related papers (2026-01-04T07:54:24Z)
Self-HarmLLM: Can Large Language Model Harm Itself? [10.208363125551555]
We propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input.<n>We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions.
arXiv Detail & Related papers (2025-10-31T02:23:54Z)
GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods [10.603857042090521]
We conduct a systematic measurement study based on 37 jailbreak studies since 2022.<n>We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications.<n>We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval.
arXiv Detail & Related papers (2025-02-24T06:57:27Z)
Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans. LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses. We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against the advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks induce Large Language Models (LLMs) to generate harmful responses.<n>There is no consensus on evaluating jailbreaks.<n>JailbreakEval is a toolkit for evaluating jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Rethinking How to Evaluate Language Model Jailbreak [16.301224741410312]
We propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems.
arXiv Detail & Related papers (2024-04-09T15:54:16Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance. It scores the harmfulness of a victim model's responses to forbidden prompts. It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.