GuidedBench: Equipping Jailbreak Evaluation with Guidelines
- URL: http://arxiv.org/abs/2502.16903v1
- Date: Mon, 24 Feb 2025 06:57:27 GMT
- Title: GuidedBench: Equipping Jailbreak Evaluation with Guidelines
- Authors: Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang,
- Abstract summary: Jailbreaking methods for large language models (LLMs) have gained increasing attention for building safe and responsible AI systems.<n>In this paper, we introduce a more robust evaluation framework for jailbreak methods, with a curated harmful question dataset, detailed case-by-case evaluation guidelines, and a scoring system equipped with these guidelines.<n>Our experiments show that existing jailbreak methods exhibit better discrimination when evaluated using our benchmark.
- Score: 10.603857042090521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jailbreaking methods for large language models (LLMs) have gained increasing attention for building safe and responsible AI systems. After analyzing 35 jailbreak methods across six categories, we find that existing benchmarks, relying on universal LLM-based or keyword-matching scores, lack case-specific criteria, leading to conflicting results. In this paper, we introduce a more robust evaluation framework for jailbreak methods, with a curated harmful question dataset, detailed case-by-case evaluation guidelines, and a scoring system equipped with these guidelines. Our experiments show that existing jailbreak methods exhibit better discrimination when evaluated using our benchmark. Some jailbreak methods that claim to achieve over 90% attack success rate (ASR) on other benchmarks only reach a maximum of 30.2% on our benchmark, providing a higher ceiling for more advanced jailbreak research; furthermore, using our scoring system reduces the variance of disagreements between different evaluator LLMs by up to 76.33%. This demonstrates its ability to provide more fair and stable evaluation.
Related papers
- Rectifying Privacy and Efficacy Measurements in Machine Unlearning: A New Inference Attack Perspective [42.003102851493885]
We propose RULI (Rectified Unlearning Evaluation Framework via Likelihood Inference) to address critical gaps in the evaluation of inexact unlearning methods.<n>RULI introduces a dual-objective attack to measure both unlearning efficacy and privacy risks at a per-sample granularity.<n>Our findings reveal significant vulnerabilities in state-of-the-art unlearning methods, exposing privacy risks underestimated by existing methods.
arXiv Detail & Related papers (2025-06-16T00:30:02Z) - Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods [0.0]
This literature review consolidates the rapidly evolving field of AI safety evaluations.<n>It proposes a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks.
arXiv Detail & Related papers (2025-05-08T16:55:07Z) - Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains.<n>Existing research predominantly concentrates on the security of general large language models.<n>This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z) - Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing [1.4201040196058878]
Large Language Models (LLMs) have transformed task automation and content generation across various domains.
We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass safety measures.
Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code.
arXiv Detail & Related papers (2025-03-27T15:19:55Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [59.25318174362368]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.
We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.
We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z) - IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [64.46372846359694]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks.
Our benchmark results on 11 recently releasedVLMs reveal significant gaps in safety alignment.
For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner.
LLMs are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z) - JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks induce Large Language Models (LLMs) to generate harmful responses.<n>There is no consensus on evaluating jailbreaks.<n>JailbreakEval is a toolkit for evaluating jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Rethinking How to Evaluate Language Model Jailbreak [16.301224741410312]
We propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak.
We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems.
arXiv Detail & Related papers (2024-04-09T15:54:16Z) - JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address.
JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs)
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z) - AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [29.92550386563915]
Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs)<n>We introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs.<n>We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation.
arXiv Detail & Related papers (2024-01-17T06:42:44Z) - Towards Evaluating Transfer-based Attacks Systematically, Practically,
and Fairly [79.07074710460012]
adversarial vulnerability of deep neural networks (DNNs) has drawn great attention.
An increasing number of transfer-based methods have been developed to fool black-box DNN models.
We establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods.
arXiv Detail & Related papers (2023-11-02T15:35:58Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z) - Indicators of Attack Failure: Debugging and Improving Optimization of
Adversarial Examples [29.385242714424624]
evaluating robustness of machine-learning models to adversarial examples is a challenging problem.
We define a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks.
Our experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations.
arXiv Detail & Related papers (2021-06-18T06:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.