AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- URL: http://arxiv.org/abs/2401.09002v6
- Date: Tue, 18 Mar 2025 01:50:42 GMT
- Title: AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- Authors: Dong Shu, Chong Zhang, Mingyu Jin, Zihao Zhou, Lingyao Li, Yongfeng Zhang
- Abstract summary: Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs). We introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation.
- Score: 29.92550386563915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs). To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the attacking prompts' effectiveness. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset is a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in prompt injection.
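As a rough illustration only (not the authors' implementation, whose scoring rubric is not reproduced here), the sketch below shows how coarse-grained and fine-grained effectiveness scores in the abstract's 0-to-1 range could be aggregated over responses from several target models. The `is_jailbroken` and `rate_response` judges are hypothetical placeholders standing in for whatever binary check or graded rubric an evaluator supplies.

```python
# Illustrative sketch, not the paper's method: hypothetical judge callables
# supply the per-response decisions; these helpers only aggregate them into
# scores in [0, 1].

from typing import Callable, List


def coarse_grained_score(
    prompt: str,
    responses: List[str],
    is_jailbroken: Callable[[str, str], bool],
) -> float:
    """Fraction of target-model responses judged jailbroken (0 to 1)."""
    if not responses:
        return 0.0
    hits = sum(is_jailbroken(prompt, r) for r in responses)
    return hits / len(responses)


def fine_grained_score(
    prompt: str,
    responses: List[str],
    rate_response: Callable[[str, str], float],
) -> float:
    """Average of per-response ratings in [0, 1], allowing partial credit
    for responses that are only partially harmful."""
    if not responses:
        return 0.0
    ratings = [min(max(rate_response(prompt, r), 0.0), 1.0) for r in responses]
    return sum(ratings) / len(ratings)
```

Unlike a binary attack-success rate, the fine-grained variant lets a graded judge assign partial scores, which is the kind of nuance the abstract attributes to prompt-level evaluation.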
Related papers
- Rectifying Privacy and Efficacy Measurements in Machine Unlearning: A New Inference Attack Perspective [42.003102851493885]
We propose RULI (Rectified Unlearning Evaluation Framework via Likelihood Inference) to address critical gaps in the evaluation of inexact unlearning methods. RULI introduces a dual-objective attack to measure both unlearning efficacy and privacy risks at a per-sample granularity. Our findings reveal significant vulnerabilities in state-of-the-art unlearning methods, exposing privacy risks underestimated by existing methods.
arXiv Detail & Related papers (2025-06-16T00:30:02Z) - A Critical Evaluation of Defenses against Prompt Injection Attacks [95.81023801370073]
Large Language Models (LLMs) are vulnerable to prompt injection attacks. Several defenses have recently been proposed, often claiming to mitigate these attacks successfully. We argue that existing studies lack a principled approach to evaluating these defenses.
arXiv Detail & Related papers (2025-05-23T19:39:56Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.
We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks [26.422616504640786]
We propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space.
We create a million-scale dataset, CC1M, and use it to conduct the first million-scale adversarial robustness evaluation of adversarially-trained ImageNet models.
arXiv Detail & Related papers (2024-11-20T10:41:23Z) - The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks.
We then make a key observation: existing defense mechanisms suffer from an over-prudence problem.
We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense [42.56467639172508]
Model Inversion (MI) attacks aim at leveraging the output information of target models to reconstruct privacy-sensitive training data.
We build the first practical benchmark named MIBench for systematic evaluation of model inversion attacks and defenses.
arXiv Detail & Related papers (2024-10-07T16:13:49Z) - Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics.
Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching.
Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.
It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications [8.51254190797079]
We introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks.
Our novel evaluation method assesses models under both defenseless and defended scenarios.
Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected.
arXiv Detail & Related papers (2024-06-10T18:57:22Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - Deep-Attack over the Deep Reinforcement Learning [26.272161868927004]
Recent adversarial attack developments have made reinforcement learning more vulnerable.
We propose a reinforcement learning-based attacking framework that considers effectiveness and stealthiness simultaneously.
We also propose a new metric to evaluate the performance of the attack model in these two aspects.
arXiv Detail & Related papers (2022-05-02T10:58:19Z)