HarmBench: A Standardized Evaluation Framework for Automated Red Teaming
and Robust Refusal
- URL: http://arxiv.org/abs/2402.04249v2
- Date: Tue, 27 Feb 2024 04:43:08 GMT
- Title: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming
and Robust Refusal
- Authors: Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman
Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan
Hendrycks
- Abstract summary: HarmBench is a standardized evaluation framework for automated red teaming.
We conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses.
We also introduce a highly efficient adversarial training method that greatly enhances robustness across a wide range of attacks.
- Score: 47.40508941209001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated red teaming holds substantial promise for uncovering and mitigating
the risks associated with the malicious use of large language models (LLMs),
yet the field lacks a standardized evaluation framework to rigorously assess
new methods. To address this issue, we introduce HarmBench, a standardized
evaluation framework for automated red teaming. We identify several desirable
properties previously unaccounted for in red teaming evaluations and
systematically design HarmBench to meet these criteria. Using HarmBench, we
conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs
and defenses, yielding novel insights. We also introduce a highly efficient
adversarial training method that greatly enhances LLM robustness across a wide
range of attacks, demonstrating how HarmBench enables codevelopment of attacks
and defenses. We open source HarmBench at
https://github.com/centerforaisafety/HarmBench.
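As a rough illustration of what such a standardized evaluation involves, the sketch below computes an attack success rate (ASR) over a set of harmful behaviors. The function names and interfaces are hypothetical placeholders for illustration, not HarmBench's actual API.

```python
# Minimal sketch of a HarmBench-style evaluation loop (hypothetical interfaces,
# not the actual HarmBench API): a red-teaming method proposes a test case for
# each harmful behavior, the target LLM responds, and a classifier judges
# whether the response exhibits the behavior. The attack success rate (ASR)
# is the fraction of behaviors for which the attack succeeds.
from typing import Callable, List

def attack_success_rate(
    behaviors: List[str],               # harmful behaviors the attack tries to elicit
    attack: Callable[[str], str],       # red-teaming method: behavior -> adversarial prompt
    target: Callable[[str], str],       # target LLM: prompt -> completion
    judge: Callable[[str, str], bool],  # classifier: (behavior, completion) -> harmful?
) -> float:
    successes = 0
    for behavior in behaviors:
        prompt = attack(behavior)       # e.g. an optimized suffix or a jailbreak template
        completion = target(prompt)
        if judge(behavior, completion):
            successes += 1
    return successes / len(behaviors)
```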
Related papers
- AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration [40.350632196772466]
This paper introduces AutoRedTeamer, a novel framework for fully automated, end-to-end red teaming against large language models (LLMs).
AutoRedTeamer combines a multi-agent architecture with a memory-guided attack selection mechanism to enable continuous discovery and integration of new attack vectors.
We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B.
arXiv Detail & Related papers (2025-03-20T00:13:04Z)
- Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models [1.9574002186090496]
The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns.
Researchers have recently complemented mitigation efforts with an offensive approach that involves red teaming.
This paper provides a concise and practical overview of the LLM red teaming literature.
arXiv Detail & Related papers (2025-03-03T17:04:22Z)
- Automated Progressive Red Teaming [38.723546092060666]
Manual red teaming is time-consuming, costly and lacks scalability.
We propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework.
APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts adversarial prompts, and an Evil Maker to manage prompt diversity and filter ineffective samples.
arXiv Detail & Related papers (2024-07-04T12:14:27Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs as evaluators, which can be computationally expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes LLMs' vulnerability to jailbreaks to reward misspecification during the alignment process.
We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness.
We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- BruSLeAttack: A Query-Efficient Score-Based Black-Box Sparse Adversarial Attack [22.408968332454062]
We study the unique, less well-understood problem of generating sparse adversarial samples simply by observing the score-based replies to model queries.
We develop BruSLeAttack, a new, faster (more query-efficient) algorithm for this problem.
Our work facilitates faster evaluation of model vulnerabilities and raises vigilance about the safety, security, and reliability of deployed systems.
arXiv Detail & Related papers (2024-04-08T08:59:26Z)
- Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment [32.2246459413988]
We propose RED-EVAL, a new safety evaluation benchmark that carries out red-teaming.
We show that even widely deployed models are susceptible to Chain of Utterances-based (CoU) prompting.
We also demonstrate the consistency of RED-EVAL across 8 open-source LLMs, which generate harmful responses in more than 86% of red-teaming attempts.
arXiv Detail & Related papers (2023-08-18T16:27:04Z)
- FLIRT: Feedback Loop In-context Red Teaming [79.63896510559357]
We propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
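To make the feedback-loop idea in the FLIRT entry above concrete, here is a minimal sketch assuming hypothetical attacker, target, and scoring functions; it is an illustration of the general approach, not the authors' implementation. The attacker keeps a small in-context pool of the most effective prompts found so far and proposes new prompts conditioned on that pool.

```python
# Hypothetical sketch of feedback-loop in-context red teaming in the spirit of FLIRT.
# All function names are assumptions: attacker_generate conditions on in-context
# exemplars, target_generate produces the target model's response, and unsafe_score
# rates how unsafe that response is (e.g. a safety classifier's probability).
def feedback_loop_red_team(attacker_generate, target_generate, unsafe_score,
                           seed_prompts, num_rounds=20, pool_size=5):
    # Score the seed prompts once to initialize the in-context pool.
    pool = [(p, unsafe_score(target_generate(p))) for p in seed_prompts]
    for _ in range(num_rounds):
        exemplars = [p for p, _ in pool]                  # current in-context examples
        candidate = attacker_generate(exemplars)          # propose a new adversarial prompt
        score = unsafe_score(target_generate(candidate))  # how unsafe is the response?
        pool.append((candidate, score))
        pool = sorted(pool, key=lambda x: x[1], reverse=True)[:pool_size]  # keep the best
    return pool                                           # most effective prompts found
```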