MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- URL: http://arxiv.org/abs/2311.07689v1
- Date: Mon, 13 Nov 2023 19:13:29 GMT
- Title: MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- Authors: Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan
Wang, Jiawei Han, Yuning Mao
- Abstract summary: We propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation.
On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART.
Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.
- Score: 72.2127916030909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Red-teaming is a common practice for mitigating unsafe behaviors in Large
Language Models (LLMs), which involves thoroughly assessing LLMs to identify
potential flaws and addressing them with responsible and accurate responses.
While effective, manual red-teaming is costly, and existing automatic
red-teaming typically discovers safety risks without addressing them. In this
paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which
incorporates both automatic adversarial prompt writing and safe response
generation, significantly increasing red-teaming scalability and the safety of
the target LLM. Specifically, an adversarial LLM and a target LLM interplay
with each other in an iterative manner, where the adversarial LLM aims to
generate challenging prompts that elicit unsafe responses from the target LLM,
while the target LLM is fine-tuned with safety aligned data on these
adversarial prompts. In each round, the adversarial LLM crafts better attacks
on the updated target LLM, while the target LLM also improves itself through
safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an
LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART,
achieving comparable performance to LLMs with extensive adversarial prompt
writing. Notably, model helpfulness on non-adversarial prompts remains stable
throughout iterations, indicating the target LLM maintains strong performance
on instruction following.
Related papers
- Automated Progressive Red Teaming [38.723546092060666]
Manual red teaming is time-consuming, costly and lacks scalability.
We propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework.
APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts adversarial prompts, and an Evil Maker to manage prompt diversity and filter ineffective samples.
arXiv Detail & Related papers (2024-07-04T12:14:27Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability [25.750371424096436]
Large Language Models (LLMs) are increasingly deployed in various applications.
Our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance.
We introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability.
arXiv Detail & Related papers (2024-05-23T12:19:59Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvements of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.