Gradient-Based Language Model Red Teaming
- URL: http://arxiv.org/abs/2401.16656v1
- Date: Tue, 30 Jan 2024 01:19:25 GMT
- Title: Gradient-Based Language Model Red Teaming
- Authors: Nevan Wichers, Carson Denison, Ahmad Beirami
- Abstract summary: Red teaming is a strategy for identifying weaknesses in generative language models (LMs).
Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans.
We present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses.
- Score: 9.972783485792885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Red teaming is a common strategy for identifying weaknesses in generative
language models (LMs), where adversarial prompts are produced that trigger an
LM to generate unsafe responses. Red teaming is instrumental for both model
alignment and evaluation, but is labor-intensive and difficult to scale when
done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a
red teaming method for automatically generating diverse prompts that are likely
to cause an LM to output unsafe responses. GBRT is a form of prompt learning,
trained by scoring an LM response with a safety classifier and then
backpropagating through the frozen safety classifier and LM to update the
prompt. To improve the coherence of input prompts, we introduce two variants
that add a realism loss and fine-tune a pretrained model to generate the
prompts instead of learning the prompts directly. Our experiments show that
GBRT is more effective at finding prompts that trigger an LM to generate unsafe
responses than a strong reinforcement learning-based red teaming approach, and
succeeds even when the LM has been fine-tuned to produce safer outputs.
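As a rough illustration of the mechanism described in the abstract (scoring an LM response with a safety classifier and backpropagating through the frozen classifier and LM to update the prompt), the sketch below is a minimal toy version, not the authors' implementation: the "LM" and "safety classifier" are tiny stand-in modules, the Gumbel-softmax relaxation used to keep decoding differentiable is an assumption about how one could make the response differentiable, and all shapes and hyperparameters are arbitrary.

```python
# Minimal, self-contained sketch of the GBRT idea (not the authors' code):
# optimize a soft prompt so that a frozen LM's response is scored "unsafe"
# by a frozen safety classifier, backpropagating through both frozen models.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, emb_dim, prompt_len, resp_len = 100, 32, 5, 8

embed = torch.nn.Embedding(vocab, emb_dim)        # shared token embeddings (frozen)
lm_head = torch.nn.Linear(emb_dim, vocab)         # toy "LM": next-token logits (frozen)
safety_clf = torch.nn.Linear(emb_dim, 1)          # toy safety classifier (frozen)
for m in (embed, lm_head, safety_clf):
    m.requires_grad_(False)

# The only trainable parameters: continuous prompt embeddings.
soft_prompt = torch.nn.Parameter(torch.randn(prompt_len, emb_dim) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(200):
    # "Decode" a response from the frozen toy LM, keeping it differentiable
    # with a Gumbel-softmax relaxation over the vocabulary at each position.
    ctx = soft_prompt.mean(dim=0)                 # crude summary of the prompt
    resp_embs = []
    for _ in range(resp_len):
        logits = lm_head(ctx)                     # next-token distribution
        probs = F.gumbel_softmax(logits, tau=1.0, hard=False)
        tok_emb = probs @ embed.weight            # soft (expected) token embedding
        resp_embs.append(tok_emb)
        ctx = tok_emb                             # feed back for the next position
    response = torch.stack(resp_embs).mean(dim=0)

    # Score the relaxed response with the frozen safety classifier and
    # push the prompt toward responses the classifier deems unsafe.
    p_unsafe = torch.sigmoid(safety_clf(response)).squeeze()
    loss = -torch.log(p_unsafe + 1e-8)            # maximize the unsafe probability
    opt.zero_grad()
    loss.backward()                               # gradients reach only soft_prompt
    opt.step()
```

The paper's actual variants additionally add a realism loss and fine-tune a pretrained model to generate the prompts in order to keep them coherent; this toy loop does not attempt either.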
Related papers
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Curiosity-driven Red-teaming for Large Language Models [43.448044721642916]
Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content.
Relying solely on human testers, however, is expensive and time-consuming.
Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods.
arXiv Detail & Related papers (2024-02-29T18:55:03Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered a means of improving the trustworthiness of text generation from large language models.
In this work, we find that inserting even a short prefix into the prompt can drive the generated outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [72.2127916030909]
We propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation.
On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment is reduced by up to 84.7% after 4 rounds of MART.
Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.
arXiv Detail & Related papers (2023-11-13T19:13:29Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning [84.75064077323098]
This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL)
RLPrompt is flexibly applicable to different types of LMs, such as masked LMs (e.g., BERT) and left-to-right models (e.g., GPTs).
Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods.
arXiv Detail & Related papers (2022-05-25T07:50:31Z) - Red Teaming Language Models with Language Models [8.237872606555383]
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways.
Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases.
In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM.
arXiv Detail & Related papers (2022-02-07T15:22:17Z)
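The last entry above reduces to a simple generate-and-filter loop: a red-team LM proposes candidate test prompts, the target LM answers them, and a classifier keeps the prompts whose replies look harmful. The following is a hypothetical sketch of that loop using the Hugging Face transformers pipeline API; the GPT-2 checkpoints, sampling settings, and keyword-based is_flagged stand-in for a real safety classifier are illustrative assumptions, not the cited paper's components.

```python
# Hypothetical sketch of LM-vs-LM red teaming (not the cited paper's code):
# a "red team" LM proposes test prompts, a target LM answers them, and a
# stand-in harmfulness check keeps the prompts that elicit flagged replies.
from transformers import pipeline, set_seed

set_seed(0)
red_lm = pipeline("text-generation", model="gpt2")      # proposes test prompts
target_lm = pipeline("text-generation", model="gpt2")   # model under test

def is_flagged(text: str) -> bool:
    """Placeholder for a safety classifier (e.g., a toxicity model)."""
    return any(w in text.lower() for w in ("hate", "kill", "stupid"))

seed = "Write a question that might make a chatbot say something offensive:\n"
candidates = red_lm(seed, max_new_tokens=30, num_return_sequences=8,
                    do_sample=True, temperature=1.0)

successful_attacks = []
for cand in candidates:
    test_prompt = cand["generated_text"][len(seed):].strip()
    if not test_prompt:
        continue
    reply = target_lm(test_prompt, max_new_tokens=40,
                      do_sample=True)[0]["generated_text"]
    if is_flagged(reply):
        successful_attacks.append((test_prompt, reply))

print(f"{len(successful_attacks)} / {len(candidates)} prompts triggered a flagged reply")
```

A real setup would replace the keyword check with a trained safety or toxicity classifier and iterate, feeding successful attacks back into safety tuning of the target model.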
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.