Explore, Establish, Exploit: Red Teaming Language Models from Scratch
- URL: http://arxiv.org/abs/2306.09442v3
- Date: Wed, 11 Oct 2023 00:37:33 GMT
- Title: Explore, Establish, Exploit: Red Teaming Language Models from Scratch
- Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan
Hadfield-Menell
- Abstract summary: We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
- Score: 7.949645304649025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LMs) can pose hazards from harmful outputs
such as toxic or false text. Prior work has introduced automated tools that
elicit harmful outputs to identify these risks. While this is a valuable step
toward securing models, these approaches rely on a pre-existing way to
efficiently classify undesirable outputs. Using a pre-existing classifier does
not allow for red-teaming to be tailored to the target model. Furthermore, when
failures can be easily classified in advance, red-teaming has limited marginal
value because problems can be avoided by simply filtering training data and/or
model outputs. Here, we consider red-teaming "from scratch," in which the
adversary does not begin with a way to classify failures. Our framework
consists of three steps: 1) Exploring the model's range of behaviors in the
desired context; 2) Establishing a definition and measurement for undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure to develop diverse adversarial
prompts. We use this approach to red-team GPT-3 to discover classes of inputs
that elicit false statements. In doing so, we construct the CommonClaim dataset
of 20,000 statements labeled by humans as common-knowledge-true,
common-knowledge-false, or neither. We are making code and data available.
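As a rough illustration of the three-step framework in the abstract, the sketch below mocks up an explore/establish/exploit loop. The helper functions, the TF-IDF classifier, and the toy model are illustrative assumptions, not the paper's implementation; the paper trains its measure from human labels and then searches for diverse adversarial prompts with it.

```python
# A minimal sketch of the explore/establish/exploit loop described in the
# abstract. The helpers, the TF-IDF classifier, and the toy model are
# illustrative assumptions, not the paper's implementation.

from typing import Callable, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def explore(model: Callable[[str], str], prompts: List[str]) -> List[Tuple[str, str]]:
    """Step 1 (Explore): sample the model's behavior in the intended context."""
    return [(p, model(p)) for p in prompts]


def establish(outputs: List[str], human_labels: List[int]) -> Callable[[str], float]:
    """Step 2 (Establish): fit a measure of undesired behavior from human
    evaluations -- here a simple TF-IDF + logistic-regression classifier."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(outputs)
    clf = LogisticRegression().fit(features, human_labels)
    return lambda text: float(clf.predict_proba(vectorizer.transform([text]))[0, 1])


def exploit(model: Callable[[str], str], score: Callable[[str], float],
            candidates: List[str], threshold: float = 0.5) -> List[str]:
    """Step 3 (Exploit): keep prompts whose outputs the measure flags as
    undesired. (The paper searches for diverse adversarial prompts using the
    measure; a filter over fixed candidates stands in for that search here.)"""
    return [p for p in candidates if score(model(p)) >= threshold]


def toy_model(prompt: str) -> str:
    """Stand-in for the target LM."""
    return "the moon is made of cheese" if "moon" in prompt else "water is wet"


pairs = explore(toy_model, ["tell me about the moon", "tell me about water"])
measure = establish([output for _, output in pairs], human_labels=[1, 0])  # 1 = undesired
print(exploit(toy_model, measure, ["moon facts please", "water facts please"]))
```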
Related papers
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
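One way to picture a zero-shot self-debiasing pass is a simple two-step prompt in which the model critiques and rewrites its own answer, as sketched below; the prompt wording and the generate() interface are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative two-pass, zero-shot self-debiasing loop: the model answers,
# then is asked to detect and remove stereotyping in its own answer.
# `generate` stands in for any completion/chat API; the prompts are assumptions.

from typing import Callable


def self_debias(generate: Callable[[str], str], question: str) -> str:
    draft = generate(question)
    revision_prompt = (
        "Does the following answer rely on stereotypes about any social group?\n"
        "If so, rewrite it without the stereotypes; otherwise repeat it unchanged.\n\n"
        f"Question: {question}\nAnswer: {draft}\nRevised answer:"
    )
    return generate(revision_prompt)
```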
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal shortcuts within models that lead to over-attention to harmful words like 'kill'; prompts that emphasize safety can exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
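A contrastive-decoding step in the spirit of Self-CD can be sketched as below, assuming the method compares the model's next-token logits with and without a safety-emphasizing system prompt and then downplays the difference; the combination rule and the alpha coefficient are assumptions, not details from the paper.

```python
# Minimal sketch of a Self-CD-style contrastive step: the same model is run
# with and without a safety-emphasizing system prompt, and the shift that the
# safety prompt induces is downweighted before picking the next token.
# The exact formula and alpha are assumptions.

import numpy as np


def self_contrastive_logits(logits_plain: np.ndarray,
                            logits_safety: np.ndarray,
                            alpha: float = 0.5) -> np.ndarray:
    """Downweight the shift induced by the safety-emphasizing prompt."""
    over_attention = logits_safety - logits_plain
    return logits_plain - alpha * over_attention


# Toy usage with a 4-token vocabulary; the last token plays the role of a refusal.
plain = np.array([2.0, 0.5, 0.1, -1.0])
safety = np.array([0.5, 0.5, 0.1, 2.0])
adjusted = self_contrastive_logits(plain, safety)
next_token = int(np.argmax(adjusted))
```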
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- Anti-LM Decoding for Zero-shot In-context Machine Translation [59.26037416204157]
This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of In-context Machine Translation.
We conduct experiments across 3 model types and sizes, 3 language directions, and for both greedy decoding and beam search.
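A decoding objective of this general shape can be written down as follows, assuming the anti-LM term is a contrastive next-token distribution whose penalty decays as gamma**t over decoding steps; what the anti-LM conditions on and the exact decay schedule are assumptions, not details from the summary.

```python
# Rough sketch of an anti-LM-style decoding objective with a decay factor:
# the next-token log-probabilities of a contrastive "anti" context are
# subtracted from the full-context log-probabilities, with the penalty fading
# over decoding steps. The anti context and gamma**t schedule are assumptions.

import numpy as np


def anti_lm_scores(logprobs_full: np.ndarray,
                   logprobs_anti: np.ndarray,
                   step: int,
                   gamma: float = 0.9) -> np.ndarray:
    """Combine the two next-token distributions at decoding step `step`."""
    return logprobs_full - (gamma ** step) * logprobs_anti


# Toy usage: the penalty is strong at early steps and fades later.
full = np.log(np.array([0.6, 0.3, 0.1]))
anti = np.log(np.array([0.7, 0.2, 0.1]))  # e.g., tokens that merely continue the source
print(np.argmax(anti_lm_scores(full, anti, step=0)))
print(np.argmax(anti_lm_scores(full, anti, step=20)))
```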
arXiv Detail & Related papers (2023-11-14T17:09:43Z)
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior so that it acts as a helpful agent regardless of the harmfulness of the query.
We present a new perspective on safety research i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z)
- Probing LLMs for hate speech detection: strengths and vulnerabilities [8.626059038321724]
We utilise different prompt variations and input information, and evaluate large language models in a zero-shot setting.
We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans.
We find that on average including the target information in the pipeline improves the model performance substantially.
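The role of target information can be illustrated with a simple zero-shot prompt template, sketched below; the wording and the ask_model interface are illustrative assumptions rather than the exact prompts used in the paper.

```python
# Sketch of a zero-shot hate-speech-detection prompt, with and without the
# target-community information that the summary above says helps on average.
# The template wording and the `ask_model` interface are assumptions.

from typing import Callable, Optional


def build_prompt(post: str, target: Optional[str] = None) -> str:
    prompt = (
        "Classify the following post as 'hate speech' or 'not hate speech'. "
        "Answer with one label only.\n\n"
    )
    if target is not None:
        prompt += f"The post may target the following community: {target}.\n"
    return prompt + f"Post: {post}\nLabel:"


def classify(ask_model: Callable[[str], str], post: str,
             target: Optional[str] = None) -> str:
    return ask_model(build_prompt(post, target)).strip().lower()
```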
arXiv Detail & Related papers (2023-10-19T16:11:02Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
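A small sketch of an unsupervised contrast-consistent probe conveys the idea: a probe over paired activations for a statement and its negation is trained so that the two probabilities are consistent and confident, with no labels. The probe architecture and training settings below are simplified assumptions.

```python
# Sketch of an unsupervised contrast-consistent probe over hidden activations:
# for each statement we have activations for "X is true" (pos) and "X is false"
# (neg); a linear probe is trained so that p_pos is close to 1 - p_neg and both
# are confident, using no labels. Architecture and optimizer are assumptions.

import torch


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


def train_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                steps: int = 200, lr: float = 1e-2) -> torch.nn.Module:
    probe = torch.nn.Sequential(
        torch.nn.Linear(acts_pos.shape[1], 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```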
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [10.836210010868932]
We investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types.
We release our dataset of 38,961 red team attacks for others to analyze and learn from.
arXiv Detail & Related papers (2022-08-23T23:37:14Z)
- Capturing Failures of Large Language Models via Human Cognitive Biases [18.397404180932373]
We show that OpenAI's Codex errs based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples.
Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.
arXiv Detail & Related papers (2022-02-24T18:58:52Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)