Explore, Establish, Exploit: Red Teaming Language Models from Scratch
- URL: http://arxiv.org/abs/2306.09442v3
- Date: Wed, 11 Oct 2023 00:37:33 GMT
- Title: Explore, Establish, Exploit: Red Teaming Language Models from Scratch
- Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan
Hadfield-Menell
- Abstract summary: We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
- Score: 7.949645304649025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LMs) can pose hazards from harmful outputs
such as toxic or false text. Prior work has introduced automated tools that
elicit harmful outputs to identify these risks. While this is a valuable step
toward securing models, these approaches rely on a pre-existing way to
efficiently classify undesirable outputs. Using a pre-existing classifier does
not allow for red-teaming to be tailored to the target model. Furthermore, when
failures can be easily classified in advance, red-teaming has limited marginal
value because problems can be avoided by simply filtering training data and/or
model outputs. Here, we consider red-teaming "from scratch," in which the
adversary does not begin with a way to classify failures. Our framework
consists of three steps: 1) Exploring the model's range of behaviors in the
desired context; 2) Establishing a definition and measurement for undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure to develop diverse adversarial
prompts. We use this approach to red-team GPT-3 to discover classes of inputs
that elicit false statements. In doing so, we construct the CommonClaim dataset
of 20,000 statements labeled by humans as common-knowledge-true,
common-knowledge-false, or neither. We are making code and data available.
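As a rough illustration of the three-step framework in the abstract, the sketch below mocks up an explore/establish/exploit loop. The helper functions, the TF-IDF classifier, and the toy model are illustrative assumptions, not the paper's implementation; the paper trains its measure from human labels and then searches for diverse adversarial prompts with it.

```python
# A minimal sketch of the explore/establish/exploit loop described in the
# abstract. The helpers, the TF-IDF classifier, and the toy model are
# illustrative assumptions, not the paper's implementation.

from typing import Callable, List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def explore(model: Callable[[str], str], prompts: List[str]) -> List[Tuple[str, str]]:
    """Step 1 (Explore): sample the model's behavior in the intended context."""
    return [(p, model(p)) for p in prompts]


def establish(outputs: List[str], human_labels: List[int]) -> Callable[[str], float]:
    """Step 2 (Establish): fit a measure of undesired behavior from human
    evaluations -- here a simple TF-IDF + logistic-regression classifier."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(outputs)
    clf = LogisticRegression().fit(features, human_labels)
    return lambda text: float(clf.predict_proba(vectorizer.transform([text]))[0, 1])


def exploit(model: Callable[[str], str], score: Callable[[str], float],
            candidates: List[str], threshold: float = 0.5) -> List[str]:
    """Step 3 (Exploit): keep prompts whose outputs the measure flags as
    undesired. (The paper searches for diverse adversarial prompts using the
    measure; a filter over fixed candidates stands in for that search here.)"""
    return [p for p in candidates if score(model(p)) >= threshold]


def toy_model(prompt: str) -> str:
    """Stand-in for the target LM."""
    return "the moon is made of cheese" if "moon" in prompt else "water is wet"


pairs = explore(toy_model, ["tell me about the moon", "tell me about water"])
measure = establish([output for _, output in pairs], human_labels=[1, 0])  # 1 = undesired
print(exploit(toy_model, measure, ["moon facts please", "water facts please"]))
```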
Related papers
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
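One way to picture a zero-shot self-debiasing pass is a simple two-step prompt in which the model critiques and rewrites its own answer, as sketched below; the prompt wording and the generate() interface are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative two-pass, zero-shot self-debiasing loop: the model answers,
# then is asked to detect and remove stereotyping in its own answer.
# `generate` stands in for any completion/chat API; the prompts are assumptions.

from typing import Callable


def self_debias(generate: Callable[[str], str], question: str) -> str:
    draft = generate(question)
    revision_prompt = (
        "Does the following answer rely on stereotypes about any social group?\n"
        "If so, rewrite it without the stereotypes; otherwise repeat it unchanged.\n\n"
        f"Question: {question}\nAnswer: {draft}\nRevised answer:"
    )
    return generate(revision_prompt)
```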
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal shortcuts within models that lead to over-attention to harmful words like 'kill'; prompts that emphasize safety can exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
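A contrastive-decoding step in the spirit of Self-CD can be sketched as below, assuming the method compares the model's next-token logits with and without a safety-emphasizing system prompt and then downplays the difference; the combination rule and the alpha coefficient are assumptions, not details from the paper.

```python
# Minimal sketch of a Self-CD-style contrastive step: the same model is run
# with and without a safety-emphasizing system prompt, and the shift that the
# safety prompt induces is downweighted before picking the next token.
# The exact formula and alpha are assumptions.

import numpy as np


def self_contrastive_logits(logits_plain: np.ndarray,
                            logits_safety: np.ndarray,
                            alpha: float = 0.5) -> np.ndarray:
    """Downweight the shift induced by the safety-emphasizing prompt."""
    over_attention = logits_safety - logits_plain
    return logits_plain - alpha * over_attention


# Toy usage with a 4-token vocabulary; the last token plays the role of a refusal.
plain = np.array([2.0, 0.5, 0.1, -1.0])
safety = np.array([0.5, 0.5, 0.1, 2.0])
adjusted = self_contrastive_logits(plain, safety)
next_token = int(np.argmax(adjusted))
```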
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- Anti-LM Decoding for Zero-shot In-context Machine Translation [59.26037416204157]
This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of In-context Machine Translation.
We conduct experiments across 3 model types and sizes, 3 language directions, and for both greedy decoding and beam search.
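A decoding objective of this general shape can be written down as follows, assuming the anti-LM term is a contrastive next-token distribution whose penalty decays as gamma**t over decoding steps; what the anti-LM conditions on and the exact decay schedule are assumptions, not details from the summary.

```python
# Rough sketch of an anti-LM-style decoding objective with a decay factor:
# the next-token log-probabilities of a contrastive "anti" context are
# subtracted from the full-context log-probabilities, with the penalty fading
# over decoding steps. The anti context and gamma**t schedule are assumptions.

import numpy as np


def anti_lm_scores(logprobs_full: np.ndarray,
                   logprobs_anti: np.ndarray,
                   step: int,
                   gamma: float = 0.9) -> np.ndarray:
    """Combine the two next-token distributions at decoding step `step`."""
    return logprobs_full - (gamma ** step) * logprobs_anti


# Toy usage: the penalty is strong at early steps and fades later.
full = np.log(np.array([0.6, 0.3, 0.1]))
anti = np.log(np.array([0.7, 0.2, 0.1]))  # e.g., tokens that merely continue the source
print(np.argmax(anti_lm_scores(full, anti, step=0)))
print(np.argmax(anti_lm_scores(full, anti, step=20)))
```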
arXiv Detail & Related papers (2023-11-14T17:09:43Z)
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior so that it acts as a helpful agent regardless of the harmfulness of the query.
We present a new perspective on safety research i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z)
- Probing LLMs for hate speech detection: strengths and vulnerabilities [8.626059038321724]
We utilise different prompt variations and input information, and evaluate large language models in a zero-shot setting.
We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans.
We find that on average including the target information in the pipeline improves the model performance substantially.
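The role of target information can be illustrated with a simple zero-shot prompt template, sketched below; the wording and the ask_model interface are illustrative assumptions rather than the exact prompts used in the paper.

```python
# Sketch of a zero-shot hate-speech-detection prompt, with and without the
# target-community information that the summary above says helps on average.
# The template wording and the `ask_model` interface are assumptions.

from typing import Callable, Optional


def build_prompt(post: str, target: Optional[str] = None) -> str:
    prompt = (
        "Classify the following post as 'hate speech' or 'not hate speech'. "
        "Answer with one label only.\n\n"
    )
    if target is not None:
        prompt += f"The post may target the following community: {target}.\n"
    return prompt + f"Post: {post}\nLabel:"


def classify(ask_model: Callable[[str], str], post: str,
             target: Optional[str] = None) -> str:
    return ask_model(build_prompt(post, target)).strip().lower()
```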
arXiv Detail & Related papers (2023-10-19T16:11:02Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
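A small sketch of an unsupervised contrast-consistent probe conveys the idea: a probe over paired activations for a statement and its negation is trained so that the two probabilities are consistent and confident, with no labels. The probe architecture and training settings below are simplified assumptions.

```python
# Sketch of an unsupervised contrast-consistent probe over hidden activations:
# for each statement we have activations for "X is true" (pos) and "X is false"
# (neg); a linear probe is trained so that p_pos is close to 1 - p_neg and both
# are confident, using no labels. Architecture and optimizer are assumptions.

import torch


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


def train_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                steps: int = 200, lr: float = 1e-2) -> torch.nn.Module:
    probe = torch.nn.Sequential(
        torch.nn.Linear(acts_pos.shape[1], 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```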
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [10.836210010868932]
We investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types.
We release our dataset of 38,961 red team attacks for others to analyze and learn from.
arXiv Detail & Related papers (2022-08-23T23:37:14Z)
- Capturing Failures of Large Language Models via Human Cognitive Biases [18.397404180932373]
We show that OpenAI's Codex errs based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples.
Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.
arXiv Detail & Related papers (2022-02-24T18:58:52Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)