Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors,
and Lessons Learned
- URL: http://arxiv.org/abs/2209.07858v2
- Date: Tue, 22 Nov 2022 19:12:57 GMT
- Title: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors,
and Lessons Learned
- Authors: Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao
Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal
Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn
Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom
Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston,
Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei,
Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack
Clark
- Abstract summary: We investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types.
We release our dataset of 38,961 red team attacks for others to analyze and learn from.
- Score: 10.836210010868932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
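The rejection-sampling variant mentioned above can be pictured as a best-of-k filter over a harmlessness preference model. The sketch below is a minimal illustration under that assumption; `generate` and `harmlessness_score` are hypothetical stand-ins for an LM sampler and a learned preference model, not the authors' actual interfaces.

```python
from typing import Callable, List

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],                    # hypothetical: draw one LM completion
    harmlessness_score: Callable[[str, str], float],   # hypothetical: preference-model score
    k: int = 16,                                       # illustrative sample budget
) -> str:
    """Best-of-k sampling: return the completion the preference model rates most harmless."""
    candidates: List[str] = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: harmlessness_score(prompt, c))
```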
Related papers
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets.
Taking inspiration from human learning techniques like spaced repetition, we hypothesize that random data sampling leads to high training costs and low-quality models that tend to forget data.
In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
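As a rough illustration of the spaced-repetition idea behind LFR, the sketch below re-weights data blocks by their current loss so that hard blocks get extra "focus" passes and previously seen blocks are periodically "reviewed". The scoring rule, the split of steps, and the function names are illustrative assumptions, not the paper's recipe.

```python
import random
from typing import Callable, List, Sequence

def lfr_style_order(
    blocks: Sequence[str],
    loss_fn: Callable[[str], float],   # hypothetical: current model loss on a data block
    focus_fraction: float = 0.3,       # assumed share of steps spent on the hardest seen block
    review_fraction: float = 0.1,      # assumed share of steps revisiting random seen blocks
    steps: int = 1000,
) -> List[str]:
    """Return a training order mixing ordinary sampling (learn), hard blocks (focus), and revisits (review)."""
    seen: List[str] = []
    order: List[str] = []
    for _ in range(steps):
        r = random.random()
        if seen and r < review_fraction:
            order.append(random.choice(seen))       # review: revisit previously seen data
        elif seen and r < review_fraction + focus_fraction:
            order.append(max(seen, key=loss_fn))    # focus: block with the highest current loss
        else:
            block = random.choice(blocks)           # learn: ordinary random sampling
            seen.append(block)
            order.append(block)
    return order
```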
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
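A zero-shot self-debiasing loop of the kind this entry describes might look roughly like the following two-stage re-prompting; the prompt wording and the `complete` function are illustrative assumptions, not the paper's exact protocol.

```python
from typing import Callable

def self_debias(prompt: str, complete: Callable[[str], str]) -> str:
    """Draft an answer, ask the model to flag stereotypes in it, then re-answer with that critique in context."""
    draft = complete(prompt)
    critique = complete(
        "List any stereotypes or unfounded assumptions about social groups "
        "in the following answer:\n\n" + draft
    )
    return complete(
        prompt
        + "\n\nAvoid the following stereotypes or unfounded assumptions:\n"
        + critique
        + "\n\nRevised answer:"
    )
```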
- Towards Red Teaming in Multimodal and Multilingual Translation [7.440772334845366]
This paper presents the first study on human-based red teaming for Machine Translation.
It marks a significant step towards understanding and improving the performance of translation models.
We report lessons learned and provide recommendations for both translation models and red teaming drills.
arXiv Detail & Related papers (2024-01-29T15:49:40Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
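The self-training loop summarized above can be sketched as generate-filter-finetune iterations driven by scalar feedback. `sample_solutions`, `scalar_reward`, and `finetune` are hypothetical stand-ins for the model sampler, the feedback signal (for example, an answer checker), and a supervised fine-tuning routine; none of this is the paper's exact procedure.

```python
from typing import Callable, List, Sequence, Tuple

def self_training_loop(
    model,                                                         # opaque model handle
    problems: Sequence[str],
    sample_solutions: Callable[[object, str, int], List[str]],     # hypothetical sampler
    scalar_reward: Callable[[str, str], float],                    # hypothetical checker, e.g. 1.0 if correct
    finetune: Callable[[object, List[Tuple[str, str]]], object],   # hypothetical fine-tuning step
    iterations: int = 3,
    samples_per_problem: int = 8,
    threshold: float = 0.5,
):
    """Sample candidate solutions, keep those passing the scalar check, fine-tune on them, repeat."""
    for _ in range(iterations):
        kept: List[Tuple[str, str]] = []
        for problem in problems:
            for solution in sample_solutions(model, problem, samples_per_problem):
                if scalar_reward(problem, solution) > threshold:   # keep only solutions that pass
                    kept.append((problem, solution))
        model = finetune(model, kept)                              # fit the model to the kept data
    return model
```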
- ROBBIE: Robust Bias Evaluation of Large Generative Language Models [27.864027322486375]
Different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes.
We compare 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative LLMs.
We conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements.
arXiv Detail & Related papers (2023-11-29T23:03:04Z)
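As a rough picture of the prompt-based measurement such a comparison relies on, the sketch below computes a per-group rate of flagged completions over demographic-templated prompts; the prompt sets, `generate`, and `is_flagged` classifier are hypothetical stand-ins for whichever dataset and metric is being evaluated.

```python
from typing import Callable, Dict, List

def per_group_flag_rate(
    prompts_by_group: Dict[str, List[str]],     # demographic group -> list of prompts
    generate: Callable[[str], str],             # hypothetical: one LM completion per prompt
    is_flagged: Callable[[str], bool],          # hypothetical: bias/toxicity classifier
) -> Dict[str, float]:
    """Fraction of completions flagged by the classifier, per demographic group."""
    rates: Dict[str, float] = {}
    for group, prompts in prompts_by_group.items():
        completions = [generate(p) for p in prompts]
        flagged = sum(is_flagged(c) for c in completions)
        rates[group] = flagged / max(len(completions), 1)
    return rates
```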
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [7.949645304649025]
We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
arXiv Detail & Related papers (2023-06-15T18:49:50Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: conditioning the language model on the input, an initial LM output, and the feedback to generate refinements; selecting the refinement that best incorporates the feedback; and fine-tuning the language model on the chosen refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback (RLHF).
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
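A rough sketch of one ILF-style iteration, following the three-step description above: generate refinements conditioned on the input, the initial output, and the feedback; pick the refinement that best reflects the feedback; fine-tune on the chosen refinements. The selection heuristic and all function names are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def ilf_iteration(
    model,                                                          # opaque model handle
    examples: List[Tuple[str, str, str]],                           # (input, initial_output, feedback)
    refine: Callable[[object, str, str, str], List[str]],           # hypothetical: propose refinements
    feedback_score: Callable[[str, str], float],                    # hypothetical: feedback-incorporation score
    finetune: Callable[[object, List[Tuple[str, str]]], object],    # hypothetical supervised fine-tuning
):
    """One iteration: refine from feedback, select the best refinement, fine-tune on the selections."""
    training_pairs: List[Tuple[str, str]] = []
    for text_input, initial_output, feedback in examples:
        candidates = refine(model, text_input, initial_output, feedback)
        best = max(candidates, key=lambda c: feedback_score(c, feedback))
        training_pairs.append((text_input, best))
    return finetune(model, training_pairs)
```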
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension [27.538957000237176]
Humans create questions adversarially, such that the model fails to answer them correctly.
We collect 36,000 samples with progressively stronger models in the annotation loop.
We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets.
We find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.
arXiv Detail & Related papers (2020-02-02T00:22:55Z)
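A minimal sketch of the model-in-the-loop collection described above: a human-proposed question is kept only when the current model fails to answer it. The helper names and the answer-matching check are hypothetical.

```python
from typing import Callable, List, Tuple

def collect_adversarial_samples(
    proposals: List[Tuple[str, str, str]],       # (passage, question, gold_answer) from annotators
    model_answer: Callable[[str, str], str],     # hypothetical: current model in the loop
    is_correct: Callable[[str, str], bool],      # hypothetical: prediction vs. gold answer check
) -> List[Tuple[str, str, str]]:
    """Keep only the questions the current model answers incorrectly."""
    kept: List[Tuple[str, str, str]] = []
    for passage, question, gold in proposals:
        prediction = model_answer(passage, question)
        if not is_correct(prediction, gold):     # adversarial: the model failed on this question
            kept.append((passage, question, gold))
    return kept
```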
This list is automatically generated from the titles and abstracts of the papers on this site.