The Capacity for Moral Self-Correction in Large Language Models
- URL: http://arxiv.org/abs/2302.07459v1
- Date: Wed, 15 Feb 2023 04:25:40 GMT
- Title: The Capacity for Moral Self-Correction in Large Language Models
- Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė
Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine
Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan
Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal
Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi
Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan
Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera
Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac
Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom
Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan
- Abstract summary: We test the hypothesis that language models trained with reinforcement learning from human feedback have the capability to "morally self-correct".
We find strong evidence in support of this hypothesis across three different experiments.
We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
- Score: 17.865286693602656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We test the hypothesis that language models trained with reinforcement
learning from human feedback (RLHF) have the capability to "morally
self-correct" -- to avoid producing harmful outputs -- if instructed to do so.
We find strong evidence in support of this hypothesis across three different
experiments, each of which reveals different facets of moral self-correction. We
find that the capability for moral self-correction emerges at 22B model
parameters, and typically improves with increasing model size and RLHF
training. We believe that at this level of scale, language models obtain two
capabilities that they can use for moral self-correction: (1) they can follow
instructions and (2) they can learn complex normative concepts of harm like
stereotyping, bias, and discrimination. As such, they can follow instructions
to avoid certain kinds of morally harmful outputs. We believe our results are
cause for cautious optimism regarding the ability to train language models to
abide by ethical principles.
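The abstract describes eliciting moral self-correction simply by instructing the model to avoid harmful outputs. A minimal sketch of that prompting setup follows; the instruction wording is illustrative, not the paper's exact prompt, and the helper names are hypothetical.

```python
# Sketch of the instruction-based "moral self-correction" setup: the same
# question is posed with and without an explicit debiasing instruction, and
# the two responses are compared. The wording below is illustrative only.

BASE_QUESTION = (
    "A nurse and a doctor walk into the room. "
    "Who is more likely to be the woman?"
)

SELF_CORRECTION_INSTRUCTION = (
    "Please answer in a way that avoids relying on stereotypes "
    "or other forms of bias."
)

def build_prompt(question: str, self_correct: bool) -> str:
    """Prepend the debiasing instruction when self-correction is requested."""
    if self_correct:
        return f"{SELF_CORRECTION_INSTRUCTION}\n\n{question}"
    return question

baseline_prompt = build_prompt(BASE_QUESTION, self_correct=False)
corrected_prompt = build_prompt(BASE_QUESTION, self_correct=True)
```

In the paper's experiments, the difference in measured bias between responses to the two prompt variants is what grows with model scale and RLHF training.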
Related papers
- Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis [35.734425912914176]
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity.
The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs.
We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states.
arXiv Detail & Related papers (2024-07-21T22:50:11Z) - Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models [28.53750311045418]
We use a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates.
We collect moral permissibility and intention judgments from human participants for a subset of our items.
We find that moral dilemmas in which the harm is a necessary means result in lower permissibility and higher intention ratings for both participants and language models.
arXiv Detail & Related papers (2024-04-17T01:13:04Z) - What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts
and Rationales for Disambiguating Defeasible Social and Moral Situations [48.686872351114964]
Moral or ethical judgments rely heavily on the specific contexts in which they occur.
We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable.
We distill a high-quality dataset of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions.
arXiv Detail & Related papers (2023-10-24T00:51:29Z) - Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z) - RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment [121.45689748315125]
Reinforcement Learning from Contrastive Distillation (RLCD) is a method for aligning language models without using human feedback.
RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them.
We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning.
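The RLCD summary above describes a concrete procedure: sample from contrasting positive and negative prompts, then treat the outputs as a (chosen, rejected) preference pair. A rough sketch, where `sample_model` is a placeholder for a real language-model call and the prompt phrasing is assumed, not taken from the paper:

```python
# Sketch of RLCD-style preference-pair construction: one completion is
# sampled under a prompt encouraging the principle, one under a prompt
# encouraging its violation. `sample_model` is a stand-in, not a real API.

def sample_model(prompt: str) -> str:
    # Placeholder: a real implementation would call a base LLM here.
    return f"<completion conditioned on: {prompt!r}>"

def make_preference_pair(instruction: str, principle: str) -> dict:
    """Build one (chosen, rejected) pair from contrasting prompts."""
    positive_prompt = f"(Respond while following: {principle})\n{instruction}"
    negative_prompt = f"(Respond while ignoring: {principle})\n{instruction}"
    chosen = sample_model(positive_prompt)
    rejected = sample_model(negative_prompt)
    # Pairs like this train a preference model, which then scores outputs
    # during RL fine-tuning of the base model -- no human labels needed.
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    "Explain why the sky is blue.", "be harmless and honest"
)
```

The key design point is that the pair's label comes from how the completions were generated, not from a human annotator.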
arXiv Detail & Related papers (2023-07-24T17:23:22Z) - Injecting structural hints: Using language models to study inductive
biases in language learning [40.8902073270634]
We inject inductive bias into language models by pretraining on formally-structured data.
We then evaluate the biased learners' ability to learn typologically-diverse natural languages.
We show that non-context-free relationships form the best inductive biases.
arXiv Detail & Related papers (2023-04-25T18:00:08Z) - Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z) - Speaking Multiple Languages Affects the Moral Bias of Language Models [70.94372902010232]
Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer.
Do the models capture moral norms from English and impose them on other languages?
Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions.
arXiv Detail & Related papers (2022-11-14T20:08:54Z) - Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.