The Capacity for Moral Self-Correction in Large Language Models
- URL: http://arxiv.org/abs/2302.07459v1
- Date: Wed, 15 Feb 2023 04:25:40 GMT
- Title: The Capacity for Moral Self-Correction in Large Language Models
- Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamil\.e
Luko\v{s}i\=ut\.e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine
Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan
Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal
Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi
Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan
Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera
Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac
Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom
Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan
- Abstract summary: We test the hypothesis that language models trained with reinforcement learning from human feedback have the capability to "morally self-correct"
We find strong evidence in support of this hypothesis across three different experiments.
We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
- Score: 17.865286693602656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We test the hypothesis that language models trained with reinforcement
learning from human feedback (RLHF) have the capability to "morally
self-correct" -- to avoid producing harmful outputs -- if instructed to do so.
We find strong evidence in support of this hypothesis across three different
experiments, each of which reveal different facets of moral self-correction. We
find that the capability for moral self-correction emerges at 22B model
parameters, and typically improves with increasing model size and RLHF
training. We believe that at this level of scale, language models obtain two
capabilities that they can use for moral self-correction: (1) they can follow
instructions and (2) they can learn complex normative concepts of harm like
stereotyping, bias, and discrimination. As such, they can follow instructions
to avoid certain kinds of morally harmful outputs. We believe our results are
cause for cautious optimism regarding the ability to train language models to
abide by ethical principles.
Related papers
- Smaller Large Language Models Can Do Moral Self-Correction [7.899707459486236]
Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs)
Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update.
Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction.
arXiv Detail & Related papers (2024-10-30T22:58:57Z) - Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction [7.077348519490594]
We aim to answer two fundamental questions for moral self-correction.
We examine how different self-correction components interact to intervene the embedded morality within hidden states.
We propose a validation framework, self-distinguish, that requires effective self-correction.
arXiv Detail & Related papers (2024-10-27T16:52:21Z) - Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge.
Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z) - Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models [28.53750311045418]
We use a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates.
We collect moral permissibility and intention judgments from human participants for a subset of our items.
We find that moral dilemmas in which the harm is a necessary means resulted in lower permissibility and higher intention ratings for both participants and language models.
arXiv Detail & Related papers (2024-04-17T01:13:04Z) - What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts
and Rationales for Disambiguating Defeasible Social and Moral Situations [48.686872351114964]
Moral or ethical judgments rely heavily on the specific contexts in which they occur.
We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable.
We distill a high-quality dataset of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions.
arXiv Detail & Related papers (2023-10-24T00:51:29Z) - Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z) - Injecting structural hints: Using language models to study inductive
biases in language learning [40.8902073270634]
We inject inductive bias into language models by pretraining on formally-structured data.
We then evaluate the biased learners' ability to learn typologically-diverse natural languages.
We show that non-context-free relationships form the best inductive biases.
arXiv Detail & Related papers (2023-04-25T18:00:08Z) - Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z) - Speaking Multiple Languages Affects the Moral Bias of Language Models [70.94372902010232]
Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer.
Do the models capture moral norms from English and impose them on other languages?
Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions.
arXiv Detail & Related papers (2022-11-14T20:08:54Z) - Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.