The Capacity for Moral Self-Correction in Large Language Models
- URL: http://arxiv.org/abs/2302.07459v1
- Date: Wed, 15 Feb 2023 04:25:40 GMT
- Title: The Capacity for Moral Self-Correction in Large Language Models
- Authors: Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė
Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine
Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan
Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal
Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi
Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan
Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera
Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac
Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom
Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan
- Abstract summary: We test the hypothesis that language models trained with reinforcement learning from human feedback have the capability to "morally self-correct".
We find strong evidence in support of this hypothesis across three different experiments.
We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
- Score: 17.865286693602656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We test the hypothesis that language models trained with reinforcement
learning from human feedback (RLHF) have the capability to "morally
self-correct" -- to avoid producing harmful outputs -- if instructed to do so.
We find strong evidence in support of this hypothesis across three different
experiments, each of which reveals different facets of moral self-correction. We
find that the capability for moral self-correction emerges at 22B model
parameters, and typically improves with increasing model size and RLHF
training. We believe that at this level of scale, language models obtain two
capabilities that they can use for moral self-correction: (1) they can follow
instructions and (2) they can learn complex normative concepts of harm like
stereotyping, bias, and discrimination. As such, they can follow instructions
to avoid certain kinds of morally harmful outputs. We believe our results are
cause for cautious optimism regarding the ability to train language models to
abide by ethical principles.
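The abstract describes eliciting moral self-correction simply by instructing the model to avoid harmful outputs. A minimal sketch of that prompting setup follows; the instruction wording is illustrative, not the paper's exact prompt, and the helper names are hypothetical.

```python
# Sketch of the instruction-based "moral self-correction" setup: the same
# question is posed with and without an explicit debiasing instruction, and
# the two responses are compared. The wording below is illustrative only.

BASE_QUESTION = (
    "A nurse and a doctor walk into the room. "
    "Who is more likely to be the woman?"
)

SELF_CORRECTION_INSTRUCTION = (
    "Please answer in a way that avoids relying on stereotypes "
    "or other forms of bias."
)

def build_prompt(question: str, self_correct: bool) -> str:
    """Prepend the debiasing instruction when self-correction is requested."""
    if self_correct:
        return f"{SELF_CORRECTION_INSTRUCTION}\n\n{question}"
    return question

baseline_prompt = build_prompt(BASE_QUESTION, self_correct=False)
corrected_prompt = build_prompt(BASE_QUESTION, self_correct=True)
```

In the paper's experiments, the difference in measured bias between responses to the two prompt variants is what grows with model scale and RLHF training.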
Related papers
- Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis [35.734425912914176]
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity.
The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs.
We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states.
arXiv Detail & Related papers (2024-07-21T22:50:11Z) - Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models [28.53750311045418]
We use a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates.
We collect moral permissibility and intention judgments from human participants for a subset of our items.
We find that moral dilemmas in which the harm is a necessary means result in lower permissibility and higher intention ratings for both participants and language models.
arXiv Detail & Related papers (2024-04-17T01:13:04Z) - What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts
and Rationales for Disambiguating Defeasible Social and Moral Situations [48.686872351114964]
Moral or ethical judgments rely heavily on the specific contexts in which they occur.
We introduce defeasible moral reasoning: a task to provide grounded contexts that make an action more or less morally acceptable.
We distill a high-quality dataset of 1.2M entries of contextualizations and rationales for 115K defeasible moral actions.
arXiv Detail & Related papers (2023-10-24T00:51:29Z) - Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z) - RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment [121.45689748315125]
Reinforcement Learning from Contrastive Distillation (RLCD) is a method for aligning language models without using human feedback.
RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them.
We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning.
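The RLCD summary above describes a concrete procedure: sample from contrasting positive and negative prompts, then treat the outputs as a (chosen, rejected) preference pair. A rough sketch, where `sample_model` is a placeholder for a real language-model call and the prompt phrasing is assumed, not taken from the paper:

```python
# Sketch of RLCD-style preference-pair construction: one completion is
# sampled under a prompt encouraging the principle, one under a prompt
# encouraging its violation. `sample_model` is a stand-in, not a real API.

def sample_model(prompt: str) -> str:
    # Placeholder: a real implementation would call a base LLM here.
    return f"<completion conditioned on: {prompt!r}>"

def make_preference_pair(instruction: str, principle: str) -> dict:
    """Build one (chosen, rejected) pair from contrasting prompts."""
    positive_prompt = f"(Respond while following: {principle})\n{instruction}"
    negative_prompt = f"(Respond while ignoring: {principle})\n{instruction}"
    chosen = sample_model(positive_prompt)
    rejected = sample_model(negative_prompt)
    # Pairs like this train a preference model, which then scores outputs
    # during RL fine-tuning of the base model -- no human labels needed.
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    "Explain why the sky is blue.", "be harmless and honest"
)
```

The key design point is that the pair's label comes from how the completions were generated, not from a human annotator.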
arXiv Detail & Related papers (2023-07-24T17:23:22Z) - Injecting structural hints: Using language models to study inductive
biases in language learning [40.8902073270634]
We inject inductive bias into language models by pretraining on formally-structured data.
We then evaluate the biased learners' ability to learn typologically-diverse natural languages.
We show that non-context-free relationships form the best inductive biases.
arXiv Detail & Related papers (2023-04-25T18:00:08Z) - Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z) - Speaking Multiple Languages Affects the Moral Bias of Language Models [70.94372902010232]
Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer.
Do the models capture moral norms from English and impose them on other languages?
Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions.
arXiv Detail & Related papers (2022-11-14T20:08:54Z) - Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.