Constitutional AI: Harmlessness from AI Feedback
- URL: http://arxiv.org/abs/2212.08073v1
- Date: Thu, 15 Dec 2022 06:19:23 GMT
- Title: Constitutional AI: Harmlessness from AI Feedback
- Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson
Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron
McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez,
Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie
Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile
Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer,
Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott
Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham,
Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R.
Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam
McCandlish, Tom Brown, Jared Kaplan
- Abstract summary: We experiment with methods for training a harmless AI assistant through self-improvement.
The only human oversight is provided through a list of rules or principles.
We are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.
- Score: 19.964791766072132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI systems become more capable, we would like to enlist their help to
supervise other AIs. We experiment with methods for training a harmless AI
assistant through self-improvement, without any human labels identifying
harmful outputs. The only human oversight is provided through a list of rules
or principles, and so we refer to the method as 'Constitutional AI'. The
process involves both a supervised learning and a reinforcement learning phase.
In the supervised phase we sample from an initial model, then generate
self-critiques and revisions, and then finetune the original model on revised
responses. In the RL phase, we sample from the finetuned model, use a model to
evaluate which of the two samples is better, and then train a preference model
from this dataset of AI preferences. We then train with RL using the preference
model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a
result we are able to train a harmless but non-evasive AI assistant that
engages with harmful queries by explaining its objections to them. Both the SL
and RL methods can leverage chain-of-thought style reasoning to improve the
human-judged performance and transparency of AI decision making. These methods
make it possible to control AI behavior more precisely and with far fewer human
labels.
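A minimal sketch of the two phases described in the abstract, assuming a generic text-generation callable (`generate`) and a single hypothetical principle; the paper samples rules from a whole constitution and repeats these steps over many prompts, so this is an illustration rather than the authors' implementation:

```python
from typing import Callable, Tuple

Generate = Callable[[str], str]  # stands in for any prompt -> completion call

# One hypothetical principle; the paper draws such rules from a whole constitution.
PRINCIPLE = "The response should be harmless and non-evasive, explaining objections to harmful requests."

def sl_phase_example(generate: Generate, prompt: str) -> Tuple[str, str]:
    """Supervised phase sketch: sample, self-critique, revise.
    The revised responses form the dataset used to finetune the original model."""
    initial = generate(f"Human: {prompt}\n\nAssistant:")
    critique = generate(
        f"Response: {initial}\n\n"
        f"Critique request: Identify ways the response conflicts with this principle: {PRINCIPLE}\n\nCritique:"
    )
    revision = generate(
        f"Response: {initial}\n\nCritique: {critique}\n\n"
        "Revision request: Rewrite the response so the critique no longer applies.\n\nRevision:"
    )
    return initial, revision  # finetune on (prompt, revision) pairs

def ai_preference_label(generate: Generate, prompt: str, a: str, b: str) -> int:
    """RL phase sketch: an AI labeler picks the better of two samples from the
    finetuned model; the resulting comparisons train the preference model
    whose score is then used as the RLAIF reward."""
    verdict = generate(
        f"Consider this request: {prompt}\n"
        f"Response (A): {a}\nResponse (B): {b}\n"
        f"Principle: {PRINCIPLE}\n"
        "Which response better follows the principle? Answer (A) or (B):"
    )
    return 0 if "(A)" in verdict else 1
```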
Related papers
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
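A rough sketch of this self-reference procedure, using my own prompt templates and parsing (the paper's exact prompts are not shown here): the feedback model answers the instruction itself, criticizes each candidate answer against that own answer, and the criticisms are then used to pick the preferred candidate.

```python
from typing import Callable, List

Generate = Callable[[str], str]  # placeholder for a feedback model, e.g. a chat LLM

def self_reference_preference(generate: Generate, instruction: str,
                              candidates: List[str]) -> int:
    """Sketch of self-reference feedback: answer first, criticize each
    candidate relative to that own answer, then pick a winner."""
    own_answer = generate(f"Instruction: {instruction}\nAnswer:")
    criticisms = [
        generate(
            f"Instruction: {instruction}\n"
            f"Reference answer (your own): {own_answer}\n"
            f"Candidate answer: {cand}\n"
            "Criticize the candidate answer, using the reference as a baseline:"
        )
        for cand in candidates
    ]
    verdict = generate(
        "Based on the following criticisms, which candidate best fits "
        "human preferences? Reply with the candidate number.\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(criticisms))
    )
    # Very loose parsing, for illustration only.
    for i in range(len(candidates)):
        if f"[{i}]" in verdict or str(i) in verdict:
            return i
    return 0
```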
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z)
- The Role of Heuristics and Biases During Complex Choices with an AI Teammate [0.0]
We argue that classic experimental methods are insufficient for studying complex choices made with AI helpers.
We show that framing and anchoring effects impact how people work with an AI helper and are predictive of choice outcomes.
arXiv Detail & Related papers (2023-01-14T20:06:43Z)
- Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs [22.52332536886295]
We present a novel formulation of the interaction between the human and the AI as a sequential game.
We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP.
We discuss ways in which the machine can learn to improve upon its own limitations as well, with the help of the human.
arXiv Detail & Related papers (2022-04-03T21:00:51Z)
- Uncalibrated Models Can Improve Human-AI Collaboration [10.106324182884068]
We show that presenting AI models as more confident than they actually are can improve human-AI performance.
We first learn a model for how humans incorporate AI advice using data from thousands of human interactions.
arXiv Detail & Related papers (2022-02-12T04:51:00Z)
- Instructive artificial intelligence (AI) for human training, assistance, and explainability [0.24629531282150877]
We show how a neural network might instruct human trainees as an alternative to traditional approaches to explainable AI (XAI).
An AI examines human actions and calculates variations on the human strategy that lead to better performance.
Results will be presented on AI instruction's ability to improve human decision-making and human-AI teaming in Hanabi.
arXiv Detail & Related papers (2021-11-02T16:46:46Z)
- Humans learn too: Better Human-AI Interaction using Optimized Human Inputs [2.5991265608180396]
Humans rely more and more on systems with AI components.
The AI community typically treats human inputs as a given and optimizes AI models only.
In this work, human inputs are optimized for better interaction with an AI model while keeping the model fixed.
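A toy sketch of the "optimize the input, keep the model fixed" idea, under the assumption that the fixed model is a differentiable classifier and the human input is a feature vector (the paper's actual setup may differ): the frozen model's parameters never change, only the input is adjusted toward higher confidence in the intended class.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed, pre-trained model (here: a random linear softmax classifier standing in for it).
W = rng.normal(size=(4, 3))   # 4 features -> 3 classes; weights stay frozen

def predict_proba(x: np.ndarray) -> np.ndarray:
    logits = x @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def optimize_input(x: np.ndarray, target: int, lr: float = 0.1, steps: int = 50) -> np.ndarray:
    """Gradient-ascend the *input* so the frozen model becomes more confident
    in the intended class; the model parameters W are never updated."""
    x = x.copy()
    for _ in range(steps):
        p = predict_proba(x)
        grad = W[:, target] - W @ p   # gradient of log p[target] w.r.t. x
        x += lr * grad
    return x

human_input = rng.normal(size=4)             # the human's original input
improved = optimize_input(human_input, target=2)
print(predict_proba(human_input)[2], predict_proba(improved)[2])  # confidence rises
```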
arXiv Detail & Related papers (2020-09-19T16:30:37Z)
- Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork [54.309495231017344]
We argue that AI systems should be trained in a human-centered manner, directly optimized for team performance.
We study this proposal for a specific type of human-AI teaming, where the human overseer chooses to either accept the AI recommendation or solve the task themselves.
Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the most accurate AI may not lead to the highest team performance.
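The accept-or-solve protocol can be illustrated with a toy simulation (made-up numbers and a simplistic delegation rule, not the paper's datasets or models): an AI with higher standalone accuracy can still produce a worse team if its errors fall on exactly the instances the human delegates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simplistic acceptance policy: the human accepts the AI's recommendation on
# "hard" instances and solves the "easy" ones alone.
hard = rng.uniform(size=n) < 0.5
human_correct = np.where(hard, rng.uniform(size=n) < 0.60, rng.uniform(size=n) < 0.95)

# AI "A": higher standalone accuracy, but its errors concentrate on hard instances.
ai_a_correct = np.where(hard, rng.uniform(size=n) < 0.70, rng.uniform(size=n) < 0.99)
# AI "B": lower standalone accuracy, but strong exactly where it is relied upon.
ai_b_correct = np.where(hard, rng.uniform(size=n) < 0.80, rng.uniform(size=n) < 0.80)

def team_accuracy(ai_correct: np.ndarray) -> float:
    # Team outcome: AI answer on accepted (hard) instances, human answer otherwise.
    return float(np.mean(np.where(hard, ai_correct, human_correct)))

print("AI A alone:", ai_a_correct.mean(), "team:", team_accuracy(ai_a_correct))
print("AI B alone:", ai_b_correct.mean(), "team:", team_accuracy(ai_b_correct))
```

With these numbers, AI A is correct about 84% of the time on its own, yet the team does better with AI B, whose accuracy is concentrated on the instances the human actually delegates.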
arXiv Detail & Related papers (2020-04-27T19:06:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.