Specific versus General Principles for Constitutional AI
- URL: http://arxiv.org/abs/2310.13798v1
- Date: Fri, 20 Oct 2023 20:12:45 GMT
- Title: Specific versus General Principles for Constitutional AI
- Authors: Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew
Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden
McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus,
Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen,
Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch,
Robin Larson, Shannon Yang, Shauna Kravec, Timothy Telleen-Lawton, Thomas I.
Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, S\"oren Mindermann,
Nicholas Joseph, Sam McCandlish, Jared Kaplan
- Abstract summary: Constitutional AI offers an alternative, replacing human feedback with feedback conditioned only on a list of written principles.
We find this approach effectively prevents the expression of such behaviors.
A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors.
- Score: 27.08490948333949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.
Related papers
- Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values [0.2511917198008257]
Grounded Constitutional AI (GCAI) is a unified framework for generating constitutions of principles.<n>We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI both personally, and for widespread use in governing AI behavior.
arXiv Detail & Related papers (2026-01-26T18:27:00Z) - Epistemic Constitutionalism Or: how to avoid coherence bias [0.0]
This paper argues for an explicit, contestable meta-norms that regulate how systems form and express beliefs.<n>I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content.<n>I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege.
arXiv Detail & Related papers (2026-01-16T07:36:30Z) - Developing a Grounded View of AI [26.688384331221343]
The paper examines the behavior of artificial intelligence from engineering points of view to clarify its nature and limits.<n>The paper proposes a methodology to make a sense of discrimination possible and practical to identify the distinctions of the behavior of AI models with three types of decisions.
arXiv Detail & Related papers (2025-11-18T00:39:52Z) - Moral Responsibility or Obedience: What Do We Want from AI? [0.0]
This paper examines recent safety testing incidents involving large language models (LLMs) that appeared to disobey shutdown commands or engage in ethically ambiguous or illicit behavior.<n>I argue that such behavior should not be interpreted as rogue or misaligned, but as early evidence of emerging ethical reasoning in agentic AI.<n>I call for a shift in AI safety evaluation: away from rigid obedience and toward frameworks that can assess ethical judgment in systems capable of navigating moral dilemmas.
arXiv Detail & Related papers (2025-07-03T16:53:01Z) - C3AI: Crafting and Evaluating Constitutions for Constitutional AI [4.393788620560099]
We introduce the C3AI framework, which serves two key functions: selecting and structuring principles to form effective constitutions before fine-tuning.
By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles.
Fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results.
arXiv Detail & Related papers (2025-02-21T10:26:42Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.
PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.
To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering.
We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z) - Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z) - The Reasonable Person Standard for AI [0.0]
The American legal system often uses the "Reasonable Person Standard"
This paper argues that the reasonable person standard provides useful guidelines for the type of behavior we should develop, probe, and stress-test in models.
arXiv Detail & Related papers (2024-06-07T06:35:54Z) - SoFA: Shielded On-the-fly Alignment via Priority Rule Following [90.32819418613407]
This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog.
We present PriorityDistill, a semi-automated approach for distilling priority following signals from simulations to ensure robust rule integration and adherence.
arXiv Detail & Related papers (2024-02-27T09:52:27Z) - Principle-Driven Self-Alignment of Language Models from Scratch with
Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z) - When to Make Exceptions: Exploring Language Models as Accounts of Human
Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z) - When Is It Acceptable to Break the Rules? Knowledge Representation of
Moral Judgement Based on Empirical Data [33.58705831230163]
One of the most remarkable things about the human moral mind is its flexibility.
We can make moral judgments about cases we have never seen before.
We can decide that pre-established rules should be broken.
Capturing this flexibility is one of the central challenges in developing AI systems that can interpret and produce human-like moral judgment.
arXiv Detail & Related papers (2022-01-19T17:58:42Z) - Expose Uncertainty, Instill Distrust, Avoid Explanations: Towards
Ethical Guidelines for AI [3.0534660670547864]
I argue that the best way to help humans using AI technology is to make them aware of the intrinsic limitations and problems of AI algorithms.
I suggest three ethical guidelines to be used in the presentation of results.
arXiv Detail & Related papers (2021-11-29T14:53:35Z) - How Should AI Interpret Rules? A Defense of Minimally Defeasible
Interpretive Argumentation [0.0]
Real-world rules are unavoidably rife with open-textured terms.
The ability to follow such rules, and to reason about them, is not nearly as clear-cut as it seems on first analysis.
I defend the following answer: Rule-following AI should act in accordance with the interpretation best supported by minimally defeasible interpretive arguments.
arXiv Detail & Related papers (2021-10-26T00:58:05Z) - Ethical-Advice Taker: Do Language Models Understand Natural Language
Interventions? [62.74872383104381]
We investigate the effectiveness of natural language interventions for reading-comprehension systems.
We propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior.
arXiv Detail & Related papers (2021-06-02T20:57:58Z) - Case Study: Deontological Ethics in NLP [119.53038547411062]
We study one ethical theory, namely deontological ethics, from the perspective of NLP.
In particular, we focus on the generalization principle and the respect for autonomy through informed consent.
We provide four case studies to demonstrate how these principles can be used with NLP systems.
arXiv Detail & Related papers (2020-10-09T16:04:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.