Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values
- URL: http://arxiv.org/abs/2601.18760v1
- Date: Mon, 26 Jan 2026 18:27:00 GMT
- Title: Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values
- Authors: Henry Bell, Lara Neubauer da Costa Schertel, Bochu Ding, Brandon Fain
- Abstract summary: Grounded Constitutional AI (GCAI) is a unified framework for generating constitutions of principles. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI, both personally and for widespread use in governing AI behavior.
- Score: 0.2511917198008257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A crucial consideration when developing and deploying Large Language Models (LLMs) is the human values to which these models are aligned. In the constitutional framework of alignment, models are aligned to a set of principles (the constitution) specified in natural language. However, it is unclear how to fairly determine this constitution with widespread stakeholder input. In this work we propose Grounded Constitutional AI (GCAI), a unified framework for generating constitutions of principles that are representative of both users' general expectations toward AI (general principles) and their interaction-time preferences (contextual principles). We extend the Inverse Constitutional AI (ICAI) approach to generate contextual principles from human preference annotation data by leveraging human-provided reasons for their preferences. We supplement these contextual principles with general principles surfaced from user statements of values regarding AI. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI, both personally and for widespread use in governing AI behavior. Additionally, participants consider the GCAI constitution to be more morally grounded, coherent, and pluralistic.
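The two-source pipeline the abstract describes can be sketched as follows. This is a hypothetical illustration, not the authors' published code: the function names, the template-based "distillation" stubs, and the data shapes are all invented here. In a real GCAI-style system, each distill step would be backed by an LLM; the stubs below only show how the two principle sources combine into one constitution.

```python
# Hypothetical sketch of a GCAI-style pipeline (names and templates invented;
# the paper does not publish this code). Contextual principles are distilled
# from preference annotations plus their free-text reasons; general principles
# come from standalone user value statements about AI.
from dataclasses import dataclass


@dataclass
class Annotation:
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator rejected
    reason: str    # human-provided reason for the preference


def distill_contextual(annotations):
    # Stub for an LLM step: turn each stated reason into a candidate principle.
    return [f"Prefer responses that {a.reason}." for a in annotations]


def distill_general(value_statements):
    # Stub for an LLM step: turn each value statement into a general principle.
    return [f"The AI should {v}." for v in value_statements]


def build_constitution(annotations, value_statements):
    # GCAI combines both sources into one constitution; deduplicate,
    # preserving order (dict.fromkeys keeps first occurrences).
    principles = distill_general(value_statements) + distill_contextual(annotations)
    return list(dict.fromkeys(principles))
```

A usage example: `build_constitution([Annotation("A", "B", "cite sources for factual claims")], ["respect user autonomy"])` yields a two-principle constitution, general principles first.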
Related papers
- Resource Rational Contractualism Should Guide AI Alignment [69.07915246220985]
Contractualist alignment proposes grounding decisions in agreements that diverse stakeholders would endorse. We propose Resource-Rationalism: a framework where AI systems approximate the agreements rational parties would form. An RRC-aligned agent would not only operate efficiently, but would also be equipped to dynamically adapt to and interpret the ever-changing human social world.
arXiv Detail & Related papers (2025-06-20T18:57:13Z)
- C3AI: Crafting and Evaluating Constitutions for Constitutional AI [4.393788620560099]
We introduce the C3AI framework, which serves two key functions: selecting and structuring principles to form effective constitutions before fine-tuning. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. Fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results.
arXiv Detail & Related papers (2025-02-21T10:26:42Z)
- SPRI: Aligning Large Language Models with Context-Situated Principles [53.07731637246485]
Situated-PRInciples (SPRI) is designed to automatically generate guiding principles in real time for each input query and to use them to align each response. We evaluate SPRI on three tasks and show that SPRI can derive principles in a complex domain-specific task that lead to performance on par with expert-crafted principles.
arXiv Detail & Related papers (2025-02-05T17:32:29Z)
- Decoding Human Preferences in Alignment: An Improved Approach to Inverse Constitutional AI [0.0]
We develop a rule-based framework for aligning Large Language Models (LLMs). We refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. Our results highlight the potential of these principles to foster more transparent and adaptable alignment methods.
arXiv Detail & Related papers (2025-01-28T17:59:56Z)
- The Fundamental Rights Impact Assessment (FRIA) in the AI Act: Roots, legal obligations and key elements for a model template [55.2480439325792]
This article aims to fill existing gaps in the theoretical and methodological elaboration of the Fundamental Rights Impact Assessment (FRIA). It outlines the main building blocks of a model template for the FRIA, which can serve as a blueprint for other national and international regulatory initiatives to ensure that AI is fully consistent with human rights.
arXiv Detail & Related papers (2024-11-07T11:55:55Z)
- Beyond Preferences in AI Alignment [15.878773061188516]
We characterize and challenge the preferentist approach to AI alignment.
We show how preferences fail to capture the thick semantic content of human values.
We argue that AI systems should be aligned with normative standards appropriate to their social roles.
arXiv Detail & Related papers (2024-08-30T03:14:20Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers using its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
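The three-step loop above can be sketched minimally. This is an illustrative stub, not the paper's implementation: the function names are invented, and the LLM calls for responding and critiquing are replaced with toy placeholders (a fixed reference string and a word-overlap score) purely to show the control flow.

```python
# Minimal sketch of a self-reference feedback loop (hypothetical; real
# systems would back each step with an LLM rather than these toy stubs).

def respond(instruction):
    # Step 1: the model produces its own reference answer (stubbed).
    return f"reference answer to: {instruction}"


def criticize(reference, candidate):
    # Step 2: critique a candidate relative to the model's own reference.
    # Toy stand-in for an LLM critique: score by word overlap.
    return len(set(reference.split()) & set(candidate.split()))


def pick_preferred(instruction, candidates):
    # Step 3: select the candidate whose critique scores best, i.e. the
    # answer judged closest to what the model itself would have said.
    reference = respond(instruction)
    return max(candidates, key=lambda c: criticize(reference, c))
```

The design point being illustrated is that the model's own response serves as the anchor for judging alternatives, rather than an external reward model.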
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- Towards Responsible AI in Banking: Addressing Bias for Fair Decision-Making [69.44075077934914]
"Responsible AI" emphasizes the critical nature of addressing biases within the development of a corporate culture.
This thesis is structured around three fundamental pillars: understanding bias, mitigating bias, and accounting for bias.
In line with open-source principles, we have released Bias On Demand and FairView as accessible Python packages.
arXiv Detail & Related papers (2024-01-13T14:07:09Z)
- Specific versus General Principles for Constitutional AI [27.08490948333949]
Constitutional AI offers an alternative, replacing human feedback with feedback conditioned only on a list of written principles.
We find this approach effectively prevents the expression of potentially harmful behaviors.
A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors.
arXiv Detail & Related papers (2023-10-20T20:12:45Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [52.992875653864076]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards [0.0]
Legal standards facilitate robust communication of inherently vague and underspecified goals.
Our research is an initial step toward a framework for evaluating AI understanding of legal standards more broadly.
arXiv Detail & Related papers (2023-01-24T16:03:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content listed here (including all information) and is not responsible for any consequences arising from its use.