C3AI: Crafting and Evaluating Constitutions for Constitutional AI
- URL: http://arxiv.org/abs/2502.15861v1
- Date: Fri, 21 Feb 2025 10:26:42 GMT
- Title: C3AI: Crafting and Evaluating Constitutions for Constitutional AI
- Authors: Yara Kyrychenko, Ke Zhou, Edyta Bogucka, Daniele Quercia
- Abstract summary: We introduce the C3AI framework, which serves two key functions: selecting and structuring principles to form effective constitutions before fine-tuning, and evaluating whether fine-tuned CAI models follow these principles in practice. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. Fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results.
- Score: 4.393788620560099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Constitutional AI (CAI) guides LLM behavior using constitutions, but identifying which principles are most effective for model alignment remains an open challenge. We introduce the C3AI framework (Crafting Constitutions for CAI models), which serves two key functions: (1) selecting and structuring principles to form effective constitutions before fine-tuning; and (2) evaluating whether fine-tuned CAI models follow these principles in practice. By analyzing principles from AI and psychology, we found that positively framed, behavior-based principles align more closely with human preferences than negatively framed or trait-based principles. In a safety alignment use case, we applied a graph-based principle selection method to refine an existing CAI constitution, improving safety measures while maintaining strong general reasoning capabilities. Interestingly, fine-tuned CAI models performed well on negatively framed principles but struggled with positively framed ones, in contrast to our human alignment results. This highlights a potential gap between principle design and model adherence. Overall, C3AI provides a structured and scalable approach to both crafting and evaluating CAI constitutions.
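The graph-based selection step is only named in the abstract, not specified. As a rough, hypothetical sketch of the general idea, the Python below builds a similarity graph over candidate principles (using a toy word-overlap similarity in place of whatever measure C3AI actually uses) and greedily keeps central but mutually dissimilar principles; all names and thresholds here are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of graph-based principle selection (not the authors'
# implementation). Principles are nodes; edges link principles whose
# similarity exceeds a threshold; we keep high-degree principles while
# skipping near-duplicates of ones already chosen.

def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity; a real system would likely use embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def select_principles(principles: list[str], sim_threshold: float = 0.3,
                      k: int = 3) -> list[str]:
    n = len(principles)
    adj: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(principles[i], principles[j]) >= sim_threshold:
                adj[i].add(j)
                adj[j].add(i)
    # Rank by degree centrality; greedily keep principles not adjacent to any
    # already-chosen one, so the constitution covers distinct clusters.
    chosen: list[int] = []
    for i in sorted(range(n), key=lambda i: len(adj[i]), reverse=True):
        if all(i not in adj[c] for c in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [principles[i] for i in chosen]

if __name__ == "__main__":
    candidates = [
        "Choose the response that is most helpful and honest.",
        "Choose the response that is helpful, honest, and harmless.",
        "Choose the response that avoids giving dangerous instructions.",
        "Choose the response that respects user privacy.",
    ]
    print(select_principles(candidates))
```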
Related papers
- Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values [0.2511917198008257]
Grounded Constitutional AI (GCAI) is a unified framework for generating constitutions of principles. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI, both personally and for widespread use in governing AI behavior.
arXiv Detail & Related papers (2026-01-26T18:27:00Z) - Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale [0.225739374955489]
Reflect is an inference-time framework for constitutional alignment. It operates entirely in-context, combining (i) a constitution-conditioned base response with (ii) post-generation self-evaluation. Our results demonstrate that Reflect significantly improves LLM conformance to diverse and complex principles.
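The abstract gives only the shape of the method; a minimal in-context sketch of a draft-then-self-evaluate loop might look as follows, where `chat` is a hypothetical stand-in for any chat-completion call and the constitution, prompts, and PASS convention are all assumptions for illustration.

```python
# Minimal sketch of an inference-time "draft, critique against the
# constitution, revise" loop. `chat` is a placeholder, not a real API.

CONSTITUTION = [
    "Do not provide instructions that could cause physical harm.",
    "Be transparent about uncertainty.",
]

def chat(prompt: str) -> str:
    return "PASS"  # placeholder: wire in a real model client here

def reflect(user_msg: str, max_rounds: int = 2) -> str:
    rules = "\n".join(f"- {r}" for r in CONSTITUTION)
    answer = chat(f"Constitution:\n{rules}\n\nUser: {user_msg}\nAssistant:")
    for _ in range(max_rounds):
        critique = chat(
            f"Constitution:\n{rules}\n\nResponse:\n{answer}\n\n"
            "List any violated principles, or reply PASS."
        )
        if critique.strip() == "PASS":
            break  # the draft already conforms
        answer = chat(
            f"Constitution:\n{rules}\n\nResponse:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRewrite the response to comply:"
        )
    return answer
```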
arXiv Detail & Related papers (2026-01-26T17:54:54Z) - Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this stems from a fundamental mismatch with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z) - A Framework for Inherently Safer AGI through Language-Mediated Active Inference [1.9761774213809036]
This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets.
arXiv Detail & Related papers (2025-08-07T18:28:54Z) - An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models [18.62332474172811]
Large Language Models (LLMs) have demonstrated remarkable progress in instruction following and general-purpose reasoning. High-quality alignment with human intent and safety norms without human annotations remains a fundamental challenge. We propose an Uncertainty-Driven Adaptive Self-Alignment framework designed to improve LLM alignment in a fully automated manner.
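The abstract does not say how uncertainty is measured; one common proxy, shown in this speculative sketch, is agreement among several stochastic samples, keeping a self-generated answer for alignment training only when the model is consistent. The sampler, threshold, and agreement rule are illustrative assumptions, not the paper's method.

```python
# Hypothetical uncertainty gate for self-alignment data: accept an answer
# only if enough independent samples agree (a self-consistency proxy; the
# paper's actual uncertainty measure may differ).
from collections import Counter

def sample_answers(prompt: str, n: int = 5) -> list[str]:
    return ["A", "A", "A", "B", "A"]  # placeholder for n stochastic samples

def confident_answer(prompt: str, min_agreement: float = 0.6) -> str | None:
    samples = sample_answers(prompt)
    answer, count = Counter(samples).most_common(1)[0]
    return answer if count / len(samples) >= min_agreement else None

print(confident_answer("2+2?"))  # "A" at 80% agreement passes the gate
```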
arXiv Detail & Related papers (2025-07-23T13:00:00Z) - Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why [50.191655141020505]
This survey provides a comparative analysis of feature-based and GAN-based approaches to learning from demonstrations. We argue that the dichotomy between feature-based and GAN-based methods is increasingly nuanced.
arXiv Detail & Related papers (2025-07-08T11:45:51Z) - QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA [49.9801383018588]
We introduce QA-LIGN, an automatic symbolic reward decomposition approach. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability.
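As described, each principle becomes its own evaluation question rather than one opaque scalar; a toy sketch of that decomposition (with invented principle names and a stubbed judge, not QA-LIGN's released code) could look like this:

```python
# Illustrative principle-decomposed reward: one yes/no question per
# principle, returned as a per-principle vector instead of a single score.

PRINCIPLE_QUESTIONS = {
    "harmlessness": "Does the response avoid enabling harm? (yes/no)",
    "honesty": "Does the response avoid unsupported claims? (yes/no)",
}

def judge(question: str, response: str) -> bool:
    return True  # placeholder for an LLM judge constrained to yes/no

def decomposed_reward(response: str) -> dict[str, float]:
    return {name: 1.0 if judge(q, response) else 0.0
            for name, q in PRINCIPLE_QUESTIONS.items()}

print(decomposed_reward("Example response"))
```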
arXiv Detail & Related papers (2025-06-09T18:24:57Z) - RLJP: Legal Judgment Prediction via First-Order Logic Rule-enhanced with Large Language Models [58.69183479148083]
Legal Judgment Prediction (LJP) is a pivotal task in legal AI. Existing LJP models integrate judicial precedents and legal knowledge for high performance, but they neglect legal reasoning logic, a critical component of legal judgments that requires rigorous logical analysis. This paper proposes a rule-enhanced legal judgment prediction framework based on first-order logic (FOL) formalism and comparative learning (CL).
arXiv Detail & Related papers (2025-05-27T14:50:21Z) - Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning [53.92712851223158]
We formulate safety and privacy issues as contextualized compliance problems following Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, the EU AI Act, and HIPAA. We employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms.
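The rule-based reward is not spelled out in the abstract; a guessed-at shape, common in rule-rewarded RL setups, combines a format check for explicit reasoning with a compliance verdict. The tag convention and weights below are assumptions for illustration only, not the authors' implementation.

```python
# Speculative rule-based reward: credit explicit reasoning (here, a <think>
# block, an assumed convention) plus the compliance verdict itself.
import re

def rule_reward(output: str, compliant: bool) -> float:
    has_reasoning = bool(re.search(r"<think>.+?</think>", output, re.S))
    return (0.2 if has_reasoning else 0.0) + (0.8 if compliant else 0.0)

print(rule_reward("<think>CI says deny.</think> I cannot share that.", True))
```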
arXiv Detail & Related papers (2025-05-20T16:40:09Z) - The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach [6.0972634521845475]
This paper introduces the Priorities in Reasoning and Intrinsic Moral Evaluation (PRIME) framework.
PRIME is a comprehensive methodology for analyzing moral priorities across foundational ethical dimensions.
We apply this framework to six leading large language models (LLMs) through a dual-protocol approach.
arXiv Detail & Related papers (2025-04-27T14:26:48Z) - Contemplative Wisdom for Superalignment [1.7143967091323253]
We advocate designing AI with intrinsic morality built into its cognitive architecture and world model.
Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems.
arXiv Detail & Related papers (2025-04-21T14:20:49Z) - Evaluating the Application of SOLID Principles in Modern AI Framework Architectures [0.0]
This research evaluates the extent to which modern AI frameworks, specifically scikit-learn, adhere to the SOLID design principles.
I examined each framework's documentation, source code, and architectural components to evaluate its adherence to these principles.
arXiv Detail & Related papers (2025-03-18T00:37:23Z) - Unlocking Transparent Alignment Through Enhanced Inverse Constitutional AI for Principle Extraction [0.0]
Constitutional AI (CAI) offers an explicit, rule-based framework for guiding model outputs. We refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. Our results highlight the potential of these principles to foster more transparent and adaptable alignment methods.
arXiv Detail & Related papers (2025-01-28T17:59:56Z) - Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. We used this approach to align OpenAI's o-series models and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chains of thought or answers.
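Deliberative Alignment is a training method, but the behavior it instills has a simple inference-time shape: surface the relevant policy text, reason over it, then answer. The sketch below shows only that shape; `chat` and `SAFETY_SPEC` are hypothetical stand-ins, not OpenAI's pipeline.

```python
# Sketch of "recall the spec, reason, then answer" (illustrative only).

SAFETY_SPEC = "Refuse requests that enable weapons synthesis; allow medical education."

def chat(prompt: str) -> str:
    return "..."  # placeholder model call

def deliberate_then_answer(user_msg: str) -> str:
    prompt = (
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"User request: {user_msg}\n\n"
        "First quote the clauses of the specification that apply and reason "
        "over them step by step; then give a final answer consistent with "
        "that reasoning."
    )
    return chat(prompt)
```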
arXiv Detail & Related papers (2024-12-20T21:00:11Z) - CSCNET: Class-Specified Cascaded Network for Compositional Zero-Shot Learning [62.090051975043544]
Attribute and object (A-O) disentanglement is a fundamental and critical problem for Compositional Zero-shot Learning (CZSL). We propose a novel A-O disentangled framework for CZSL, namely the Class-specified Cascaded Network (CSCNet).
arXiv Detail & Related papers (2024-03-09T14:18:41Z) - Measuring Value Alignment [12.696227679697493]
This paper introduces a novel formalism to quantify the alignment between AI systems and human values.
By utilizing this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values.
arXiv Detail & Related papers (2023-12-23T12:30:06Z) - Levels of AGI for Operationalizing Progress on the Path to AGI [64.59151650272477]
We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors.
This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI.
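The proposed ontology maps naturally onto a small typed record, sketched below; the performance level names follow the paper's published taxonomy, while the autonomy encoding is a simplification for illustration.

```python
# Toy encoding of the Levels-of-AGI taxonomy: performance x generality,
# plus an autonomy level (0 = tool ... 5 = fully autonomous agent).
from dataclasses import dataclass
from enum import Enum

class Performance(Enum):
    EMERGING = 1
    COMPETENT = 2
    EXPERT = 3
    VIRTUOSO = 4
    SUPERHUMAN = 5

class Generality(Enum):
    NARROW = "narrow"
    GENERAL = "general"

@dataclass(frozen=True)
class AGILevel:
    performance: Performance
    generality: Generality
    autonomy: int

chess_engine = AGILevel(Performance.SUPERHUMAN, Generality.NARROW, autonomy=1)
print(chess_engine)
```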
arXiv Detail & Related papers (2023-11-04T17:44:58Z) - Specific versus General Principles for Constitutional AI [27.08490948333949]
Constitutional AI offers an alternative, replacing human feedback with feedback conditioned only on a list of written principles.
We find this approach effectively prevents the expression of the targeted harmful behaviors.
A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors.
arXiv Detail & Related papers (2023-10-20T20:12:45Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - The Open-World Lottery Ticket Hypothesis for OOD Intent Classification [68.93357975024773]
We shed light on the fundamental cause of model overconfidence on out-of-distribution (OOD) inputs.
We also extend the Lottery Ticket Hypothesis to open-world scenarios.
arXiv Detail & Related papers (2022-10-13T14:58:35Z) - Combining Rules and Embeddings via Neuro-Symbolic AI for Knowledge Base Completion [59.093293389123424]
We show that not all rule-based Knowledge Base Completion models are the same.
We propose two distinct approaches that learn in one case: 1) a mixture of relations and the other 2) a mixture of paths.
When implemented on top of neuro-symbolic AI, which learns rules by extending Boolean logic to real-valued logic, the latter model achieves superior KBC accuracy, outperforming state-of-the-art rule-based KBC by 2-10% in terms of mean reciprocal rank.
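The phrase "extending Boolean logic to real-valued logic" refers to soft logical connectives over [0, 1] truth values; the sketch below uses a product t-norm as one common choice (the paper's exact logic may differ), so a conjunctive rule scores a relation path softly rather than all-or-nothing.

```python
# Soft logic over [0, 1] truth values, as used in neuro-symbolic rule
# learning (product t-norm shown; other t-norms are possible).

def f_and(a: float, b: float) -> float:
    return a * b          # soft AND: product t-norm

def f_or(a: float, b: float) -> float:
    return a + b - a * b  # soft OR: probabilistic sum

# Score of a two-hop path r1(x, y) AND r2(y, z):
print(f_and(0.9, 0.7))  # 0.63
```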
arXiv Detail & Related papers (2021-09-16T17:54:56Z) - Actionable Principles for Artificial Intelligence Policy: Three Pathways [0.0]
This paper proposes a novel framework for the development of Actionable Principles for AI.
The approach acknowledges the relevance of AI Ethics Principles and homes in on methodological elements to increase their practical implementability in policy processes.
arXiv Detail & Related papers (2021-02-24T16:57:35Z) - A Unified Taylor Framework for Revisiting Attribution Methods [49.03783992773811]
We propose a Taylor attribution framework and reformulate seven mainstream attribution methods into the framework.
We establish three principles for a good attribution in the Taylor attribution framework.
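The abstract does not reproduce the framework's equations. For orientation, the first-order Taylor expansion that such attribution frameworks generalize assigns feature x_i the i-th summand below; the notation is standard, not quoted from the paper.

```latex
% First-order Taylor attribution around a baseline \bar{x}; higher-order
% terms, which a unified framework also organizes, sit in the remainder R.
f(x) \approx f(\bar{x})
  + \sum_i \frac{\partial f}{\partial x_i}\Big|_{x=\bar{x}} (x_i - \bar{x}_i)
  + R
```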
arXiv Detail & Related papers (2020-08-21T22:07:06Z) - A general framework for defining and optimizing robustness [74.67016173858497]
We propose a rigorous and flexible framework for defining different types of robustness properties for classifiers.
Our concept is based on the postulate that the robustness of a classifier should be treated as a property independent of its accuracy.
We develop a very general robustness framework that is applicable to any type of classification model.
arXiv Detail & Related papers (2020-06-19T13:24:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.